Posts

Recently, I needed to run a process on an HPC cluster that required a secret, but I wanted to avoid storing my private key as a file on the cluster for security reasons. Instead, I looked for a way to decrypt an encrypted secret on the HPC while keeping my private key securely on my local machine. A great solution for this is GPG agent forwarding, which allows a remote machine to use a local GPG agent to decrypt secrets. This worked well when I could log into a single head node, but it broke when my HPC cluster implemented a load balancer that assigned me to a random node each time I logged in. The typical approach -- deleting the existing agent socket and reconnecting -- became unreliable. This post explains the problem in detail, walks through several failed solutions, and ultimately presents the working method I found to maintain secure GPG agent forwarding even when connecting through a randomized load balancer.

I use pandoc for converting markdown into PDF. Using LaTeX templates, I can make nice-looking PDF output that, in my opinion, looks as good as a professional publication. But one challenge I've struggled with is how to handle supplementary citations. Here, I outline the problem and describe a Lua filter I wrote to solve it.

Video conferencing is on the rise, but people are still learning how to use it. It's convenient in many ways, but also comes with some new challenges that make it harder to communicate than talking face-to-face. In this post I share some specific things I've learned about how to improve communication and reduce fatigue when using video technology.

Want to switch from legacy HPC environment modules to Linux containers that you can reuse across multiple servers? Here's how I did it.

What's next for the revolution in scientific publishing? The scientific publishing industry is facing challenges to pillars of tradition like peer review and journal subscription. In the face of these challenges, the traditional roles of scientific publishers are deteriorating. New tools are now putting scientific publication within reach of the masses, leading to an explosion of new journals and funding models. As the material cost to publish a scientific paper approaches zero, what stops a scientist from simply self-publishing research? Here, I present my experience building a basic self-publishing system and my perspective on the changing modes of scholarly publication.

A philosophical exploration of Darwin's 'survival of the fittest,' arguing that fitness isn't merely about leaving offspring, but represents the nonrandomness of success—challenging Waddington's claim that natural selection is a tautology.

How to manage large-scale genomic data storage by matching different storage systems to different data needs, using environment variables and storage classes to balance cost, performance, and features.

Many common bioinformatic tasks are embarrassingly parallel, and there are many ways to parallelize. The way you decide to parallelize will affect both performance and developer cost, and choosing the best way in practice depends on the specifics of the project. Implementing parallelism in a project usually requires significant developer resources and, in my opinion, these resources are not always well spent. This post distills my experience with parallelism in bioinformatics into a few useful concepts to guide how to employ it to the greatest benefit, in terms of both computational efficiency and respect for developer time.

How separating content from style—a fundamental principle in CSS—can improve scientific productivity through markdown, enabling portable, reusable content across papers, presentations, CVs, and web pages.

Why shell scripts are great for simple tasks but become unmaintainable for complex programs—and why Python is almost always the better choice for anything requiring functions or control structures.

How to configure sratoolkit to download SRA files to a shared filesystem instead of your home directory, and how to manage the large temporary .sra files that accumulate over time.

Using git's push-to-deploy feature to push code changes directly to a server's non-bare repository, automatically updating working files—perfect for deploying websites and maintaining read-only code repositories on remote servers.

A Python tool that creates standardized folder structures for reference genomes and their indexes, making NGS pipelines portable across environments and enabling easy support for custom genomes beyond iGenomes.

A personal essay on why open source software matters, not for cost savings but for data freedom and control, illustrated through the author's experience with proprietary finance software versus GnuCash.

How to use Tabix-indexed files with pysam to efficiently parallelize Python processing of large genomic datasets without duplicating data in memory across multiple processes.

A lightweight Python class that converts nested dictionaries into objects with attribute-style access (object.value instead of object['value']), perfect for working with YAML configuration files.
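
As a rough illustration of the idea, and not the implementation from the post itself, a minimal sketch of such a class might look like the following. The class body, the `config` variable, and the example keys are all hypothetical.

```python
class AttributeDict:
    """Wrap a (possibly nested) dict so keys can be read as attributes."""

    def __init__(self, entries):
        for key, value in entries.items():
            # Recursively wrap nested dictionaries so that chained
            # attribute access (a.b.c) works at every level.
            if isinstance(value, dict):
                value = AttributeDict(value)
            setattr(self, key, value)

    def __getitem__(self, key):
        # Keep dict-style access working alongside attribute access.
        return getattr(self, key)


# Hypothetical configuration, e.g. as it might be loaded from a YAML file.
config = AttributeDict({"genome": {"name": "hg38", "chrom_count": 24}, "threads": 4})
```

With this sketch, `config.genome.name` and `config["genome"]["name"]` refer to the same value. Keys that are not valid Python identifiers would only be reachable through the bracket form.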

Announcing LOLA, a Bioconductor package for genomic locus overlap enrichment analysis—like GSEA for genomic regions—with curated databases from ENCODE, Roadmap Epigenomics, and other public sources.

A guide to creating modular, web-shareable presentations using reveal.js and Jekyll, combining the power of HTML presentations with templating engines for reusable components and consistent styling across multiple presentations.

Simplify R package testing with Docker containers. Run R CMD check and BiocCheck without installing dependencies using a pre-configured Docker image with all necessary prerequisites.

Exploring text-based citation management with JabRef and leveraging DOI-based tools like CrossRef's citation formatter for a more stable, programmatic approach to managing academic references.

Quick guide on how to save Docker images as tar archives and transfer them between computers without using Docker Hub, using docker save and docker load commands.

Introducing a new Jekyll-based site for sharing code tutorials, software development, and data resources in computational biology and bioinformatics.

A fast bash script for estimating line counts in massive files by sampling the first lines and extrapolating based on file size, providing quick approximations without processing entire files.
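
The post's script is written in bash; the sketch below restates the same sampling idea in Python, purely as an illustration. The function name `estimate_line_count` and the `sample_lines` parameter are assumptions, not names taken from the original script.

```python
import os


def estimate_line_count(path, sample_lines=1000):
    """Estimate the number of lines in a file from its first few lines."""
    total_bytes = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as handle:
        for line in handle:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        # Empty file: nothing to extrapolate from.
        return 0
    # Extrapolate: total size divided by the mean bytes per sampled line.
    return round(total_bytes * sampled / sampled_bytes)
```

The estimate is only as good as the assumption that the first lines are representative of the rest of the file, so it tends to work best on files with fairly uniform line lengths.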

An introduction to chromosomal breakpoints—the locations where chromosomes break and reattach during recombination, and how abnormal reattachment can lead to translocations like the Philadelphia chromosome.

A clear explanation of allelic imbalance—when the two copies of a gene (maternal and paternal) are expressed at different levels, caused by factors like gene imprinting or cis-acting mutations.

Posts

Using local GPG private keys on HPC with load balancer

Handling supplemental citations with pandoc

Zooming in: making the most of video technology in an increasingly virtual world

From HPC environment modules to linux containers

The democratization of scientific publication

Success is nonrandom

The curse of enormity: Disk space and data-intensive biology

Parallelism in bioinformatics

On content and style: the beauty of markdown

Stop writing shell scripts!

The default path for downloading SRA data

Push-to-deploy: A nice git workflow for updating server code

RefGenie: Standardized reference genome folder structure for NGS pipelines

Open software and data freedom

Efficient parallel processing in python with Tabix and MapReduce

Python AttributeDict for object-style attributes on a dict

LOLA: Locus overlap analysis for enrichment of genomic ranges

Modular presentations with reveal.js and Jekyll

Running R CMD check in Docker

Managing citations

Saving docker images locally

Welcome

Word Count Line Estimate

What are chromosomal breakpoints?

What is allelic imbalance?