Recently, I needed to run a process on an HPC cluster that required a secret, but I wanted to avoid storing my private key as a file on the cluster for security reasons. Instead, I looked for a way to decrypt an encrypted secret on the HPC while keeping my private key securely on my local machine. A great solution for this is GPG agent forwarding, which allows a remote machine to use a local GPG agent to decrypt secrets. This worked well when I could log into a single head node, but it broke when my HPC cluster implemented a load balancer that assigned me to a random node each time I logged in. The typical approach -- deleting the existing agent socket and reconnecting -- became unreliable. This post explains the problem in detail, walks through several failed solutions, and ultimately presents the working method I found to maintain secure GPG agent forwarding even when connecting through a randomized load balancer.
I use pandoc to convert Markdown into PDF. Using LaTeX templates, I can produce nice-looking PDF output that, in my opinion, looks as good as a professional publication. But one challenge I've struggled with is how to handle supplementary citations. Here, I outline the problem and describe a Lua filter I wrote to solve it.
Video conferencing is on the rise, but people are still learning how to use it. It's convenient in many ways, but also comes with some new challenges that make it harder to communicate than talking face-to-face. In this post I share some specific things I've learned about how to improve communication and reduce fatigue when using video technology.
From HPC environment modules to Linux containers
Want to switch from legacy HPC environment modules to Linux containers that you can reuse on multiple servers? Here's how I did it.
What's next for the revolution in scientific publishing? The scientific publishing industry is facing challenges to pillars of tradition like peer review and journal subscription. In the face of these challenges, the traditional roles of scientific publishers are deteriorating. New tools are now putting scientific publication within reach of the masses, leading to an explosion of new journals and funding models. As the material cost to publish a scientific paper approaches zero, what stops a scientist from simply self-publishing research? Here, I present my experience building a basic self-publishing system and my perspective on the changing modes of scholarly publication.
Success is nonrandom
A philosophical exploration of Darwin's 'survival of the fittest,' arguing that fitness isn't merely about leaving offspring, but represents the nonrandomness of success—challenging Waddington's claim that natural selection is a tautology.
The curse of enormity: Disk space and data-intensive biology
How to manage large-scale genomic data storage by matching different storage systems to different data needs, using environment variables and storage classes to balance cost, performance, and features.
Parallelism in bioinformatics
Many common bioinformatic tasks are embarrassingly parallel, and there are many ways to parallelize. The way you decide to parallelize will affect both performance and developer cost, and choosing the best way in practice depends on the specifics of the project. Implementing parallelism in a project usually requires significant developer resources, and in my opinion these resources are not always well spent. This post distills my experience with parallelism in bioinformatics into a few useful concepts to guide how to employ it to the greatest benefit, in terms of both computational efficiency and respect for developer time.
On content and style: the beauty of markdown
How separating content from style—a fundamental principle in CSS—can improve scientific productivity through markdown, enabling portable, reusable content across papers, presentations, CVs, and web pages.
Stop writing shell scripts!
Why shell scripts are great for simple tasks but become unmaintainable for complex programs—and why Python is almost always the better choice for anything requiring functions or control structures.
The default path for downloading SRA data
How to configure sratoolkit to download SRA files to a shared filesystem instead of your home directory, and how to manage the large temporary .sra files that accumulate over time.
Push-to-deploy: A nice git workflow for updating server code
Using git's push-to-deploy feature to push code changes directly to a server's non-bare repository, automatically updating working files—perfect for deploying websites and maintaining read-only code repositories on remote servers.
RefGenie: Standardized reference genome folder structure for NGS pipelines
A Python tool that creates standardized folder structures for reference genomes and their indexes, making NGS pipelines portable across environments and enabling easy support for custom genomes beyond iGenomes.
Open software and data freedom
A personal essay on why open source software matters: not for cost savings, but for data freedom and control, illustrated through the author's experience with proprietary finance software versus GnuCash.
Efficient parallel processing in Python with Tabix and MapReduce
How to use Tabix-indexed files with pysam to efficiently parallelize Python processing of large genomic datasets without duplicating data in memory across multiple processes.
Python AttributeDict for object-style attributes on a dict
A lightweight Python class that converts nested dictionaries into objects with attribute-style access (object.value instead of object['value']), perfect for working with YAML configuration files.
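To make the idea concrete, here is a minimal sketch of what such a class might look like. It is not the code from the post: the class body, the `config` variable, and the example keys (`genome`, `paths`, `cores`) are all assumptions chosen for illustration, standing in for a dictionary loaded from a YAML file.

```python
class AttributeDict:
    """Expose the keys of a (possibly nested) dict as attributes.

    Illustrative sketch only; the class described in the post may differ.
    """

    def __init__(self, entries):
        for key, value in entries.items():
            # Recurse so that nested dicts also get attribute-style access.
            if isinstance(value, dict):
                value = AttributeDict(value)
            setattr(self, key, value)

    def __getitem__(self, key):
        # Keep dict-style lookup working alongside attribute access.
        return getattr(self, key)


# Hypothetical configuration, shaped like the result of parsing a YAML file.
config = AttributeDict(
    {"genome": {"name": "hg38", "paths": {"fasta": "hg38.fa"}}, "cores": 4}
)

print(config.genome.name)         # attribute-style access on a nested value
print(config.genome.paths.fasta)  # works at any depth
print(config["cores"])            # dict-style access still works
```

One limitation of this sketch is that dicts nested inside lists are left as plain dicts, and keys that are not valid Python identifiers can only be reached with the bracket syntax.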
Announcing LOLA, a Bioconductor package for genomic locus overlap enrichment analysis—like GSEA for genomic regions—with curated databases from ENCODE, Roadmap Epigenomics, and other public sources.
Modular presentations with reveal.js and Jekyll
A guide to creating modular, web-shareable presentations using reveal.js and Jekyll, combining the power of HTML presentations with templating engines for reusable components and consistent styling across multiple presentations.
Running R CMD check in Docker
Simplify R package testing with Docker containers. Run R CMD check and BiocCheck without installing dependencies using a pre-configured Docker image with all necessary prerequisites.
Managing citations
Exploring text-based citation management with JabRef and leveraging DOI-based tools like CrossRef's citation formatter for a more stable, programmatic approach to managing academic references.
Saving docker images locally
Quick guide on how to save Docker images as tar archives and transfer them between computers without using Docker Hub, using docker save and docker load commands.
Welcome
Introducing a new Jekyll-based site for sharing code tutorials, software development, and data resources in computational biology and bioinformatics.
Word Count Line Estimate
A fast bash script for estimating line counts in massive files by sampling the first lines and extrapolating based on file size, providing quick approximations without processing entire files.
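The sampling-and-extrapolating idea can be sketched in a few lines. The original is a bash script; the version below is a Python analogue written for illustration, and the function name `estimate_line_count` and its `sample_lines` parameter are assumptions rather than anything taken from the post.

```python
import os
import tempfile


def estimate_line_count(path, sample_lines=1000):
    """Estimate the number of lines in a file from its first lines.

    Reads at most `sample_lines` lines, measures their average length in
    bytes, and scales up by the total file size.
    """
    total_bytes = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as handle:
        for line in handle:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        return 0  # empty file
    if sampled_bytes >= total_bytes:
        return sampled  # the whole file was read, so the count is exact
    # Extrapolate: total size divided by the average sampled line length.
    return round(total_bytes * sampled / sampled_bytes)


# Small demonstration on a temporary file with uniform line lengths.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"ACGTACGT\n" * 5000)

print(estimate_line_count(tmp.name, sample_lines=100))
os.remove(tmp.name)
```

The estimate is only as good as the sample: if the first lines are unrepresentative (a long header, for example), the extrapolation will be off, which is the trade-off for not reading the whole file.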
What are chromosomal breakpoints?
An introduction to chromosomal breakpoints—the locations where chromosomes break and reattach during recombination, and how abnormal reattachment can lead to translocations like the Philadelphia chromosome.
What is allelic imbalance?
A clear explanation of allelic imbalance—when the two copies of a gene (maternal and paternal) are expressed at different levels, caused by factors like gene imprinting or cis-acting mutations.