Software, Data, & Standards

Open-source tools and standards for genomics and computational biology

We develop open-source software and standards organized into three major ecosystems, each with comprehensive documentation and tools for genomics research. Older tools that are no longer maintained or have been superseded by newer projects are archived and listed separately.

Major Ecosystems

Most of the software produced by the Databio Lab is organized into three major ecosystems. Each ecosystem is a suite of tools that work together to provide a comprehensive solution for its domain.

BEDbase

An ecosystem for analyzing, managing, and searching genomic interval (BED) data. BEDbase provides tools for enrichment analysis, overlap testing, statistical characterization, and machine learning on genomic regions. It includes high-performance data structures for interval search in both Rust and C, with bindings for Python and R, plus databases for storing and querying large collections of region sets.
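To make the idea of an interval-search data structure concrete, here is a minimal pure-Python sketch. It is not the implementation used by any BEDbase tool: it assumes half-open `(start, end)` intervals on a single chromosome, and the function names `build_index` and `find_overlaps` are hypothetical. It keeps the intervals sorted by start and stores a running maximum of end coordinates, so a query can stop scanning as soon as no earlier interval can reach it.

```python
from bisect import bisect_left


def build_index(intervals):
    """Sort half-open (start, end) intervals and record a running maximum end."""
    ivs = sorted(intervals)
    starts = [start for start, _ in ivs]
    max_end = []
    running = float("-inf")
    for _, end in ivs:
        running = max(running, end)
        max_end.append(running)
    return ivs, starts, max_end


def find_overlaps(index, q_start, q_end):
    """Return all indexed intervals that overlap the query [q_start, q_end)."""
    ivs, starts, max_end = index
    # Last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # max_end[i] is the largest end among intervals 0..i, so once it drops
    # to q_start or below, nothing further left can overlap the query.
    while i >= 0 and max_end[i] > q_start:
        if ivs[i][1] > q_start:
            hits.append(ivs[i])
        i -= 1
    hits.reverse()
    return hits


regions = [(1, 5), (3, 8), (10, 12), (11, 20)]
index = build_index(regions)
print(find_overlaps(index, 4, 11))
```

When one long interval contains many short ones, the running maximum stays high and the backward scan degrades toward a linear pass. Handling that case efficiently is the kind of problem the optimized Rust and C structures are meant to address.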

PEPkit

A comprehensive suite of tools for managing sample metadata and running bioinformatics pipelines. PEPkit defines the Portable Encapsulated Projects (PEP) standard, providing a universal format for sample annotation tables that works across tools, pipelines, and computing environments. It includes tools for reading, validating, and sharing sample metadata, submitting pipelines, and fetching data from public repositories like GEO and SRA.
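As a rough illustration of what a sample annotation table looks like in practice, the sketch below reads a small CSV table using only the Python standard library. It does not use the PEPkit API. The table contents and the helper `load_samples` are hypothetical, and the assumption that each row is identified by a `sample_name` column should be checked against the PEP specification.

```python
import csv
import io

# Hypothetical sample table; in a real project this would be a CSV file
# on disk rather than an inline string.
SAMPLE_TABLE = """sample_name,protocol,file
frog_1,ATAC-seq,data/frog_1.fastq.gz
frog_2,ATAC-seq,data/frog_2.fastq.gz
"""


def load_samples(text):
    """Parse a CSV sample table into a list of dicts, one per sample row."""
    reader = csv.DictReader(io.StringIO(text))
    # Assumption: rows are identified by a 'sample_name' column.
    if "sample_name" not in (reader.fieldnames or []):
        raise ValueError("sample table needs a 'sample_name' column")
    return list(reader)


samples = load_samples(SAMPLE_TABLE)
for sample in samples:
    print(sample["sample_name"], sample["protocol"], sample["file"])
```

Because every sample is a flat record of named attributes, any tool that reads this table can process the project without custom parsing code.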

Refgenie

A reference genome asset management system that organizes, retrieves, and shares genome resources. Refgenie provides a command-line and Python interface to download pre-built reference genome assets like aligner indexes, and can build custom assets for any genome assembly. It includes a server for hosting and distributing assets, plus tools implementing the GA4GH refget standard for accessing reference sequences by unique identifiers.
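The identifiers in question are derived from the sequence content itself. The sketch below shows how a GA4GH-style `sha512t24u` digest can be computed with the Python standard library: take the SHA-512 hash of the sequence, keep the first 24 bytes, and encode them as URL-safe base64. It is a minimal illustration, not the refget package's API. The function names are hypothetical, and uppercasing the sequence before hashing is a normalization assumption to verify against the refget specification.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, URL-safe base64 encoded."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_id(sequence: str) -> str:
    """Build an 'SQ.'-prefixed identifier from a sequence string.

    Assumption: the sequence is normalized to uppercase before hashing.
    """
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


print(sequence_id("ACGT"))
```

Because the identifier depends only on the sequence, two servers holding the same sequence report the same identifier without any central naming authority.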

All Projects

Name Description Ecosystem Downloads
A Python package that provides an API for handling standardized project and sample metadata. If you define your project in Portable Encapsulated Project (PEP) format, you can use the peppy package to instantiate an in-memory representation of your project and sample metadata. You can then use peppy for interactive analysis, or to develop Python tools so you don't have to handle sample processing. Peppy is useful to tool developers and data analysts who want a standard way of representing sample-intensive research project metadata. PEPkit
Refgenie is a full-service reference genome manager that organizes storage, access, and transfer of reference genomes. It provides command-line and Python interfaces to download pre-built reference genome "assets" like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Refgenie
AIList
Augmented Interval List is a data structure with the fastest currently known algorithm for searching for genomic overlaps between two sets of genomic ranges with high containment. BEDbase
BiocProject
An R package for integrating PEPs with other data structures. PEPkit
A multi-container computing environment manager. A bulker environment consists of an individual container image for each command. Bulker environments are portable, interactive, and independent of any specific workflow. Bulker simplifies both interactive analysis and workflow development by building drop-in replacements to command-line tools that act like native tools, but run in containers. Think of bulker as a lightweight wrapper for docker/singularity to simplify sharing complete, containerized environments. PEPkit
Coordinate Covariation Analysis. Identifying sources of intersample variation using PCA and region sets for genomic coordinate-based data. Standalone
dnameth
Pipelines for Whole Genome and Reduced Representation Bisulfite-seq. Standalone
Calculate and plot distributions of genomic ranges. BEDbase
A command-line tool that downloads sequencing data and metadata from GEO and SRA and creates standard sample metadata tables in PEP format. PEPkit
Genomic Locus Overlap Analysis. An R package for enrichment analysis of genomic ranges. Given an input set of genomic regions and a database of genomic region sets, LOLA will compute overlaps and return a list of database region sets ranked by similarity. BEDbase
A pipeline submitting engine. Looper deploys any command-line pipeline for each sample in a project organized in standard sample metadata format (PEP). You can think of looper as providing a single user interface to running, summarizing, monitoring, and otherwise managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used. PEPkit
A Bioconductor package for inferring regulatory activity from DNA methylation data. Standalone
PEPATAC is an ATAC-seq pipeline. It trims adapters, maps reads, calls peaks, and creates bigWig tracks, TSS enrichment files, and other outputs. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches. Standalone
PEPPRO is a pipeline designed to process PRO-seq data. It is optimized on unique features of PRO-seq to be fast and accurate. It performs adapter removal (including UMIs of variable length), read deduplication, trimming, and mapping, and it generates signal tracks (bigWig) for the plus and minus strands using read counts that are either scaled (based on mappability information) or unscaled. Standalone
An R package for interfacing with sample metadata in PEP format. PEPkit
Pypiper is a development-oriented pipeline framework. It is a Python package that helps you write robust pipelines directly in Python, handling mundane tasks like restartability, monitoring for time and memory use, monitoring job status, copious log output, robust error handling, easy debugging tools, and guaranteed file output integrity. PEPkit
The refget package provides a Python interface to both remote and local use of the refget protocol. Refgenie
geniml
Genomic interval machine learning toolkit. Provides methods for building and applying machine learning models on BED files. BEDbase
Performance-critical tools to manipulate, analyze, and process genomic interval data. Provides Rust crates, a CLI, and Python/R bindings for genomic interval operations. BEDbase
A web API and database for biological sample metadata. PEPkit
Pipestat PEPkit
Yacman is a YAML configuration manager. It provides convenience functions for Python developers dealing with YAML configuration files. PEPkit
PEP specification
SPEC
A formal specification for portable encapsulated projects, defining a standardized metadata format for sample-intensive research. PEPkit
Refget specification
SPEC
A GA4GH standard for accessing reference sequences and collections by unique checksum identifiers. Refgenie