Slides

An animated walkthrough of DNA sequencing from cheek swab to interpretation — covering cells, chromosomes, alleles, variants, and Punnett squares.

outreach

Overview of the GKS schema system, the GA4GH Schema Registry specification, key tension points, and a proposed static registry site

research

How the Model Context Protocol could enable AI assistants to query BEDbase's 130,000+ genomic region datasets

research

An exploration of identifiers in bioinformatics, covering what identifiers are, their scope from local to global, the pros and cons of authority-based identifiers versus content-derived identifiers, and a proposed priority order for choosing identifier types for maximum interoperability.

research short

In this talk I will introduce a philosophical framework for communication, and then show how 5 specific principles for applying that philosophy to data visualization can improve our communication.

skills

Assays of the human epigenome capture the regulatory state of cells in health and disease. With tens of thousands of experiments completed, it is now feasible to train large-scale representation models. In this talk, I will share how we approach this challenge by first abstracting all epigenome data into genomic intervals. I will describe several types of neural embedding models we use to investigate what biological questions can be addressed with epigenome region embeddings. I will also describe our efforts to curate, standardize, annotate, and share genomic interval data to make it more broadly useful for machine learning and beyond.

research

Recruitment presentation for UVA's Computational Biology PhD program, covering coursework, timeline, resources, and what makes UVA's approach to computational biology unique.

outreach

Introduction to PEPkit, a suite of software that helps make sample metadata reusable.

short

This presentation provides an introduction to several computational methods for epigenetics analysis developed in the Sheffield lab, including LOLA (Locus Overlap Analysis), MIRA (Methylation-based Inference of Regulatory Activity), and COCOA (Coordinate Covariation Analysis).

research

This presentation covers the fundamentals of building computational analysis pipelines, from interactive computing to shell scripts and pipeline frameworks. It introduces the Sheffield Lab's modular pipeline engineering ecosystem, including tools like PEP, pypiper, looper, pipestat, refgenie, and others that enable flexible, interoperable bioinformatics workflows.

research

A talk about refgenie, sequence collections, and pangenomes.

research

This talk presents novel machine learning approaches for analysis of genomic regions, including methods for interval set comparison, vector representations of genomic region sets, and building interval universes.

research

This presentation explores methods for analyzing genomic intervals, from data structures for efficient comparison to novel machine learning approaches for region embeddings. It covers background on genomic intervals, LOLA for enrichment analysis, AIList and IGD for fast interval operations, RegionSet2vec for machine learning-based analysis, and BEDbase as a data repository.

research

Most GWAS associations have been found outside protein-coding genes, leading to growing interest in studying non-coding DNA to understand biology and treat disease. In this talk, I'll present my group's recent work on methods development for analysis of genomic intervals. Through algorithmic advances and novel machine learning applications, we have developed several approaches to study non-coding genomic intervals in new ways. I'll also present applications and results of these approaches to recent questions in cancer biology.

research

Introductions to epigenomics and interval analysis projects in the lab

lecture

Sequence collections extend the refget standard to provide unique identifiers for collections of sequences like reference genomes. This presentation covers the problem of genome variability across providers, the digest algorithm for creating deterministic identifiers, and the comparison API for assessing compatibility between different genome assemblies.

short

Presentation of several recent developments for the UNC Human Functional Genomics Group: the PEPATAC ATAC-seq pipeline, Sequence Collections, and RegionSet2Vec.

research

Philosophy of how to read a paper

skills

An overview of several Sheffield lab projects for new algorithms, machine learning approaches, and databases to handle the growing corpus of genomic interval data (BED files).

research

Collected introductions to a variety of lab projects: LOLA, MIRA, COCOA, Refgenie, PEP, bulker, etc.

short

Two complementary approaches to working with Common Workflow Language (CWL) workflows: First, using Looper to scatter CWL workflows across tabular sample data, solving the problem of running workflows on CSV sample tables. Second, using Bulker to create portable, interactive computing environments from CWL tool definitions, enabling troubleshooting and interactive analysis with the same containerized tools used in workflows.

lecture

This lightning talk introduces refgenie, a reference genome asset management system, with a focus on how to use it from within R and Bioconductor workflows. Refgenie simplifies the process of obtaining and managing genome reference assets.

short

This lecture introduces the concept of computational pipelines for processing sequencing data, with a specific focus on PEPATAC (PEP-compatible ATAC-seq pipeline). Covers pipeline basics, ATAC-seq-specific considerations, and quality control metrics for ATAC-seq data analysis.

lecture

Introduction to chromatin accessibility and regulatory DNA. Covers the biological basis of open chromatin regions, various protocols for measuring accessibility (DNase-seq, FAIRE-seq, ATAC-seq), and basic computational analysis approaches. ATAC-seq uses Tn5 transposase to identify accessible chromatin regions genome-wide.

lecture

Three complementary tools that address common bioinformatics challenges: Refgenie manages reference genome resources, PEP (Portable Encapsulated Projects) standardizes sample metadata and project organization, and Bulker provides containerized computing environments. Together, these tools create a reproducible computational biology workflow.

lecture

This presentation introduces refgenie, a reference genome asset manager that provides both a command-line interface and server-based distribution of genome assets. It covers the components of refgenie, the refget API implementation, and how collection checksums provide genome provenance tracking.

short

BiocProject integrates Portable Encapsulated Projects (PEP) with Bioconductor, providing automated data loading, functions for interacting with project metadata, and PEP-annotated Bioconductor data objects.

short

COCOA (Coordinate Covariation Analysis) provides a method to understand continuous regulatory variation in high-dimensional epigenetic data by quantifying variation with PCA and annotating principal components with region sets.

short

DNA methylation is a covalent modification that plays critical roles in gene regulation, development, and disease. This lecture covers the biology of DNA methylation including its distribution patterns, heritability, and clinical relevance. The bisulfite sequencing section introduces techniques for measuring methylation including RRBS and WGBS, alignment challenges, and the time-scale balance between regulation and epigenetic memory.

lecture

A historical and philosophical introduction to epigenetics and epigenomics, exploring various definitions and perspectives on the field.

lecture

Comprehensive introduction to two major epigenomics technologies for measuring the regulatory genome: ATAC-seq for chromatin accessibility and Bisulfite-seq for DNA methylation. Covers the biological basis of each technology, experimental protocols, data analysis approaches, and advanced concepts including sample pooling and integrative multi-omics analysis. Includes introduction to LOLA for genomic region enrichment analysis.

lecture

I will outline the motivation for and implementation of PEP, an open toolkit for organizing large-scale, sample-intensive biological research projects around a standardized structure. I will describe a series of modular tools that fit together to enable a scientist to process genome sequencing data from raw format through processed output. I will then describe our group's work on novel tools and algorithms for interpreting epigenetic signals via aggregating collections of DNA regulatory elements.

lecture

Discussion of the layers of collaboration in software development and introduction to git and GitHub

skills

An introduction to PEP and a brief overview of PEPATAC, LOLA, and MIRA

lecture

A discussion of standardizing project metadata with Portable Encapsulated Projects and a presentation of PEPATAC, an ATAC-seq pipeline that reads standard PEP projects.

research

PEPATAC is an optimized ATAC-seq analysis pipeline that uses serial alignments to handle mitochondrial DNA contamination. This short presentation covers the modular design, prealignment strategy, and practical usage of PEPATAC for ATAC-seq analysis.

short

Presentation of the AIList (Augmented Interval List) algorithm for efficient genomic interval overlap queries, comparing its performance to existing approaches like R-trees and Nested Containment Lists.

short

LOLA (Locus Overlap Analysis) is a computational method for enrichment analysis of genomic ranges. It quantifies and visualizes the overlap between a query region set and a database of reference region sets, enabling biological interpretation of genomic intervals.

short

Short talk introducing the LOLA Bioconductor package.

short

MIRA is a Bioconductor package for analyzing DNA methylation data to infer regulatory activity. This presentation introduces the concept of region pooling and how MIRA uses bisulfite sequencing data to understand regulatory elements.

short

The Portable Encapsulated Project (PEP) specification provides a standardized way to organize sample metadata and project configuration for computational biology. This presentation introduces the PEP format, its motivation (the 'microwave syndrome' of incompatible sample metadata), and demonstrates the pepkit ecosystem including Python (peppy) and R (pepr) implementations. Covers project organization, sample modifiers, and schema validation.

lecture

In this presentation I will start by introducing two R packages for epigenome analysis. First, LOLA (Locus Overlap Analysis), which identifies enrichments of genomic ranges in public databases, linking newly generated epigenome data to large existing data sets. Second, MIRA (Methylation-based Inference of Regulatory Activity), which uses aggregate DNA methylation patterns to infer regulatory activity at collections of regions of interest. I will demonstrate the utility of these tools in understanding the rare pediatric cancer Ewing sarcoma. Finally, I will present pepkit, an open toolkit for organizing large-scale, sample-intensive biological research projects around a standardized structure called Portable Encapsulated Projects.

lecture

Pypiper is a user-friendly large-data informatics pipeline framework. Looper is a project manager that distributes jobs across a cluster for you. This presentation covers Jordan's story of pipeline development, from initial manual processing to scripting, error handling, scaling up to 500 samples, and finally to a comprehensive pipeline management system using Pypiper and Looper together.

lecture

Unclear scientific writing stems not from topic complexity but from four common problems: subjects and verbs too far apart, overabundance of nominalizations, poor information flow (old-to-new), and excessive passive voice. This presentation teaches revision techniques including omitting needless words, putting actions in verbs, using summarizing nominalizations, placing verbs near subjects, and putting familiar information first. Includes numerous examples and practice exercises.

skills

Intro and literature review of challenges and approaches to single-cell epigenomics analysis. Presentation for the 2017 American Foundation for AIDS Research Think Tank Meeting.

lecture

Exploring DNA methylation data in a rare pediatric cancer.

research

Brief introduction to research in the group

research

Docker provides version-controlled, reproducible computing environments by packaging applications with all dependencies. This presentation covers Docker basics, terminology, and Dockerfile construction. Three use cases demonstrate practical applications: containerizing R CMD check and BiocCheck for package development, packaging analyses as deployable applications with ENTRYPOINT, and maintaining personal/team R containers for working from anywhere. Includes live demonstrations.

lecture