How DNA Sequencing Works
• By Nathan Sheffield
An animated walkthrough of DNA sequencing from cheek swab to interpretation — covering cells, chromosomes, alleles, variants, and Punnett squares.
outreach
Presentations from conferences, workshops, and seminars
Overview of the GKS schema system, the GA4GH Schema Registry specification, key tension points, and a proposed static registry site
How the Model Context Protocol could enable AI assistants to query BEDbase's 130,000+ genomic region datasets
An exploration of identifiers in bioinformatics, covering what identifiers are, their scope from local to global, the pros and cons of authority-based identifiers versus content-derived identifiers, and a proposed priority order for choosing identifier types for maximum interoperability.
In this talk I will introduce a philosophical framework for communication, and then present five specific principles for applying that philosophy to data visualization to improve our communication.
Assays of the human epigenome capture the regulatory state of cells in health and disease. With tens of thousands of experiments completed, it is now feasible to train large-scale representation models. In this talk, I will share how we approach this challenge by first abstracting all epigenome data into genomic intervals. I will describe several types of neural embedding models we use to investigate what biological questions can be addressed with epigenome region embeddings. I will also describe our efforts to curate, standardize, annotate, and share genomic interval data to make it more broadly useful for machine learning and beyond.
Recruitment presentation for UVA's Computational Biology PhD program, covering coursework, timeline, resources, and what makes UVA's approach to computational biology unique.
Introduction to PEPkit, a suite of software that helps make sample metadata reusable.
This presentation provides an introduction to several computational methods for epigenetics analysis developed in the Sheffield lab, including LOLA (Locus Overlap Analysis), MIRA (Methylation-based Inference of Regulatory Activity), and COCOA (Coordinate Covariation Analysis).
This presentation covers the fundamentals of building computational analysis pipelines, from interactive computing to shell scripts and pipeline frameworks. It introduces the Sheffield Lab's modular pipeline engineering ecosystem, including tools like PEP, pypiper, looper, pipestat, refgenie, and others that enable flexible, interoperable bioinformatics workflows.
A talk about refgenie, sequence collections, and pangenomes.
This talk presents novel machine learning approaches for analysis of genomic regions, including methods for interval set comparison, vector representations of genomic region sets, and building interval universes.
This presentation explores methods for analyzing genomic intervals, from data structures for efficient comparison to novel machine learning approaches for region embeddings. It covers background on genomic intervals, LOLA for enrichment analysis, AIList and IGD for fast interval operations, RegionSet2vec for machine learning-based analysis, and BEDbase as a data repository.
Most GWAS associations have been found outside protein-coding genes, leading to growing interest in studying non-coding DNA to understand biology and treat disease. In this talk, I'll present my group's recent work on methods development for analysis of genomic intervals. Through algorithmic advances and novel machine learning applications, we have developed several approaches to study non-coding genomic intervals in new ways. I'll also present applications and results of these approaches to recent questions in cancer biology.
Introductions to epigenomics and interval analysis projects in the lab
Sequence collections extend the refget standard to provide unique identifiers for collections of sequences like reference genomes. This presentation covers the problem of genome variability across providers, the digest algorithm for creating deterministic identifiers, and the comparison API for assessing compatibility between different genome assemblies.
Presentation of several recent developments for the UNC Human Functional Genomics Group: the PEPATAC ATAC-seq pipeline, Sequence Collections, and RegionSet2Vec.
Philosophy of how to read a paper
An overview of several Sheffield lab projects for new algorithms, machine learning approaches, and databases to handle the growing corpus of genomic interval data (BED files).
Collected introductions to a variety of lab projects: LOLA, MIRA, COCOA, Refgenie, PEP, bulker, etc.
Two complementary approaches to working with Common Workflow Language (CWL) workflows: First, using Looper to scatter CWL workflows across tabular sample data, solving the problem of running workflows on CSV sample tables. Second, using Bulker to create portable, interactive computing environments from CWL tool definitions, enabling troubleshooting and interactive analysis with the same containerized tools used in workflows.
This lightning talk introduces refgenie, a reference genome asset management system, with a focus on how to use it from within R and Bioconductor workflows. Refgenie simplifies the process of obtaining and managing genome reference assets.
This lecture introduces the concept of computational pipelines for processing sequencing data, with specific focus on PEPATAC (PEP-compatible ATAC-seq pipeline). Covers pipeline basics, ATAC-seq specific considerations, and quality control metrics for ATAC-seq data analysis.
Introduction to chromatin accessibility and regulatory DNA. Covers the biological basis of open chromatin regions, various protocols for measuring accessibility (DNase-seq, FAIRE-seq, ATAC-seq), and basic computational analysis approaches. ATAC-seq uses Tn5 transposase to identify accessible chromatin regions genome-wide.
Three complementary tools that address common bioinformatics challenges: Refgenie manages reference genome resources, PEP (Portable Encapsulated Projects) standardizes sample metadata and project organization, and Bulker provides containerized computing environments. Together, these tools create a reproducible computational biology workflow.
This presentation introduces refgenie, a reference genome asset manager that provides both a command-line interface and server-based distribution of genome assets. It covers the components of refgenie, the refget API implementation, and how collection checksums provide genome provenance tracking.
BiocProject integrates Portable Encapsulated Projects (PEP) with Bioconductor, providing automated data loading, functions for interacting with project metadata, and PEP-annotated Bioconductor data objects.
COCOA (Coordinate Covariation Analysis) provides a method to understand continuous regulatory variation in high-dimensional epigenetic data by quantifying variation with PCA and annotating principal components with region sets.
DNA methylation is a covalent modification that plays critical roles in gene regulation, development, and disease. This lecture covers the biology of DNA methylation including its distribution patterns, heritability, and clinical relevance. The bisulfite sequencing section introduces techniques for measuring methylation including RRBS and WGBS, alignment challenges, and the time-scale balance between regulation and epigenetic memory.
A historical and philosophical introduction to epigenetics and epigenomics, exploring various definitions and perspectives on the field.
Comprehensive introduction to two major epigenomics technologies for measuring the regulatory genome: ATAC-seq for chromatin accessibility and Bisulfite-seq for DNA methylation. Covers the biological basis of each technology, experimental protocols, data analysis approaches, and advanced concepts including sample pooling and integrative multi-omics analysis. Includes introduction to LOLA for genomic region enrichment analysis.
I will outline a motivation and implementation of PEP, an open toolkit for organizing large-scale, sample-intensive biological research projects around a standardized structure. I will describe a series of modular tools that fit together to enable a scientist to process genome sequencing data from raw format through processed output. I will then describe our group's work on novel tools and algorithms for interpreting epigenetic signals via aggregating collections of DNA regulatory elements.
Discussion of the layers of collaboration in software development and introduction to git and GitHub
An introduction to PEP and a brief overview of PEPATAC, LOLA, and MIRA
A discussion of standardizing project metadata with Portable Encapsulated Projects and a presentation of PEPATAC, an ATAC-seq pipeline that reads standard PEP projects.
PEPATAC is an optimized ATAC-seq analysis pipeline that uses serial alignments to handle mitochondrial DNA contamination. This short presentation covers the modular design, prealignment strategy, and practical usage of PEPATAC for ATAC-seq analysis.
Presentation of the AIList (Augmented Interval List) algorithm for efficient genomic interval overlap queries, comparing its performance to existing approaches like R-trees and Nested Containment Lists.
LOLA (Locus Overlap Analysis) is a computational method for enrichment analysis of genomic ranges. It quantifies and visualizes the overlap between a query region set and a database of reference region sets, enabling biological interpretation of genomic intervals.
Short talk introducing the LOLA Bioconductor package.
MIRA is a Bioconductor package for analyzing DNA methylation data to infer regulatory activity. This presentation introduces the concept of region pooling and how MIRA uses bisulfite sequencing data to understand regulatory elements.
The Portable Encapsulated Project (PEP) specification provides a standardized way to organize sample metadata and project configuration for computational biology. This presentation introduces the PEP format, its motivation (the 'microwave syndrome' of incompatible sample metadata), and demonstrates the pepkit ecosystem including Python (peppy) and R (pepr) implementations. Covers project organization, sample modifiers, and schema validation.
In this presentation I will start by introducing two R packages for epigenome analysis. First, LOLA (Locus Overlap Analysis), which identifies enrichments of genomic ranges in public databases, linking newly generated epigenome data to large existing data sets. Second, MIRA (Methylation-based Inference of Regulatory Activity), which uses aggregate DNA methylation patterns to infer regulatory activity at collections of regions of interest. I will demonstrate the utility of these tools in understanding the rare pediatric cancer Ewing sarcoma. Finally, I will present pepkit, an open toolkit for organizing large-scale, sample-intensive biological research projects around a standardized structure called Portable Encapsulated Projects.
Pypiper is a user-friendly large-data informatics pipeline framework. Looper is a project manager that distributes jobs across a cluster for you. This presentation covers Jordan's story of pipeline development, from initial manual processing to scripting, error handling, scaling up to 500 samples, and finally to a comprehensive pipeline management system using Pypiper and Looper together.
Unclear scientific writing stems not from topic complexity but from four common problems: subjects and verbs too far apart, an overabundance of nominalizations, poor information flow (old-to-new), and excessive passive voice. This presentation teaches revision techniques including omitting needless words, putting actions in verbs, using summarizing nominalizations, placing verbs near subjects, and putting familiar information first. Includes numerous examples and practice exercises.
Intro and literature review of challenges and approaches to single-cell epigenomics analysis. Presentation for the 2017 American Foundation for AIDS Research Think Tank Meeting.
Exploring DNA methylation data in a rare pediatric cancer.
Brief introduction to research in the group
Docker provides version-controlled, reproducible computing environments by packaging applications with all dependencies. This presentation covers Docker basics, terminology, and Dockerfile construction. Three use cases demonstrate practical applications: containerizing R CMD check and BiocCheck for package development, packaging analyses as deployable applications with ENTRYPOINT, and maintaining personal/team R containers for working from anywhere. Includes live demonstrations.