Slides

Presentations from conferences, workshops, and seminars

How DNA Sequencing Works

By Nathan Sheffield

An animated walkthrough of DNA sequencing from cheek swab to interpretation — covering cells, chromosomes, alleles, variants, and Punnett squares.

outreach

Identifiers

By Nathan Sheffield

An exploration of identifiers in bioinformatics, covering what identifiers are, their scope from local to global, the pros and cons of authority-based identifiers versus content-derived identifiers, and a proposed priority order for choosing identifier types for maximum interoperability.

research, short

Representation learning of the epigenome

By Nathan Sheffield, PhD

Assays of the human epigenome capture the regulatory state of cells in health and disease. With tens of thousands of experiments completed, it is now feasible to train large-scale representation models. In this talk, I will share how we approach this challenge by first abstracting all epigenome data into genomic intervals. I will describe several types of neural embedding models we use to investigate what biological questions can be addressed with epigenome region embeddings. I will also describe our efforts to curate, standardize, annotate, and share genomic interval data to make it more broadly useful for machine learning and beyond.

research

PhD Program in Computational Biology at UVA

By Nathan Sheffield

Recruitment presentation for UVA's Computational Biology PhD program, covering coursework, timeline, resources, and what makes UVA's approach to computational biology unique.

outreach

Introduction to PEPkit

By Nathan Sheffield

Introduction to PEPkit, a suite of software that helps make sample metadata reusable.

short

Methods in computational epigenetics

By Nathan Sheffield

This presentation provides an introduction to several computational methods for epigenetics analysis developed in the Sheffield lab, including LOLA (Locus Overlap Analysis), MIRA (Methylation-based Inference of Regulatory Activity), and COCOA (Coordinate Covariation Analysis).

research

Building computational analysis pipelines

By Nathan Sheffield

This presentation covers the fundamentals of building computational analysis pipelines, from interactive computing to shell scripts and pipeline frameworks. It introduces the Sheffield Lab's modular pipeline engineering ecosystem, including tools like PEP, pypiper, looper, pipestat, refgenie, and others that enable flexible, interoperable bioinformatics workflows.

research

This talk presents novel machine learning approaches for analysis of genomic regions, including methods for interval set comparison, vector representations of genomic region sets, and building interval universes.

research

Making sense of genomic intervals

By Nathan Sheffield

This presentation explores methods for analyzing genomic intervals, from data structures for efficient comparison to novel machine learning approaches for region embeddings. It covers background on genomic intervals, LOLA for enrichment analysis, AIList and IGD for fast interval operations, RegionSet2vec for machine learning-based analysis, and BEDbase as a data repository.

research

Most GWAS associations have been found outside protein-coding genes, leading to growing interest in studying non-coding DNA to understand biology and treat disease. In this talk, I'll present my group's recent work on methods development for analysis of genomic intervals. Through algorithmic advances and novel machine learning applications, we have developed several approaches to study non-coding genomic intervals in new ways. I'll also present applications and results of these approaches to recent questions in cancer biology.

research

Epigenomes and intervals

By Nathan Sheffield

Introductions to epigenomics and to the lab's interval analysis projects.

lecture

Introduction to Sequence Collections

By Nathan Sheffield

Sequence collections extend the refget standard to provide unique identifiers for collections of sequences like reference genomes. This presentation covers the problem of genome variability across providers, the digest algorithm for creating deterministic identifiers, and the comparison API for assessing compatibility between different genome assemblies.

short

New computational tools for epigenome analysis

By Nathan Sheffield

Presentation of several recent developments for the UNC Human Functional Genomics Group: the PEPATAC ATAC-seq pipeline, Sequence Collections, and RegionSet2Vec.

research

Recent advances in genomic interval analysis

By Nathan Sheffield

An overview of several Sheffield lab projects for new algorithms, machine learning approaches, and databases to handle the growing corpus of genomic interval data (BED files).

research

Two complementary approaches to working with Common Workflow Language (CWL) workflows: First, using Looper to scatter CWL workflows across tabular sample data, solving the problem of running workflows on CSV sample tables. Second, using Bulker to create portable, interactive computing environments from CWL tool definitions, enabling troubleshooting and interactive analysis with the same containerized tools used in workflows.

lecture

Refgenie and bioconductor

By Nathan Sheffield

This lightning talk introduces refgenie, a reference genome asset management system, with a focus on how to use it from within R and Bioconductor workflows. Refgenie simplifies the process of obtaining and managing genome reference assets.

short

ATAC-seq analysis via pipeline

By Nathan Sheffield

This lecture introduces the concept of computational pipelines for processing sequencing data, with specific focus on PEPATAC (PEP-compatible ATAC-seq pipeline). Covers pipeline basics, ATAC-seq specific considerations, and quality control metrics for ATAC-seq data analysis.

lecture

Open chromatin and ATAC-seq

By Nathan Sheffield

Introduction to chromatin accessibility and regulatory DNA. Covers the biological basis of open chromatin regions, various protocols for measuring accessibility (DNase-seq, FAIRE-seq, ATAC-seq), and basic computational analysis approaches. ATAC-seq uses Tn5 transposase to identify accessible chromatin regions genome-wide.

lecture

Refgenie, PEP, and bulker

By Nathan Sheffield

Three complementary tools that address common bioinformatics challenges: Refgenie manages reference genome resources, PEP (Portable Encapsulated Projects) standardizes sample metadata and project organization, and Bulker provides containerized computing environments. Together, these tools create a reproducible computational biology workflow.

lecture

Refgenie and refget

By Nathan Sheffield

This presentation introduces refgenie, a reference genome asset manager that provides both a command-line interface and server-based distribution of genome assets. It covers the components of refgenie, the refget API implementation, and how collection checksums provide genome provenance tracking.

short

BiocProject: a Bioconductor-oriented project management package

By Nathan Sheffield, Michal Stolarczyk

BiocProject integrates Portable Encapsulated Projects (PEP) with Bioconductor, providing automated data loading, functions for interacting with project metadata, and PEP-annotated Bioconductor data objects.

short

Coordinate covariation analysis

By John Lawson, Nathan Sheffield

COCOA (Coordinate Covariation Analysis) provides a method to understand continuous regulatory variation in high-dimensional epigenetic data by quantifying variation with PCA and annotating principal components with region sets.

short

DNA methylation and bisulfite-seq

By Nathan Sheffield

DNA methylation is a covalent modification that plays critical roles in gene regulation, development, and disease. This lecture covers the biology of DNA methylation including its distribution patterns, heritability, and clinical relevance. The bisulfite sequencing section introduces techniques for measuring methylation including RRBS and WGBS, alignment challenges, and the time-scale balance between regulation and epigenetic memory.

lecture

What is epigenetics?

By Nathan Sheffield

A historical and philosophical introduction to epigenetics and epigenomics, exploring various definitions and perspectives on the field.

lecture

Epigenome tools: ATAC-seq and Bisulfite-seq

By Nathan Sheffield

Comprehensive introduction to two major epigenomics technologies for measuring the regulatory genome: ATAC-seq for chromatin accessibility and Bisulfite-seq for DNA methylation. Covers the biological basis of each technology, experimental protocols, data analysis approaches, and advanced concepts including sample pooling and integrative multi-omics analysis. Includes introduction to LOLA for genomic region enrichment analysis.

lecture

I will outline a motivation and implementation of PEP, an open toolkit for organizing large-scale, sample-intensive biological research projects around a standardized structure. I will describe a series of modular tools that fit together to enable a scientist to process genome sequencing data from raw format through processed output. I will then describe our group's work on novel tools and algorithms for interpreting epigenetic signals via aggregating collections of DNA regulatory elements.

lecture

Collaborative software development

By Nathan Sheffield

Discussion of the layers of collaboration in software development, with an introduction to git and GitHub.

skills

ATAC-seq pipeline processing

By Jason Smith, Nathan Sheffield

PEPATAC is an optimized ATAC-seq analysis pipeline that uses serial alignments to handle mitochondrial DNA contamination. This short presentation covers the modular design, prealignment strategy, and practical usage of PEPATAC for ATAC-seq analysis.

short

Augmented Interval List

By Nathan Sheffield, Jianglin Feng

Presentation of the AIList (Augmented Interval List) algorithm for efficient genomic interval overlap queries, comparing its performance to existing approaches like R-trees and Nested Containment Lists.

short

Locus overlap analysis

By Nathan Sheffield

LOLA (Locus Overlap Analysis) is a computational method for enrichment analysis of genomic ranges. It quantifies and visualizes the overlap between a query region set and a database of reference region sets, enabling biological interpretation of genomic intervals.

short

Locus overlap analysis

By Nathan Sheffield

Short talk introducing the LOLA Bioconductor package.

short

Methylation-based Inference of Regulatory Activity

By Nathan Sheffield

MIRA is a Bioconductor package for analyzing DNA methylation data to infer regulatory activity. This presentation introduces the concept of region pooling and how MIRA uses bisulfite sequencing data to understand regulatory elements.

short

The Portable Encapsulated Project (PEP) specification provides a standardized way to organize sample metadata and project configuration for computational biology. This presentation introduces the PEP format, its motivation (the 'microwave syndrome' of incompatible sample metadata), and demonstrates the pepkit ecosystem including Python (peppy) and R (pepr) implementations. Covers project organization, sample modifiers, and schema validation.

lecture

In this presentation I will start by introducing two R packages for epigenome analysis. The first is LOLA (Locus Overlap Analysis), which identifies enrichments of genomic ranges in public databases, linking newly generated epigenome data to large existing data sets. The second is MIRA (Methylation-based Inference of Regulatory Activity), which uses aggregate DNA methylation patterns to infer regulatory activity at collections of regions of interest. I will demonstrate the utility of these tools in understanding the rare pediatric cancer Ewing sarcoma. Finally, I will present pepkit, an open toolkit for organizing large-scale, sample-intensive biological research projects around a standardized structure called Portable Encapsulated Projects.

lecture

Pypiper is a user-friendly large-data informatics pipeline framework. Looper is a project manager that distributes jobs across a cluster for you. This presentation covers Jordan's story of pipeline development, from initial manual processing to scripting, error handling, scaling up to 500 samples, and finally to a comprehensive pipeline management system using Pypiper and Looper together.

lecture

Clarity: Strategies for revising scientific writing

By Nathan Sheffield

Unclear scientific writing stems not from topic complexity but from four common problems: subjects and verbs too far apart, an overabundance of nominalizations, poor old-to-new information flow, and excessive passive voice. This presentation teaches revision techniques including omitting needless words, putting actions in verbs, using summarizing nominalizations, placing verbs near subjects, and putting familiar information first. Includes numerous examples and practice exercises.

skills

Single cell epigenomics

By Nathan Sheffield

Intro and literature review of challenges and approaches to single-cell epigenomics analysis. Presentation for the 2017 American Foundation for AIDS Research Think Tank Meeting.

lecture

Docker provides version-controlled, reproducible computing environments by packaging applications with all dependencies. This presentation covers Docker basics, terminology, and Dockerfile construction. Three use cases demonstrate practical applications: containerizing R CMD check and BiocCheck for package development, packaging analyses as deployable applications with ENTRYPOINT, and maintaining personal/team R containers for working from anywhere. Includes live demonstrations.

lecture