<style> #title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow"> # Representation learning of the epigenome Nathan Sheffield, PhD <div class="bullet"> <img src="/images/external/uva_dgs_logo.svg" height="85"> <img src="/images/logo/logo_databio_long.svg" height="65"> </div> <span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span> </section> --- <!-- .slide: data-background="/images/presentations/bg.svg.png" data-transition-speed="slow" --> ### Outline <style> .previewblock { float: left; width: 20px; height: 45px; margin: 0; border: none; white-space: nowrap; box-sizing: border-box; } .questionblock { float: left; width: 100%; margin: 5px 0; border: 1px solid rgba(255, 255, 255, .2); } </style> <div class="previewblock" style="width:10%">Lab intro</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:60%">Genomic interval embeddings</div> <div class="previewblock" style="width:20%"></div> <div class="previewblock" style="width:10%">|</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:60%">|</div> <div class="previewblock" style="width:20%"></div> <br clear="all"> <div class="previewblock" style="width:10%; background:#883388">10%</div> <div class="previewblock" style="width:10%; background:#333388">10%</div> <div class="previewblock" style="width:60%; background:#338833">60%</div> <div class="previewblock" style="width:20%; background:#338888">20%</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:10%">|</div> <div class="previewblock" style="width:60%"></div> <div class="previewblock" style="width:20%">|</div> <br clear="all"> <div class="previewblock" style="width:10%"></div> <div class="previewblock" 
style="width:10%">Background</div> <div class="previewblock" style="width:60%"></div> <div class="previewblock" style="width:20%">BEDbase</div> <div class="questionblock" style="background:#222; color:#eee; font-size: 0.6em; margin-top: 35px">◁ Questions ▷</div> --- ## Full-stack bioinformatics <img src="/_modules/full-stack-bioinformatics-teaser/pyramid_blue.svg" width="500"> --- <div class="col2"> <img src="/_modules/refgenie-teaser/refgenie_logo_light.svg" style="padding-top:25px; padding-bottom:25px; width: 350px"> <br> Reference genome manager <div class="small"> <a href="http://refgenie.databio.org">http://refgenie.databio.org</a><br> </div> <span class="small bullet"><img src="/_modules/refgenie-teaser/paper.svg" height="25" class="bullet"><a href="https://www.biorxiv.org/content/10.1093/gigascience/giz149">Stolarczyk et al. (2020).</a> <i>GigaScience</i>.</span><br/> <span class="small bullet"><img src="/_modules/refgenie-teaser/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021).</a> <i>NAR Genomics and Bioinformatics</i>.</span><br/> </div> <div class="col2"> <img src="/_modules/refgenie-teaser/refgenie_interfaces.svg" style="background:white" width="550"> <pre><code>refgenie pull hg38/bowtie2_index</code></pre> </div> --- ## GA4GH refget sequence collections standard <img src="/slides/representation-learning-epigenome/SeqCol-graphic.png" width="800"><br/> --- <span class="small bullet"> <img src="/_modules/pephub-teaser/pephub_logo_white.svg" style="width: 250px" class="bullet"> Sample metadata API</span> <img src="/_modules/pephub-teaser/pephub-center.svg" width="750"> <div class="small"> <a href="https://pephub.databio.org">https://pephub.databio.org</a><br> </div> --- <img src="/_modules/bio-motivation-genomic-intervals/genomic-intervals.svg" width="100%"><br/> <span class="fragment">A universal language of computational genomics</span> --- ## There are many sources of genomic 
interval data  --- ## Genomic interval data is growing  --- <img src="/slides/representation-learning-epigenome/self-organizing-map.svg" width="100%"> <br><span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://dx.doi.org/10.1101/gr.152140.112">Sheffield et al. (2013). <i>Genome Research</i>.</a></span> --- ## Collections of similar genomic intervals ## can be very powerful --- <img src="/_modules/lola-intro/LOLA-logo-white.svg" width="275" style="padding-top:25px; padding-bottom:25px"> <br> ### Locus Overlap Analysis <div class="small"> <a href="http://code.databio.org/LOLA/">http://code.databio.org/LOLA/</a><br> </div> <span class="small bullet"><a href="https://doi.org/10.1093/bioinformatics/btv612"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet">Sheffield and Bock (2016). <i>Bioinformatics</i>.</a></span><br/> <span class="small bullet"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nar/gky464">Nagraj, Magee, and Sheffield (2018). <i>Nucleic Acids Research</i>.</a></span> --- <img src="/_modules/lola-intro/short-01-challenge.svg" /> --- <img src="/_modules/lola-intro/08-lola2.svg" /> --- <img src="/_modules/lola-intro/09-test.svg" /> --- ### LOLA requires comparing sets of intervals <img src="/slides/representation-learning-epigenome/subject-query.svg"> <img src="/slides/representation-learning-epigenome/geo-count-total-v2.svg" height="250"> <br>Can we improve the efficiency to enable faster,<br>larger-scale analysis? 
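--- ### Aside: why overlap counting gets expensive A minimal sketch with hypothetical toy intervals (this is an illustration, not AIList's or IGD's algorithm): brute-force counting compares every query interval against every database interval, so work grows as n × m — untenable across hundreds of thousands of BED files.

```python
# Brute-force overlap counting: O(n * m) comparisons.
# Two half-open intervals (start, end) overlap iff each starts before the other ends.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def count_overlaps(query, database):
    return sum(1 for q in query for d in database if overlaps(q, d))

query = [(10, 20), (30, 40)]
database = [(15, 25), (35, 45), (50, 60)]
print(count_overlaps(query, database))  # 2
```

Structures like AIList and IGD replace the inner scan with indexed representations, so each query inspects only candidate intervals.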
--- <div> <h3>Augmented Interval List (AIList)</h3> A novel data structure for efficiently computing overlaps <br> across genomic interval data.<br> <div class="small"> <a href="http://ailist.databio.org/">http://ailist.databio.org/</a><br> </div> </div> <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btz407">Feng et al. (2020). <i>Bioinformatics</i>.</a></span> <br> <div> <h3>Integrated Genome Database (IGD)</h3> A high-performance search engine <br> for large-scale genomic interval datasets. <div class="small"> <a href="https://github.com/databio/IGD">https://github.com/databio/IGD</a><br> </div> </div> <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btaa1062">Feng and Sheffield (2021). <i>Bioinformatics</i>.</a></span> --- ## But is counting overlaps ## really the right approach? --- <img src="/_modules/bio-motivation-region-embeddings/interval-similarity-overlap.svg"> What does it mean for two region sets to be similar?<br> What am I really looking for in a region set query? 
<div class="fragment">Overlaps makes some sense...but what about: <br> degree of overlap?</div> <div class="fragment">biological function of each region?</div> <div class="fragment">weighting of specific regions?</div> <div class="fragment">relationships among regions?</div> <div class="fragment">background context?</div> --- ### The bag-of-words model for text classification <img src="/_modules/bio-motivation-region-embeddings/bag-of-words.svg" width="100%"> <span class="fragment">What about a bag-of-intervals model for genomic intervals?</span> --- ### The bag-of-intervals model for genomic intervals <img src="/_modules/bio-motivation-region-embeddings/interval-universe.svg" width="680"> <ul class="fragment">Advantages <li>Vector representation of a region set</li> <li>Similarity metrics among vectors</li> <li>Lower space and time complexity than interval sets</li> </ul> --- ### Limitations of the bag of words vector approach <ul> <li>Sparsity</li> <li>Curse of dimensionality</li> <li>Space and time complexity are still an issue</li> <li>No concept of relationships among words</li> <code><pre style="color:#AAAAFF; text-align:center"> hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0] motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0] </pre></code> </ul> --- <div> <h3>Region-set 2 Vec</h3> Embeddings of genomic region sets <br> in lower dimensions. <div class="small"> <a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br> </div> </div> <span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). 
<i>Bioinformatics</i>.</a></span> --- ### Word embeddings <img src="/_modules/regionset2vec/word-vector-space-similar-words.jpg" width="680"> <div class="small">http://suriyadeepan.github.io</div> --- ### Word2vec model <img src="/_modules/regionset2vec/mikolov2013_fig1.png" width="680"> <br><span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://arxiv.org/abs/1301.3781">Mikolov et al. (2013). <i>arXiv:1301.3781v3</i>.</a></span> --- ### Word context <img src="/_modules/regionset2vec/word-context.png" width="640" style="background:white"> <div class="well"> You shall know a word by the company it keeps. (Firth 1957)<br> Words that occur in similar contexts tend to have similar meanings. </div> <div class="small">Image credit: Shubham Agarwal</div> --- ### Genomic context <div class="well"> A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function. </div> --- <img src="/_modules/regionset2vec/complexity-scale.svg" width="1040"> --- ### Genomic Interval Embeddings <img src="/_modules/regionset2vec/method_detail_v3.svg" width="1040" style="background:white"> --- ### Evaluation We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.<br> Do relationships among vectors reflect biology? 
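One way to ask that, sketched with hypothetical toy vectors: cosine similarity between embeddings should be higher for biologically similar region sets.

```python
# Cosine similarity between embedding vectors (hypothetical toy 3-D examples).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

tcell_peaks = [0.9, 0.1, 0.0]
bcell_peaks = [0.8, 0.2, 0.1]
liver_peaks = [0.0, 0.1, 0.9]
print(cosine(tcell_peaks, bcell_peaks) > cosine(tcell_peaks, liver_peaks))  # True
```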
<div class="fragment"> <img src="/_modules/regionset2vec/method_overview_v3.svg" width="1040" style="background:white"> </div> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result.svg" width="740"> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result-2.svg" width="740"> --- ### Evaluation 1: Classification performance <img src="/_modules/regionset2vec/umap_classification.svg" width="740" style="background:white"> <div class="fragment"> <img src="/_modules/regionset2vec/umap_classification2.svg" width="740" style="background:white"> </div> --- ### Conclusion <ul> <li>Regionset2vec adapts word2vec to learn genomic region embeddings</li> <li>Regionset2vec embeddings capture biological information</li> <li>NLP approaches can be adapted for applications in genomic interval analysis</li> </ul> --- ## Region embeddings are highly tunable - Universe selection - Tokenization - Model architecture - Extent of training - Context window size - Learning rate ## This opens lots of new questions <br><span class="fragment">Can we do this with single-cell data?</span> <br><span class="fragment">How can we evaluate region embeddings?</span> <br><span class="fragment">How can we make the best embeddings?</span> <br><span class="fragment">How do we choose a good universe?</span> <br><span class="fragment">Are there better model architectures or training tasks?</span> <br><span class="fragment">Can we increase or change the data source?</span> --- Can we do this with single-cell data? 
--- <div> <h3>scEmbed</h3> Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings <div class="small"> <a href="https://github.com/databio/geniml">https://github.com/databio/geniml</a><br> </div> </div> <span class="small bullet"><img src="/_modules/scembed/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqae073">LeRoy et al. (2024). <i>NAR Genomics and Bioinformatics</i>.</a></span> --- ### scEmbed training pipeline <img src="/_modules/scembed/overview.svg" width="800"> --- ### Region embeddings can produce cell embeddings <img src="/_modules/scembed/region-to-cell-embeddings.svg" width="800"> --- ### scEmbed benchmarks competitively <img src="/_modules/scembed/benchmark.svg" width="800"> --- ### scEmbed is robust to severe data loss <img src="/_modules/scembed/benchmark-dropout.svg" width="800"> --- ### scEmbed uses a unique two-step approach <img src="/_modules/scembed/two-step.svg" width="800"> --- ### Projection embeds new data with a pre-trained model <img src="/_modules/scembed/projection-concept.svg" width="800"> --- ### Projected embeddings look similar <br>to trained embeddings <img src="/_modules/scembed/projection-example.svg" width="800"> --- ### Projection enables multiple data flows <img src="/_modules/scembed/projection-flows.svg" width="800"> --- ### Embedding-projection gives nice cell clusters <img src="/_modules/scembed/projection-leucken.svg" width="800"> --- ### EV-projection places new data <br>in a pre-trained latent space <img src="/_modules/scembed/projection-leucken2.svg" width="800"> --- ### EV-projection allows for high-accuracy cell annotation <img src="/_modules/scembed/projection-annotation.svg" width="800"> --- ## Where we're going with scEmbed <div class="col2"> <h3>Atlas-scale trained model</h3> <img src="/slides/representation-learning-epigenome/atlas_umap.png" width="550"> </div> <div class="col2 fragment"> <h3>Cross-cell-type accessibility</h3> <img
src="/slides/representation-learning-epigenome/atlas_cell_type_accessibility.png" width="550"> </div> --- How can we evaluate region embeddings? --- ## Methods for evaluating unsupervised <br/> vector representations of genomic regions - Cluster Tendancy Score (CTS) - Reconstruction Score (RCS) - Neighborhood preserving score (NPS) - Genome distance scaling score (GDSS) <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqae086">Zheng et al. (2024). <i>NAR Genomics and Bioinformatics</i>.</a></span> --- ## Cluster tendancy score (CTS) <div class="well"> Assumption:<br> Embeddings that form clusters are more likely to be useful than embeddings that are diffuse. </div> <img src="/slides/representation-learning-epigenome/cluster-tendancy-score.png" width="800"> --- ## Reconstruction score (RCS) <div class="well"> Measures how well the embeddings can be used to reconstruct the full-dimensional input data. 
</div> <img src="/slides/representation-learning-epigenome/reconstruction-score.png" width="800"> --- ## Genome distance scaling score (GDSS) <div class="well"> Assumption:<br> Proximity in linear genome space <br>increases probability of similar function </div> <div class="col2"> <img src="/slides/representation-learning-epigenome/grouped-average-distances-schematic.svg" width="400"> </div> <div class="col2 fragment"> Results <img src="/slides/representation-learning-epigenome/grouped-average-distances-results.svg" width="400"> </div> --- ## Neighborhood preserving score (NPS) <div class="well"> Assumption:<br> Proximity in linear genome space <br>increases probability of similar function </div> <div class="col2"> <img src="/slides/representation-learning-epigenome/neighborhood-preserving-schematic.svg" width="400"> </div> <div class="col2 fragment"> Results <img src="/slides/representation-learning-epigenome/neighborhood-preserving-results.svg" width="400"> </div> --- ## Embedding eval results <img src="/slides/representation-learning-epigenome/eval-results-1.jpeg" width="800"><br/> --- ## Embedding eval results <img src="/slides/representation-learning-epigenome/eval-results-2.jpeg" width="800"><br/> --- How do we choose a good universe? --- ## Methods for constructing and evaluating <br/> consensus genomic interval sets <img src="/slides/representation-learning-epigenome/abstract.jpeg" width="800"><br/> <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nar/gkae685">Rymuza et al. (2024). <i>Nucleic Acids Research</i>.</a></span> --- Are there better model architectures or training tasks? 
--- Joint representation learning <br> of genomic interval sets and metadata <img src="/slides/representation-learning-epigenome/scenario-1.svg" width="800"> <div class="fragment"> <img src="/slides/representation-learning-epigenome/method-overview.svg" width="800"> </div> --- <img src="/slides/representation-learning-epigenome/starspace-embedding-distances.svg" width="100%"> --- ## Atacformer model <img src="/slides/representation-learning-epigenome/atacformer-training.png" height="550"> --- ## Atacformer preliminary results <img src="/slides/representation-learning-epigenome/atacformer-preliminary.png" height="600"> --- ## Neural networks are like pigs. ## If you want them to be useful, ## you have to feed them a lot. --- Can we increase or change the data source? --- <style> .wrap { width: 1550px; height: 1000px; overflow: hidden; } iframe { width: 97% !important; height: 85% !important; -webkit-transform: scale(0.65); transform: scale(0.65); -webkit-transform-origin: 0 0; transform-origin: 0 0; } </style> <div> <h3>BEDbase</h3> A high-performance server and API <br> for genomic interval data. <br><span class="small bullet"><img src="/_modules/bedbase/web.svg" height="30" class="bullet"><a href="https://bedbase.org">bedbase.org</a></span> </div> --- <div> <img src="/_modules/bedbase/bedbase_logo.svg" width="275"> <br> A high-performance server and API <br> for genomic interval data. 
<div class="small"> <a href="http://bedbase.org">http://bedbase.org</a><br> </div> </div> <ul> <li>Data spans projects (*e.g.* all data on GEO; 40,000 accessions, 100,000+ BED files)</li> <li>Programmatic API for metadata, statistics, and data chunks</li> <li>Human browsing of statistical and biological attributes</li> <li>Aware of similarities among BED files</li> <li>Human-friendly search</li> <li>Shaped into 'non-redundant' sets for analysis</li> </ul> --- ### BEDbase goals - Human browsing of statistical and biological attributes - Human-friendly, *intelligent* search - Programmatic API for metadata, statistics, and data chunks - Integrative analytical results - Data spans projects (all data on GEO) --- ### BEDbase architecture <img src="/_modules/bedbase/architecture_v2.svg" width="1040" style="background:white"> --- <a href="https://doi.org/10.1038/s41597-022-01619-5"><img src="/_modules/bedbase/sheffield2022.png" style="background:white"></a> <br>BEDbase is a microservice for data interoperability,<br> not another cloud platform --- <ol> <li style="color:yellow">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- ### Human browsing of BED file splash pages <span class="small"><a href="https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1">https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1</a></span> <div class="wrap"> <iframe src="https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen ></iframe> </div> --- ### Reference genome compatibility <img src="/_modules/bedbase/ref_genome_scores_1.svg" height="530" style="background:white"> --- ### Reference genome compatibility <img src="/_modules/bedbase/ref_genome_scores.svg" height="530" style="background:white"> 
--- ### Reference genome compatibility <div class="col2"> <img src="/_modules/bedbase/ref_genome_scores.svg" height="400" style="background:white"> </div> <div class="col2"> <img src="/_modules/bedbase/bedbase-ref-genome-compat.png" height="530" style="background:white"> </div> --- ### BEDsets allow comparison of BED files <span class="small"><a href="https://bedbase.org/bedset/gse246900">https://bedbase.org/bedset/gse246900</a></span> <div class="wrap"> <iframe src="https://bedbase.org/bedset/gse246900" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen ></iframe> </div> --- ### Human-friendly search <span class="small">Co-embedded metadata and region sets: <a href="https://bedbase.org/search?q=brain">https://bedbase.org/search?q=brain</a></span> <div class="wrap"> <iframe src="https://bedbase.org/search?q=brain" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div> --- ### Human-friendly search <img src="/_modules/bedbase/text_search_1.svg" height="530" style="background:white"> --- ### Human-friendly search <img src="/_modules/bedbase/text_search.svg" width="1040" style="background:white"> --- ### Search by BED file <span class="small">Co-embedded metadata and region sets: <a href="https://bedbase.org/search?view=b2b">https://bedbase.org/search?view=b2b</a></span> <div class="wrap"> <iframe src="https://bedbase.org/search?view=b2b" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div> --- <ol> <li style="">Web interface (front-end)</li> <li style="color:yellow">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- # BEDbase R client <img src="/_modules/bedbase/bedbaser.png" width="100%"/> --- ``` library("bedbaser") bedbase <- BEDbase(tempdir()) bb_to_granges(bedbase, 
"ab446df9a043222067863cfd536ee8e0") ``` ``` GRanges object with 37 ranges and 5 metadata columns: seqnames ranges strand | name score [Rle] [IRanges] [Rle] | [character] [integer] [1] chr17 38083473-38083801 * | O-8A-H3K27ac_peak_11.. 21 [2] chr17 38108871-38110066 * | O-8A-H3K27ac_peak_11.. 25 [3] chr17 38137142-38137795 * | O-8A-H3K27ac_peak_11.. 33 [4] chr17 38210828-38211063 * | O-8A-H3K27ac_peak_11.. 17 [5] chr17 38218030-38220186 * | O-8A-H3K27ac_peak_11.. 19 ... ... ... ... . ... ... [33] chr17 38603620-38604355 * | O-8A-H3K27ac_peak_11.. 15 [34] chr17 38647047-38648053 * | O-8A-H3K27ac_peak_11.. 28 [35] chr17 38708445-38710424 * | O-8A-H3K27ac_peak_11.. 20 [36] chr17 38716283-38717201 * | O-8A-H3K27ac_peak_11.. 23 [37] chr17 38803702-38804538 * | O-8A-H3K27ac_peak_11.. 46 field8 field9 field10 [character] [character] [character] [1] 3.91024 4.82245 2.12195 [2] 4.44410 5.31600 2.51785 [3] 4.30183 6.25865 3.33754 [4] 3.94862 4.34253 1.75554 [5] 3.74929 4.57115 1.93712 ... ... ... ... [33] 3.80116 4.07741 1.55623 [34] 4.35182 5.72967 2.89524 [35] 3.88836 4.73101 2.06784 [36] 4.24210 5.08399 2.33889 [37] 5.09106 7.93609 4.68279 ------- ``` --- # BEDbase Python client <img src="/_modules/bedbase/bbclient.png" width="100%"/> --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="color:yellow">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- ## bedhost A FastAPI application following JAMstack philosophy.<br> <img src="/_modules/bedbase/jamstack.svg" width="100%"> <span class="fragment">JAMstack forces you to build a comprehensive API. 
</span> --- ### OpenAPI interface <span class="small"><a href="https://api.bedbase.org/v1/docs">https://api.bedbase.org/v1/docs</a></span> <div class="wrap"> <iframe src="https://api.bedbase.org/v1/docs" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="background: #FFFFFF;"></iframe> </div> --- ### BED info via API <span class="small"><a href="https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true">https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true</a></span> <div class="wrap"> <iframe src="https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="background: #FFFFFF;"></iframe> </div> --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="color:yellow">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- ## BEDbase data layer 1. BED files stored in Backblaze B2 (S3-compatible object store) - BED files n=21,438 (stats from 2025-05) - 346,071 total objects, 186.4 GB ($6/TB/month) 2. The B2 interface is routed through the Cloudflare CDN (free egress!) 3. File metadata stored in a PostgreSQL database on AWS's managed Relational Database Service ## → Minimal maintenance cost --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="color:yellow">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- - `bbconf`: BEDbase configuration object and database connection - `bedqc`: a pipeline for QC of BED files.
- `bedmaker`: a pipeline to convert non-BED files into BED files - `bedstat`: a pipeline to calculate stats for a BED file - `bedbuncher`: a pipeline to create BEDsets - `bedembed`: a pipeline to create BED file embeddings --- <img src="/_modules/genomicdistributions/genomic_distributions_dark.svg" width="275" style="padding-top:5px; padding-bottom:5px"> <div class="small"> Docs: <a href="http://code.databio.org/GenomicDistributions/">http://code.databio.org/GenomicDistributions/</a><br> Code: <a href="http://github.com/databio/GenomicDistributions/">http://github.com/databio/GenomicDistributions/</a><br> <div class="bullet"> <a href="https://bioconductor.org/packages/GenomicDistributions"><img src="/_modules/genomicdistributions/bioconductor_logo_grey.svg" height="22" style="padding-top:5px; padding-bottom:5px"> bioconductor.org/packages/GenomicDistributions</a><br> </div> </div> <span class="small bullet"><img src="/_modules/genomicdistributions/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1186/s12864-022-08467-y">Kupkova et al.
(2022).</a> <i>BMC Genomics</i>.</span><br/> --- <img src="/_modules/genomicdistributions/genomicdistributions_summary.svg" width="100%"> --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="color:yellow">Data served (content)</li> </ol> --- <div> <img src="/_modules/geofetch/geofetch_logo.svg" width="275"><br> Connects the Gene Expression Omnibus (GEO) <br> and Sequence Read Archive (SRA) <br> with PEP format<br> </div> <br><span class="small bullet"><img src="/_modules/geofetch/web.svg" height="25" class="bullet"><a href="https://geofetch.databio.org">geofetch.databio.org</a></span> --- ``` geofetch --filter="bed|bigBed|narrowPeak|broadPeak" ``` --- ### Conclusion <ul> <li>BEDbase provides BED data for humans and machines</li> <li>Output includes statistical and biological visualizations</li> <li>Upcoming human-friendly search is powerful</li> <li>Programmatic access to data chunks improves interoperability</li> </ul> --- ## Take-home points - Analysis of genomic intervals is fruitful and interesting - NLP methods are the future of genomic interval analysis - BEDbase can become the 'one-stop-shop' for interval data - Neural networks are like pigs --- <!-- .slide: id="acknowledgements" data-background="/images/presentations/bg.svg.png" --> ## Thank You <img src="/images/people/group/2022-06-01.jpg" width="600"> <div class="small bullet"> Collaborators: Aidong Zhang, Don Brown<br> Funding: R01-HG012558; R35-GM128536 </div> <br clear="all"/> <span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><img
src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span> <div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9"> <img src="/images/external/uva_dgs_logo.svg" height="65"> <img src="/images/logo/logo_databio_long.svg" height="45"> </div>