# Machine learning approaches for analysis of genomic regions

Nathan Sheffield

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>
---

# Talk outline

- Background
- R01 aims
	- Aim 1: Efficient interval set comparison
	- Aim 2: Word embeddings → genomic interval embeddings
	- Aim 3: A Hidden Markov Model for constructing genomic interval universes

---

# Motivation

What does the genome *encode*?

Sequence → Function

---

---

---

How can we integrate all this data?

---

## Example integration tasks

- Define derived sets by computing overlaps
- Identify similarity among interval sets
- Given a new interval set, find the most similar existing sets
- Extract experiment signal levels at a set of regions
- Build meta-locus plots by averaging signal across intervals
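
As a rough illustration of the first task, a derived set can be computed with a sweep over two sorted interval lists. This is a minimal sketch, not the method used in this work: the function name `intersect_intervals`, the half-open `(start, end)` convention, and the single-chromosome, non-overlapping inputs are all assumptions.

```python
def intersect_intervals(a, b):
    """Overlaps between two sorted, internally non-overlapping lists
    of half-open (start, end) intervals on one chromosome (assumed)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# toy data, invented for illustration
peaks_a = [(100, 200), (300, 400)]
peaks_b = [(150, 350), (390, 500)]
overlaps = intersect_intervals(peaks_a, peaks_b)
```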

---

## Novel methods for large-scale genomic interval comparison

Aim 1:
Develop more efficient algorithms for interval set comparison

Aim 2:
Develop and evaluate vector representations
of genomic region sets

Aim 3:
Develop methods for building
and evaluating interval universes

NHGRI R01-HG012558 (2022-08 to 2026-05)

---

## Aim 1

Develop more efficient algorithms for interval set comparison

---

What does it mean for two region sets to be similar?<br>
What am I really looking for in a region set query?

<div class="fragment">Overlap makes some sense... but what about: <br>

degree of overlap?</div>
<div class="fragment">weighting of specific regions?</div>
<div class="fragment">relationships among regions?</div>
<div class="fragment">biological function of each region?</div>

---

### The bag-of-words model for text classification

<span class="fragment">What about a bag-of-intervals model for genomic intervals?</span>

---

### The bag-of-intervals model for genomic intervals

<ul class="fragment">Advantages
<li>Vector representation of a region set</li>
<li>Similarity metrics among vectors</li>
<li>Lower space and time complexity than interval sets</li>
</ul>
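
A minimal sketch of the idea, assuming a fixed universe of intervals on one chromosome: each region set becomes a binary vector with one position per universe interval, and sets are compared with a vector similarity. The helper names `bag_of_intervals` and `cosine` and the toy coordinates are hypothetical.

```python
import math

def bag_of_intervals(region_set, universe):
    """Binary vector: 1 if a universe interval overlaps any region in the set."""
    vec = []
    for u_start, u_end in universe:
        hit = any(s < u_end and u_start < e for s, e in region_set)
        vec.append(1 if hit else 0)
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# toy universe and region sets, invented for illustration
universe = [(0, 100), (100, 200), (200, 300), (300, 400)]
set_a = [(10, 50), (250, 260)]
set_b = [(20, 30), (210, 220), (350, 360)]
vec_a = bag_of_intervals(set_a, universe)
vec_b = bag_of_intervals(set_b, universe)
similarity = cosine(vec_a, vec_b)
```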

---

# Bloom filter

A space-efficient probabilistic data structure that tests whether an element is a member of a set. A query returns either "possibly in set" or "definitely not in set".

Image: wikipedia
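
A minimal Bloom filter sketch, for illustration only. The class name, the default sizes, and the use of salted SHA-256 to derive bit positions are assumptions, not the implementation used in this work.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # derive k bit positions by salting one hash function with a seed
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True means "possibly in set"; False means "definitely not in set"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("chr1:100-200")  # toy interval tokens, invented for illustration
bf.add("chr1:300-400")
```

Added items are always reported as present; items that were never added are usually, but not always, reported as absent.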

---

### Limitations of the bag-of-words vector approach

<ul>
	<li>Sparsity</li>
	<li>Curse of dimensionality</li>
	<li>Space and time complexity are still an issue</li>
	<li>No concept of relationships among words
<pre style="color:#AAAAFF; text-align:center"><code>
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
</code></pre>
	</li>
</ul>
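
The last limitation can be checked directly: with one-hot vectors, any two distinct words have a dot product of zero, so "hotel" is no closer to "motel" than to any other word. A small sketch using the two vectors above (the `dot` helper is added for illustration):

```python
hotel = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
motel = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

hotel_motel = dot(hotel, motel)  # distinct words never share a position
hotel_hotel = dot(hotel, hotel)  # a word only matches itself
```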

---

## Aim 2

Develop and evaluate vector representations
of genomic region sets

---

<div>
<h3>Region-set 2 Vec</h3>

Embeddings of genomic region sets <br> in lower dimensions.

<div class="small">
<a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br>
</div>
</div>
<span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). <i>Bioinformatics</i>.</a></span>

---

### Word embeddings

<div class="small">http://suriyadeepan.github.io</div>

---

### Word2vec model

<img src="/_modules/regionset2vec/mikolov2013_fig1.png" width="680">
<br><span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://arxiv.org/abs/1301.3781">Mikolov et al. (2013). <i>arXiv:1301.3781v3</i>.</a></span>

---

### Word context

<div class="well">
	You shall know a word by the company it keeps. (Firth 1957)<br>
	Words that occur in similar contexts tend to have similar meanings.
</div>
<div class="small">Image credit: Shubham Agarwal</div>

---

### Genomic context

<div class="well">
	A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
</div>
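
One way to make this notion of context concrete is to count how often two universe intervals appear in the same file. This is a minimal sketch with toy data: the file names and interval tokens are invented, and a co-occurrence count is a simplification, not the word2vec training procedure itself.

```python
from collections import Counter
from itertools import combinations

# each "document" is the set of universe tokens found in one BED file (toy data)
bed_files = {
    "fileA.bed": {"chr1:100-200", "chr1:500-600", "chr2:50-150"},
    "fileB.bed": {"chr1:100-200", "chr1:500-600"},
    "fileC.bed": {"chr2:50-150", "chr3:10-90"},
}

# count every unordered pair of tokens that shares a file
cooccurrence = Counter()
for tokens in bed_files.values():
    for pair in combinations(sorted(tokens), 2):
        cooccurrence[pair] += 1

top_pair, top_count = cooccurrence.most_common(1)[0]
```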

---

---

### Genomic Interval Embeddings

---

### Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.<br>
Do relationships among vectors reflect biology?

---

## Evaluation 1: Classification performance

---

## Evaluation 1: Classification performance

---

### Evaluation 1: Classification performance

---

### Conclusion

<ul>
	<li>Regionset2vec adapts word2vec to learn genomic region embeddings</li>
	<li>Regionset2vec embeddings capture biological information</li>
	<li>NLP approaches can be adapted for applications in genomic interval analysis</li>
</ul>

---

## Evaluation of genomic interval embeddings

<div class="well">
Assumption:<br>
Proximity in linear genome space <br>increases probability of similar function
</div>

<div class="col2">
Neighborhood preserving tests

![Neighborhood Preserving Schematic](/_modules/regionset2vec-extension/neighborhood-preserving-schematic.svg)
</div>

<div class="col2 fragment">
Results

![Neighborhood Preserving Results](/_modules/regionset2vec-extension/neighborhood-preserving-results.svg)
</div>

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/guangtao.jpg" width="100" style="margin:0px;">
<br>Guangtao Zheng
</span>
</div>
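
One plausible way to score neighborhood preservation, sketched under stated assumptions: for each region, take its K nearest neighbors by genomic midpoint and by embedding distance, then report the average fraction shared. The exact test is not specified on this slide; the function names, the single-chromosome midpoints, and the toy vectors are hypothetical.

```python
def knn(index, points, k, dist):
    """Indices of the k points nearest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: dist(points[index], points[j]))
    return set(others[:k])

def neighborhood_preserving_ratio(midpoints, embeddings, k):
    """Average overlap between genome-space and embedding-space neighbors."""
    genome_dist = lambda a, b: abs(a - b)
    embed_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    total = 0.0
    for i in range(len(midpoints)):
        g = knn(i, midpoints, k, genome_dist)
        e = knn(i, embeddings, k, embed_dist)
        total += len(g & e) / k
    return total / len(midpoints)

# toy data: two pairs of nearby regions on one chromosome
midpoints = [100, 200, 5000, 5100]
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
ratio = neighborhood_preserving_ratio(midpoints, embeddings, k=1)

# the same vectors assigned to the wrong regions
shuffled = [(5.0, 5.0), (0.0, 0.0), (0.1, 0.0), (5.1, 5.0)]
shuffled_ratio = neighborhood_preserving_ratio(midpoints, shuffled, k=1)
```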

---

## Evaluation of genomic interval embeddings

<div class="well">
Assumption:<br>
Proximity in linear genome space <br>increases probability of similar function
</div>

<div class="col2">
Grouped average distance tests

![Grouped Average Distances Schematic](/_modules/regionset2vec-extension/grouped-average-distances-schematic.svg)
</div>

<div class="col2 fragment">
Results

![Grouped Average Distances Results](/_modules/regionset2vec-extension/grouped-average-distances-results.svg)
</div>

---

## Tokenization and universe selection

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/Julia.jpg" width="100" style="margin:0px;">
<br>Julia Rymuza
</span>
</div>

![Universe Robustness](/_modules/regionset2vec-extension/universe-robustness.svg)

---

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/Erfaneh.jpg" width="100" style="margin:0px;">
<br>Erfaneh Gharavi
</span>
</div>

Joint representation learning <br> of genomic interval sets and metadata

![Starspace Scenario 1](/_modules/regionset2vec-extension/scenario-1.svg)

<div class="fragment">
![Starspace Method Overview](/_modules/regionset2vec-extension/method-overview.svg)
</div>

---

![Starspace Embedding Distances](/_modules/regionset2vec-extension/starspace-embedding-distances.svg)

---

Caveat: These embeddings depend critically on the universe (vocabulary)

# Universe

The set of genomic intervals that *could have* been included

How do we determine the universe?<br>
How can we assess universe fit?

---

## Aim 3

Develop methods for building
and evaluating interval universes

Task 1:
collection of region sets → universe

Task 2:
collection of region sets + universe → fit score

---

# Task 1: Building universes

Some simple universes
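
Two candidate simple universes, sketched for illustration: the union of all regions in the collection, and fixed-width tiles across the genome. Whether these match the universes shown in the talk is an assumption; the function names and toy coordinates are hypothetical.

```python
def union_universe(region_sets):
    """Merge all regions from all sets into non-overlapping intervals."""
    regions = sorted(r for rs in region_sets for r in rs)
    merged = []
    for start, end in regions:
        if merged and start <= merged[-1][1]:
            # overlapping or abutting: extend the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def tile_universe(genome_length, width):
    """Fixed-width, half-open tiles covering [0, genome_length)."""
    return [(s, min(s + width, genome_length))
            for s in range(0, genome_length, width)]

# toy collection on a single chromosome
region_sets = [[(100, 200), (400, 500)], [(150, 250), (600, 700)]]
union = union_universe(region_sets)
tiles = tile_universe(1000, 400)
```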

---

# Task 2: Evaluating interval universes

1. Sensitivity and specificity
2. A likelihood model for universe fit
3. Start and end distance evaluation

---

# Sensitivity and specificity

sensitivity = How much of the interval set is within the universe?<br>
specificity = How much "unused" universe is there?
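
A base-level sketch of these two quantities, under assumptions: sensitivity is taken as the fraction of region-set bases that fall inside the universe, and the "unused" universe is one minus the fraction of universe bases covered by the region set. The exact definitions used in this work may differ; the function names and toy coordinates are hypothetical.

```python
def covered_bases(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases

def universe_fit(region_set, universe):
    query = covered_bases(region_set)
    univ = covered_bases(universe)
    shared = len(query & univ)
    sensitivity = shared / len(query)  # share of the region set inside the universe
    used = shared / len(univ)          # share of the universe that is used
    return sensitivity, used

# toy data on a single chromosome
universe = [(0, 100), (200, 300)]
region_set = [(50, 100), (250, 350)]
sensitivity, used_fraction = universe_fit(region_set, universe)
unused_fraction = 1 - used_fraction
```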

---

---

# Sensitivity and specificity

Problem: this score is not sensitive to abutting regions

---

# A likelihood model for universe fit

Given a collection of region sets and a proposed universe, what is the likelihood that the proposed universe was drawn from the distribution of region sets?

---

Given a collection of $n$ region sets $\mathbf{R} = [R_1, R_2, \ldots, R_n]$,
where each $R_j$ denotes a region set $R_j = [r_1, r_2, \ldots, r_m]$

1. Build a sequential model that counts the frequency of *core* (overlap) across $\mathbf{R}$

2. For a proposed universe $\mathcal{U}_1$, calculate the likelihood of the universe given the data.

---

The likelihood of the universe given the data:

$\mathcal{L}( \mathcal{U} \mid \mathbf{R} ) = \prod_{i=1}^g I \times (\pi_i^c \pi_i^b)$

Where:

$g$ is the number of bases in the genome

$\pi_i^c$ = probability of *core*
($\frac{\mathrm{freq}_{core}(i)}{S_c}$ where $S_c= \sum_{i=1}^g \mathrm{freq}_{core}(i)$)

$\pi_i^b$ = probability of *background*
($\frac{\mathrm{freq}_{background}(i)}{S_b}$ where $S_b= g - S_c$)
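
A minimal sketch of the *core* part of this model on a toy genome: count how many region sets cover each base, then normalize by $S_c$ to get $\pi_i^c$. The function name `core_frequency` and the toy data are hypothetical, and the background term is left out because its frequency is not defined here.

```python
def core_frequency(region_sets, genome_length):
    """For each base, count how many region sets cover it."""
    freq = [0] * genome_length
    for region_set in region_sets:
        covered = set()
        for start, end in region_set:
            covered.update(range(start, min(end, genome_length)))
        # each region set contributes at most once per base
        for i in covered:
            freq[i] += 1
    return freq

# toy collection: three region sets on a 10-base genome
region_sets = [[(2, 6)], [(4, 8)], [(4, 6)]]
freq_core = core_frequency(region_sets, genome_length=10)
s_c = sum(freq_core)                    # S_c in the formula above
pi_core = [f / s_c for f in freq_core]  # pi_i^c for each base i
```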

---

---

---

---

---

# Start and end distance evaluation

---

# Task 1: Building universes

A hidden Markov model
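
The structure of the model is not described on this slide, so the following is only a generic sketch of Viterbi decoding for a two-state HMM. The state names, the "low"/"high" coverage symbols, and all probabilities are invented for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a discrete-emission HMM (log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []
    for o in obs[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in states:
            # best previous state for arriving in state s
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            row[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][o])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # trace back from the best final state
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# hypothetical parameters, not fitted to any data
states = ("background", "core")
start_p = {"background": 0.9, "core": 0.1}
trans_p = {"background": {"background": 0.9, "core": 0.1},
           "core": {"background": 0.1, "core": 0.9}}
emit_p = {"background": {"low": 0.9, "high": 0.1},
          "core": {"low": 0.2, "high": 0.8}}
coverage = ["low", "low", "high", "high", "high", "low", "low"]
path = viterbi(coverage, states, start_p, trans_p, emit_p)
```

In this toy run, the contiguous run of "core" states would be read as one universe interval.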

---

---

## Acknowledgments

<div class="col3" style="font-size:.6em">
**Collaborators**
<br>Aidong Zhang
<br>Don Brown
<br>Guangtao Zheng
</div>

<div class="col3" style="font-size:.6em">
**Sheffield lab**
<br>Erfaneh Gharavi
<br>Kristyna Kupkova
<br>John Stubbs
<br>Bingjie Xue
<br>Jose Verdezoto
<br>Nathan LeRoy
<br>Oleksandr Khoroshevskyi
<br>Julia Rymuza
</div>

<div class="col3" style="font-size:.6em">
**Funding:**
<br><img src="/slides/2023-02-cphg-rip/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/slides/2023-02-cphg-rip/University_of_Virginia_logo_white.svg" height="40">
<br><img src="/slides/2023-02-cphg-rip/NIH_logo_black.svg" height="80">
<br>NIGMS R35-GM128636
<br>NHGRI R01-HG012558
</div>

---