<style>
#title {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# Machine learning approaches for analysis of genomic regions

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

# Talk outline

- Background
- R01 aims
- Aim 1: Interval set universes
- Aim 2: Word embeddings → genomic interval embeddings
- Aim 3: A Hidden Markov Model for constructing genomic interval universes

---

# Motivation

What does the genome *encode*?

Sequence → Function

---

<img alt="Genomic intervals are a universal language of biology" style="width:1200px" src="/slides/2023-02-cphg-rip/genomic-intervals.svg">

---

<img alt="Growth of BED-like files on GEO" src="/slides/2023-02-cphg-rip/geo-bed.svg" class="whitebg" style="background:white;padding:10px">

---

<img alt="Genomic intervals are a universal language of biology" style="width:525px;padding:15px" src="/slides/2023-02-cphg-rip/genomic-intervals.svg"><img alt="Growth of BED-like files on GEO" src="/slides/2023-02-cphg-rip/geo-bed.svg" class="whitebg" style="width:375px; background:white;padding:15px">

How can we integrate all this data?

---

## Example integration tasks

- Define derived sets by computing overlaps
- Identify similarity among interval sets
- Given a new interval set, find the most similar existing sets
- Extract experiment signal levels at a set of regions
- Build meta-locus plots by averaging signal across intervals

---

## Novel methods for large-scale genomic interval comparison

Aim 1: Develop more efficient algorithms for interval set comparison

Aim 2: Develop and evaluate vector representations of genomic region sets

Aim 3: Develop methods for building and evaluating interval universes

NHGRI R01-HG012558 (2022-08 to 2026-05)

---

## Aim 1

Develop more efficient algorithms for interval set comparison

---

<img src="/slides/2023-02-cphg-rip/interval-similarity-overlap.svg">

What does it mean for two region sets to be similar?<br>
What am I really looking for in a region set query?

<div class="fragment">Overlap makes some sense... but what about: <br> degree of overlap?</div>
<div class="fragment">weighting of specific regions?</div>
<div class="fragment">relationships among regions?</div>
<div class="fragment">biological function of each region?</div>

---

### The bag-of-words model for text classification

<img src="/slides/2023-02-cphg-rip/bag-of-words.svg" width="100%">

<span class="fragment">What about a bag-of-intervals model for genomic intervals?</span>

---

### The bag-of-intervals model for genomic intervals

<img src="/slides/2023-02-cphg-rip/interval-universe.svg" width="680">

<ul class="fragment">Advantages
<li>Vector representation of a region set</li>
<li>Similarity metrics among vectors</li>
<li>Lower space and time complexity than interval sets</li>
</ul>

---

# Bloom filter

A space-efficient probabilistic data structure that tests whether an element is a member of a set. A query returns either "possibly in set" or "definitely not in set".
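
The two possible answers can be seen in a minimal Python sketch. Everything here is illustrative: the `BloomFilter` class, the bit-array size, the salted SHA-256 hashing, and the `chrom:start-end` string encoding of regions are assumptions made for this example, not the implementation used in this work.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: k hashed positions per item in an m-slot array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k positions by salting a SHA-256 digest with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All positions set: "possibly in set". Any position unset: "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical regions, encoded as strings purely for illustration.
bf = BloomFilter()
for region in ["chr1:100-200", "chr1:500-800", "chr2:50-90"]:
    bf.add(region)

print("chr1:100-200" in bf)  # True: an added item is never reported as missing
```

A query for an item that was never added usually returns `False`, but it can return `True` when its hashed positions collide with those of added items; that false-positive rate is the price of the compact representation.
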
<img src="/slides/2023-02-cphg-rip/Bloom_filter.svg" width="680" style="background:white; padding:15px">

Image: Wikipedia

---

### Limitations of the bag-of-words vector approach

<ul>
<li>Sparsity</li>
<li>Curse of dimensionality</li>
<li>Space and time complexity are still an issue</li>
<li>No concept of relationships among words</li>
<code><pre style="color:#AAAAFF; text-align:center">
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
</pre></code>
</ul>

---

## Aim 2

Develop and evaluate vector representations of genomic region sets

---

<div>
<h3>Region-set 2 Vec</h3>
Embeddings of genomic region sets <br> in lower dimensions.
<div class="small">
<a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br>
</div>
</div>

<span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). <i>Bioinformatics</i>.</a></span>

---

### Word embeddings

<img src="/_modules/regionset2vec/word-vector-space-similar-words.jpg" width="680">

<div class="small">http://suriyadeepan.github.io</div>

---

### Word2vec model

<img src="/_modules/regionset2vec/mikolov2013_fig1.png" width="680">
<br><span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://arxiv.org/abs/1301.3781">Mikolov et al. (2013). <i>arXiv:1301.3781v3</i>.</a></span>

---

### Word context

<img src="/_modules/regionset2vec/word-context.png" width="640" style="background:white">

<div class="well">
You shall know a word by the company it keeps. (Firth 1957)<br>
Words that occur in similar contexts tend to have similar meanings.
</div>

<div class="small">Image credit: Shubham Agarwal</div>

---

### Genomic context

<div class="well">
A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
</div>

---

<img src="/_modules/regionset2vec/complexity-scale.svg" width="1040">

---

### Genomic Interval Embeddings

<img src="/_modules/regionset2vec/method_detail_v3.svg" width="1040" style="background:white">

---

### Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.<br>
Do relationships among vectors reflect biology?

<div class="fragment">
<img src="/_modules/regionset2vec/method_overview_v3.svg" width="1040" style="background:white">
</div>

---

## Evaluation 1: Classification performance

<img src="/_modules/regionset2vec/evaluation-classification-result.svg" width="740">

---

## Evaluation 1: Classification performance

<img src="/_modules/regionset2vec/evaluation-classification-result-2.svg" width="740">

---

### Evaluation 1: Classification performance

<img src="/_modules/regionset2vec/umap_classification.svg" width="740" style="background:white">

<div class="fragment">
<img src="/_modules/regionset2vec/umap_classification2.svg" width="740" style="background:white">
</div>

---

### Conclusion

<ul>
<li>Regionset2vec adapts word2vec to learn genomic region embeddings</li>
<li>Regionset2vec embeddings capture biological information</li>
<li>NLP approaches can be adapted for applications in genomic interval analysis</li>
</ul>

---

## Evaluation of genomic interval embeddings

<div class="well">
Assumption:<br>
Proximity in linear genome space
<br>increases probability of similar function
</div>

<div class="col2">
Neighborhood preserving tests
</div>

<div class="col2 fragment">
Results
</div>

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/guangtao.jpg" width="100" style="margin:0px;">
<br>Guangtao Zheng
</span>
</div>

---

## Evaluation of genomic interval embeddings

<div class="well">
Assumption:<br>
Proximity in linear genome space
<br>increases probability of similar function
</div>

<div class="col2">
Grouped average distance tests
</div>

<div class="col2 fragment">
Results
</div>

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/guangtao.jpg" width="100" style="margin:0px;">
<br>Guangtao Zheng
</span>
</div>

---

## Tokenization and universe selection

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/Julia.jpg" width="100" style="margin:0px;">
<br>Julia Rymuza
</span>
</div>

---

<div style="padding:12px; font-size: 16pt; position:absolute; top:-50px;right:-100px">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/regionset2vec-extension/Erfaneh.jpg" width="100" style="margin:0px;">
<br>Erfaneh Gharavi
</span>
</div>

Joint representation learning <br> of genomic interval sets and metadata

<div class="fragment">
</div>

---

---

Caveat: These embeddings depend critically on the universe (vocabulary)

# Universe

<img src="/slides/2023-02-cphg-rip/interval-universe.svg" width="680">

The set of genomic intervals that *could have* been included

How do we determine the universe?

How can we assess universe fit?

---

## Aim 3

Develop methods for building and evaluating interval universes

Task 1: collection of region sets → universe

Task 2: collection of region sets + universe → fit score

---

# Task 1: Building universes

Some simple universes

<img src="/slides/2023-02-cphg-rip/simple-universes.svg" height="400" style="background:white">

---

# Task 2: Evaluating interval universes

1. Sensitivity and specificity
2. A likelihood model for universe fit
3. Start and end distance evaluation

---

# Sensitivity and specificity

<img src="/slides/2023-02-cphg-rip/universe-sensitivity-concept.svg" height="300">

sensitivity = How much of the interval set is within the universe?

specificity = How much "unused" universe is there?

---

<img src="/slides/2023-02-cphg-rip/TF_400_plot.png" height="400">

---

# Sensitivity and specificity

<img src="/slides/2023-02-cphg-rip/universe-sensitivity-concept.svg" height="300">
<img src="/slides/2023-02-cphg-rip/universe-merging-problem.svg" height="300">

Problem: this score is not sensitive to abutting regions

---

# A likelihood model for universe fit

Given a collection of region sets and a proposed universe, what is the likelihood that the proposed universe was drawn from the distribution of region sets?

---

Given a collection of $n$ region sets $\mathbf{R} = [R_1, R_2, \ldots, R_n]$, where each $R_j$ denotes a region set $R_j = [r_1, r_2, \ldots, r_m]$:

1. Build a sequential model that counts the frequency of *core* (overlap) across $\mathbf{R}$
2. For a proposed universe $\mathcal{U}_1$, calculate the likelihood of the universe given the data.

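
---

The two steps above can be sketched in Python on a toy genome. This is only an illustration under stated assumptions: the function names, the half-open `(start, end)` interval encoding, and in particular the scoring rule (a per-base Bernoulli model with a pseudocount) are choices made for this example and are not necessarily the model used in this work.

```python
import math


def core_frequency(region_sets, genome_length):
    """Step 1: for each base, count how many region sets cover it."""
    freq = [0] * genome_length
    for regions in region_sets:
        covered = set()
        for start, end in regions:  # half-open intervals [start, end)
            covered.update(range(start, end))
        for i in covered:  # each region set contributes at most 1 per base
            freq[i] += 1
    return freq


def universe_log_likelihood(universe, region_sets, genome_length, pseudocount=0.5):
    """Step 2 (assumed scoring rule): sum of per-base log probabilities."""
    n = len(region_sets)
    freq = core_frequency(region_sets, genome_length)
    in_universe = [False] * genome_length
    for start, end in universe:
        for i in range(start, end):
            in_universe[i] = True
    total = 0.0
    for i in range(genome_length):
        # Smoothed probability that base i is "core"; the complement is "background".
        p_core = (freq[i] + pseudocount) / (n + 2 * pseudocount)
        total += math.log(p_core if in_universe[i] else 1.0 - p_core)
    return total


# Toy data: three region sets on a 10-base genome.
region_sets = [[(2, 5)], [(3, 6)], [(2, 6)]]
print(core_frequency(region_sets, 10))  # [0, 0, 2, 3, 3, 2, 0, 0, 0, 0]

good = universe_log_likelihood([(2, 6)], region_sets, 10)
poor = universe_log_likelihood([(7, 9)], region_sets, 10)
print(good > poor)  # True: the universe covering the core scores higher
```

Working in log space avoids numerical underflow when the product runs over every base of a real genome.
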
---

The likelihood of the universe given the data:

$\mathcal{L}( \mathcal{U} | \mathbf{R} ) = \prod_{i=1}^g I \times (\pi_i^c \pi_i^b)$

Where:

$g$ is the number of bases in the genome

$\pi_i^c$ = probability of *core* ($\frac{freq_{core}(i)}{S_c}$ where $S_c= \sum_{i=1}^g freq_{core}(i)$)

$\pi_i^b$ = probability of *background* ($\frac{freq_{background}(i)}{S_b}$ where $S_b= g - S_c$)

---

<img src="/slides/2023-02-cphg-rip/likelihood-1.svg" height="600" style="background:white">

---

<img src="/slides/2023-02-cphg-rip/coverage_likelihood.svg" height="600" style="background:black">

---

<img src="/slides/2023-02-cphg-rip/likelihood-2.svg" height="600" style="background:white">

---

<img src="/slides/2023-02-cphg-rip/simple-universes-coverage.svg" height="400" style="background:white">

---

# Start and end distance evaluation

<img src="/slides/2023-02-cphg-rip/universe-eval-distances.svg" height="400" style="background:white">

---

# Task 1: Building universes

A hidden Markov model

<img src="/slides/2023-02-cphg-rip/hmm.svg" height="400" style="background:white; padding:25px">

---

<img src="/slides/2023-02-cphg-rip/TF_400_likelihood_scale.svg" height="400" style="background:white; margin:25px">

---

## Acknowledgments

<div class="col3" style="font-size:.6em">
**Collaborators**
<br>Aidong Zhang
<br>Don Brown
<br>Guangtao Zheng
</div>

<div class="col3" style="font-size:.6em">
**Sheffield lab**
<br>Erfaneh Gharavi
<br>Kristyna Kupkova
<br>John Stubbs
<br>Bingjie Xue
<br>Jose Verdezoto
<br>Nathan LeRoy
<br>Oleksandr Khoroshevskyi
<br>Julia Rymuza
</div>

<div class="col3" style="font-size:.6em">
**Funding:**
<br><img src="/slides/2023-02-cphg-rip/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/slides/2023-02-cphg-rip/University_of_Virginia_logo_white.svg" height="40">
<br><img src="/slides/2023-02-cphg-rip/NIH_logo_black.svg" height="80">
<br>NIGMS R35-GM128636
<br>NHGRI R01-HG012558
</div>

---

<img src="/slides/2023-02-cphg-rip/fasib_algorithm.svg" height="400" style="background:white; margin:25px">