<style> #title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow"> # Epigenomes and intervals Nathan Sheffield <div class="bullet"> <img src="/images/external/uva_dgs_logo.svg" height="85"> <img src="/images/logo/logo_databio_long.svg" height="65"> </div> <span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span> </section> --- <section id="mission_statement"> # Mission statement We develop and apply computational methods<br/> to organize, analyze, and understand large epigenomic data.<br/><br/> <img src="/_modules/mission-statement/understand.svg" width="695"> <br> <img src="/_modules/mission-statement/logo_databio.svg" width="195"> </section> --- <section> # Full-stack bioinformatics <div class="col2"> <img src="/slides/intro-epigenome-analysis/pyramid_yellow.svg" width="500"><br> </div> <div class="col2"> <img src="/slides/intro-epigenome-analysis/integration.svg" width="500"><br> </div> </section> --- # Biological motivation <div class="col2"> <br><br> <img src="/_modules/bio-motivation-regulatory-dna/dna_folding_diversity.svg" width="400"><br/> Cells alter phenotype by using DNA differently. <br> </div> <div class="col2 fragment"> <img src="/_modules/bio-motivation-regulatory-dna/differentation_gone_awry.svg" width="500"><br/> Breakdowns lead to disease </div> --- ## What is epigenetics? --- <div class="well"> There has always been a place in biology for words that have different meanings for different people. Epigenetics is an extreme case, because it has several meanings with independent roots. (Bird 2007) </div> <div class="well"> The meaning of the term "epigenetics" has itself undergone an evolution. 
(Felsenfeld 2014) </div> --- <div class="well"> the causal study of embryological development (Waddington 1957, The strategy of the genes) </div> <img src="/_modules/epigenetics/waddington-epigenetics-hilight.png" width="1000"> --- <img src="/_modules/epigenetics/epigenetic_landscape.jpg" width="800"> --- <div class="well"> The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence (Riggs et al. 1996) </div> <img src="/_modules/epigenetics/division.svg" width="400"> --- <div class="well"> a change in the state of expression of a gene that does not involve a mutation, but that is nevertheless inherited in the absence of the signal (or event) that initiated the change. (Ptashne and Gant 2002) </div> <img src="/_modules/epigenetics/epigenetics-demo.svg" width="1000"> --- <div class="well"> the structural adaptation of chromosomal regions so as to register, signal or perpetuate altered activity states. (Bird 2007) </div> <img src="/_modules/epigenetics/bird-epigenetics.svg" width="500"> --- ## What is epigenetics? <div class="well"> the causal study of embryological development (Waddington 1957, The strategy of the genes) </div> <div class="well"> The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence (Riggs et al. 1996) </div> <div class="well"> a change in the state of expression of a gene that does not involve a mutation, but that is nevertheless inherited in the absence of the signal (or event) that initiated the change. (Ptashne and Gant 2002) </div> <div class="well"> the structural adaptation of chromosomal regions so as to register, signal or perpetuate altered activity states. (Bird 2007) </div> --- ## What is epigenetics? 
<div class="well"> Epigenetics refers to changes in gene regulation brought about through modifications to the DNA's packaging proteins or the DNA molecules themselves without changing the underlying sequence. (Lord and Cruchaga 2014, Nature Neuroscience) </div> <div class="well"> the study of the mechanisms that allow cells to translate the nearly constant genome content of a multicellular organism into multiple functional and stable cellular conditions (Schwartzman and Tanay 2015) </div> <div class="well"> Epigenetic processes are a means by which endogenous and exogenous cues exert long-term control over gene expression (Nugent et al. 2015) </div> --- ## What is epigenetics? The pop definition: <div class="well"> The word literally means "on top of genetics," and it's the study of how individual genes can be activated or deactivated by life experiences. (<i>The Week</i>, 2013) </div> --- <img src="/_modules/epigenetics/week_epigenetics.png" width="750"> --- <img src="/_modules/epigenetics/discover_epigenetics.png" width="750"> --- <img src="/_modules/epigenetics/division.svg" width="350"> <img src="/_modules/epigenetics/division_human.svg" width="350" class="fragment"> --- ## What is epigenomics? <div class="well"> epigenomics is the study of the physical modifications, associations and conformations of genomic DNA sequences (Schwartzman and Tanay 2015) </div> <div class="well"> epigenomics is the study of the chemical modification and physical conformation of cellular DNA and bound proteins (Sheffield 2017) </div> <i>The word "epigenome" lacks the baggage of heritability.</i> --- <img src="/_modules/epigenetics/rosa2013_chromatin.png" width="550"> Rosa et al. 
2013

---

## Histone variants

<img src="/_modules/epigenetics/Nucleosome_structure.png" width="650" style="background:white">

https://en.wikipedia.org/wiki/Histone_octamer

---

## Histone modification (PTM)

<img src="/_modules/epigenetics/Histone_modifications.png" width="1000" style="background:white">

https://en.wikipedia.org/wiki/Histone

---

## DNA Methylation

<img src="/_modules/epigenetics/dnameth_intro.svg" width="750">

---

## Chromatin conformation

<img src="/_modules/epigenetics/chromatin-conformation.png" width="900">

---

# Genomic intervals

---



---

## What can be represented as an interval?

- ChIP-seq or ATAC-seq peaks

---

## Peaks



Genomic intervals are often colloquially referred to as 'peaks'.

---

## What can be represented as an interval?

- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)

---

## SNPs

SNPs are intervals of width 1.



---

## What can be represented as an interval?

- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc.)

---

## Genes and gene components



---

## What can be represented as an interval?

- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc.)
- Non-coding DNA annotation (promoters, enhancers)

---

## Non-coding DNA annotation



---

## What can be represented as an interval?

- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc.)
- Non-coding DNA annotation (promoters, enhancers)
- Protein domains

---

## Protein domains



---

## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc.)
- Non-coding DNA annotation (promoters, enhancers)
- Protein domains
- Anything else?

---

# Key point

<span class="fragment"> Because of the linear nature of DNA and RNA, many biological entities can be conceptualized as genomic intervals. </span>

<span class="fragment">Genomic intervals are often a simplified abstraction of genomic sequence.</span>

<span class="fragment"> Interval operations are fundamental in genomics. </span>

---

<img src="/_modules/lola-intro/LOLA-logo-white.svg" width="275" style="padding-top:25px; padding-bottom:25px"> <br>

### Locus Overlap Analysis

<div class="small"> <a href="http://code.databio.org/LOLA/">http://code.databio.org/LOLA/</a><br> </div>

<span class="small bullet"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet">Sheffield and Bock (2016). <i>Bioinformatics</i>.</span><br/>
<span class="small bullet"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet">Nagraj, Magee, and Sheffield (2018). <i>Nucleic Acids Research</i>.</span>

---



---



---



---



---



---



---



---



---



---



---



---



---

## LOLAweb

<img src="/shorts/lola/LOLAweb-logo-white.svg" width="275" style="padding-top:25px; padding-bottom:25px">

A Shiny app and server for interactive LOLA analysis.
Public server: [http://lolaweb.databio.org](http://lolaweb.databio.org)

GitHub: [https://github.com/databio/LOLAweb](https://github.com/databio/LOLAweb)

---

### DEMO

<video controls width="800"> <source src="lw.webm" type="video/webm"> Your browser does not support the video tag. </video>

---

### Augmented Interval List (AIList)

A novel data structure for efficiently computing overlaps <br> across genomic interval data.<br>

<div class="small"> <a href="http://ailist.databio.org/">http://ailist.databio.org/</a><br> </div>

<span class="small bullet"><img src="/_modules/ailist/icons/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btz407">Feng et al. (2020). <i>Bioinformatics</i>.</a></span> <br>

<div style="padding:12px; font-size: 16pt; display:inline-block;"> <span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px"> <img src="/_modules/ailist/Feng.jpg" width="100" style="margin:0px;"> <br>Jianglin Feng </span> </div>

---

# LOLA refresher



---

# LOLA requires comparing sets of intervals



Can we improve the efficiency to enable faster, larger-scale analysis?

---

# If the subject list has no containment, identifying overlaps is fast

 <!-- .element: class="fragment" -->

Binary search on interval start positions, followed by backward steps: <!-- .element: class="fragment" -->



---

# The problem arises with contained interval overlaps





---

# How can we improve efficiency without guaranteeing no containment?
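---

A minimal Python sketch of the no-containment search described on the earlier slide, assuming half-open intervals stored as two parallel lists sorted by start. The function name `overlaps_no_containment` and the toy coordinates are hypothetical; this is an illustration, not code from LOLA or AIList.

```python
from bisect import bisect_left


def overlaps_no_containment(starts, ends, q_start, q_end):
    """Return indices of subject intervals overlapping [q_start, q_end).

    Assumes half-open intervals sorted by start, with no interval
    containing another, so the end values are sorted as well.
    """
    # Binary search: last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Backward steps: stop at the first interval that ends before the query
    # begins. This early stop is only valid when ends are sorted.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]


# No containment: both overlapping intervals are found.
print(overlaps_no_containment([1, 5, 10, 20], [4, 8, 15, 25], 6, 12))

# Containment: interval 0 = [1, 100) contains interval 1 = [2, 3).
# The backward scan stops at interval 2 and misses interval 0.
print(overlaps_no_containment([1, 2, 10], [100, 3, 15], 50, 60))
```

The second call returns no hits even though `[1, 100)` overlaps the query, which is the containment problem in miniature.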
---

# Many approaches address the 'containment' issue:

- Nested Containment Lists (GRanges) (Alekseyenko and Lee, 2007; Aboyoun, Pages, and Lawrence, 2012)
- R-trees (bedtools) (Kent et al., 2002; Quinlan and Hall, 2010)
- Augmented interval trees (Cormen et al., 2001)

These methods structure the data to provide non-containment guarantees.

---

# Methods provide non-containment guarantees

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">

### R-trees

Annotates tree nodes with a *minimum bounding rectangle* of elements. A query that does not intersect the bounding rectangle will not intersect any child element.

</div>
<div style="width: 45%;">

### Nested Containment Lists



</div>
</div>

---

# Augmented Interval List

1. Augment the list with the running maximum *end* value. *Solves the problem for lists with little containment.*
2. Decompose the list to minimize containment. *Extends the solution to lists with heavy containment.*

---

# Augment with the running maximum end value, `maxE`

Provides a *local guarantee* of no containment.



---

# AIList works on contained lists





---

# But long containment runs are problematic





---

# Decompose long runs with constant `maxE`



---

# Performance

- How does the `maxE` minimum run length affect performance?
- How does it compare to existing approaches?
- How does it scale with increasing size of the subject?

---

# Datasets



---

# How does the `maxE` minimum run length affect performance?



---

# How does it compare to existing approaches?



---

# How does it scale with increasing size of the subject?


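---

The first AIList step described above (augmenting the list with the running maximum end value, `maxE`) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the published implementation: intervals are half-open and sorted by start, the names `build_max_e` and `query_with_max_e` are hypothetical, and the decomposition step for long containment runs is omitted.

```python
from bisect import bisect_left
from itertools import accumulate


def build_max_e(ends):
    """Running maximum of the end values: max_e[i] = max(ends[0..i])."""
    return list(accumulate(ends, max))


def query_with_max_e(starts, ends, max_e, q_start, q_end):
    """Return indices of subject intervals overlapping [q_start, q_end)."""
    # Binary search: last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Step backward while some interval at or before i could still reach
    # the query. Once max_e[i] <= q_start, nothing earlier can overlap.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append(i)
        i -= 1
    return hits[::-1]


# Interval 0 = [1, 100) contains interval 1 = [2, 3).
starts = [1, 2, 10]
ends = [100, 3, 15]
max_e = build_max_e(ends)
print(max_e)
print(query_with_max_e(starts, ends, max_e, 50, 60))
```

In this toy list the containing interval keeps `maxE` at 100 for every position, so the scan walks all the way back to index 0. That is the long-run cost that the decomposition step is meant to limit.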
--- # Conclusion and Directions AIList is best-in-class for one-to-one interval comparisons --- ## Acknowledgments <div style="display: flex; justify-content: space-between;"> <div style="width: 30%; font-size: 0.6em;"> <img src="/shorts/ailist/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/shorts/ailist/University_of_Virginia_logo_white.svg" height="40"> **Sheffield lab** - John Lawson - Vince Reuter - Ognen Duzlevski - Jason Smith - **Jianglin Feng** - Michal Stolarczyk - Aaron Gu - Anant Tewari </div> <div style="width: 30%; font-size: 0.6em;"> **Funding:** <img src="/shorts/ailist/University_of_Virginia_logo_white.svg" height="40"> <img src="/shorts/ailist/NIH_logo_black.svg" height="80"> <img src="/shorts/ailist/hfsp_logo.svg" height="60"> </div> </div> --- ### Region-set 2 Vec Embeddings of genomic region sets <br> in lower dimensions. <div class="small"> <a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br> </div> <span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). <i>Bioinformatics</i>.</a></span> <br> <div style="padding:12px; font-size: 16pt; display:inline-block;"> <span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px"> <img src="/_modules/regionset2vec-extension/Erfaneh.jpg" width="100" style="margin:0px;"> <br>Erfaneh Gharavi </span> </div> --- <div> <h3>Region-set 2 Vec</h3> Embeddings of genomic region sets <br> in lower dimensions. <div class="small"> <a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br> </div> </div> <span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). 
<i>Bioinformatics</i>.</a></span> --- ### Word embeddings <img src="/_modules/regionset2vec/word-vector-space-similar-words.jpg" width="680"> <div class="small">http://suriyadeepan.github.io</div> --- ### Word2vec model <img src="/_modules/regionset2vec/mikolov2013_fig1.png" width="680"> <br><span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://arxiv.org/abs/1301.3781">Mikolov et al. (2013). <i>arXiv:1301.3781v3</i>.</a></span> --- ### Word context <img src="/_modules/regionset2vec/word-context.png" width="640" style="background:white"> <div class="well"> You shall know a word by the company it keeps. (Firth 1957)<br> Words that occur in similar contexts tend to have similar meanings. </div> <div class="small">Image credit: Shubham Agarwal</div> --- ### Genomic context <div class="well"> A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function. </div> --- <img src="/_modules/regionset2vec/complexity-scale.svg" width="1040"> --- ### Genomic Interval Embeddings <img src="/_modules/regionset2vec/method_detail_v3.svg" width="1040" style="background:white"> --- ### Evaluation We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.<br> Do relationships among vectors reflect biology? 
<div class="fragment"> <img src="/_modules/regionset2vec/method_overview_v3.svg" width="1040" style="background:white"> </div> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result.svg" width="740"> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result-2.svg" width="740"> --- ### Evaluation 1: Classification performance <img src="/_modules/regionset2vec/umap_classification.svg" width="740" style="background:white"> <div class="fragment"> <img src="/_modules/regionset2vec/umap_classification2.svg" width="740" style="background:white"> </div> --- ### Conclusion <ul> <li>Regionset2vec adapts word2vec to learn genomic region embeddings</li> <li>Regionset2vec embeddings capture biological information</li> <li>NLP approaches can be adapted for applications in genomic interval analysis</li> </ul> --- ## Acknowledgments <div class="col3" style="font-size:.6em"> **Collaborators** <br>Vince Reuter <br>Andre Rendeiro <br>Levi Waldron <br><br> **Alumni** <br>Aaron Gu <br>Jianglin Feng <br>Ognen Duzlevski <br>Tessa Danehy </div> <div class="col3" style="font-size:.6em"> **Sheffield lab** <br>Erfaneh Gharavi <br>Michal Stolarczyk <br>John Lawson <br>Jason Smith <br>Kristyna Kupkova <br>John Stubbs <br>Bingjie Xue <br>Jose Verdezoto <br>Nathan LeRoy <br>Oleksandr Khoroshevskyi </div> <div class="col3" style="font-size:.6em"> <b>Funding:</b><br> <br><img src="/_modules/ack-generic/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/_modules/ack-generic/University_of_Virginia_logo_white.svg" height="40"> <br><img src="/_modules/ack-generic/NIH_logo_black.svg" height="80"><br>NIGMS R35-GM128636 </div> --- <style> #acknowledgements { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="acknowledgements" data-background="/images/presentations/bg.svg.png"> # 
Thank You <br clear="all"/> <span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span> <div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9"> <img src="/images/external/uva_dgs_logo.svg" height="65"> <img src="/images/logo/logo_databio_long.svg" height="45"> </div> </section>