<style> #title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow"> # Recent advances in genomic interval analysis Nathan Sheffield <div class="bullet"> <img src="/images/external/uva_dgs_logo.svg" height="85"> <img src="/images/logo/logo_databio_long.svg" height="65"> </div> <span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span> </section> --- <!-- .slide: data-background="/images/presentations/bg.svg.png" data-transition-speed="slow" --> ### Outline <style> .previewblock { float: left; width: 20px; height: 45px; margin: 0; border: none; white-space: nowrap; box-sizing: border-box; } .questionblock { float: left; width: 100%; margin: 5px 0; border: 1px solid rgba(255, 255, 255, .2); } </style> <div class="previewblock" style="width:25%">Augmented Interval Lists</div> <div class="previewblock" style="width:25%"></div> <div class="previewblock" style="width:40%">Regionset2vec</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:25%">|</div> <div class="previewblock" style="width:25%"></div> <div class="previewblock" style="width:40%">|</div> <div class="previewblock" style="width:10%"></div> <br clear="all"> <div class="previewblock" style="width:25%; background:#883388">25%</div> <div class="previewblock" style="width:25%; background:#333388">25%</div> <div class="previewblock" style="width:40%; background:#338888">40%</div> <div class="previewblock" style="width:10%; background:#883333">10%</div> <div class="previewblock" style="width:25%"></div> <div class="previewblock" style="width:25%">|</div> <div class="previewblock" style="width:40%"></div> <div class="previewblock" style="width:10%">|</div> <br clear="all"> <div class="previewblock" style="width:25%"></div> <div class="previewblock" 
style="width:25%">Integrated Genome Database</div>
<div class="previewblock" style="width:40%"></div>
<div class="previewblock" style="width:10%">BEDbase</div>
<div class="questionblock" style="background:#222; color:#eee; font-size: 0.6em; margin-top: 35px">◁ Questions ▷</div>

---

# LOLA refresher

---

# LOLA requires comparing sets of intervals

Can we improve the efficiency to enable faster, larger-scale analysis?

---

# If subject list has no containment, identifying overlaps is fast

<!-- .element: class="fragment" --> Binary search on start coordinates, followed by backward steps: <!-- .element: class="fragment" -->

---

# The problem arises with contained interval overlaps

---

# How can we improve efficiency without guaranteeing no containment?

---

# Many approaches to solve the 'containment' issue:

- Nested Containment Lists (GRanges) (Alekseyenko and Lee, 2007; Aboyoun, Pages, and Lawrence, 2012)
- R-trees (bedtools) (Kent et al., 2002; Quinlan and Hall, 2010)
- Augmented interval trees (Cormen et al., 2001)

These methods try to structure the data to provide non-containment guarantees.

---

# Methods provide non-containment guarantees

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">

### R-trees

Annotates tree nodes with a *minimum bounding rectangle* of elements. A query that does not intersect the bounding rectangle will not intersect any child element.

</div>
<div style="width: 45%;">

### Nested Containment Lists

</div>
</div>

---

# Augmented Interval List

1. Augment the list with the running maximum *end* value.
   *Solves the problem for lists with little containment.*
2. Decompose the list to minimize containment.
   *Extends the solution to highly contained lists.*

---

# Augment with the running maximum end value, `maxE`

Provides a *local guarantee* of no containment.
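
A minimal Python sketch of the `maxE` idea, assuming half-open `(start, end)` intervals. The function names `build_ailist` and `query` are made up for illustration and are not taken from the AIList code. The sketch covers only the augmentation step and leaves out the decomposition step.

```python
from bisect import bisect_left
from itertools import accumulate


def build_ailist(intervals):
    """Sort half-open (start, end) intervals by start and attach the running maximum end."""
    ordered = sorted(intervals)
    starts = [s for s, _ in ordered]
    ends = [e for _, e in ordered]
    max_e = list(accumulate(ends, max))  # max_e[i] = max(ends[0..i])
    return starts, ends, max_e


def query(ailist, q_start, q_end):
    """Return all stored intervals that overlap [q_start, q_end)."""
    starts, ends, max_e = ailist
    hits = []
    # Binary search: last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    # Step backward; once max_e[i] <= q_start, nothing further left can overlap.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append((starts[i], ends[i]))
        i -= 1
    return hits


# (1, 20) contains the three intervals after it, so max_e stays at 20
# and the backward scan has to walk all the way to the front.
contained = build_ailist([(1, 20), (3, 5), (6, 8), (10, 12), (15, 30)])
print(query(contained, 11, 13))  # [(10, 12), (1, 20)]
```

Without the containing interval `(1, 20)`, the same query stops after a single backward step, which is the local guarantee in action.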
--- # AIList works on contained lists   --- # But long containment runs are problematic   --- # Decompose long runs with constant `maxE`  --- # Performance - How does the `maxE` minimum run length affect performance? - How does it compare to existing approaches? - How does it scale with increasing size of subject? --- # Datasets  --- # How does the `maxE` minimum run length affect performance?  --- # How does it compare to existing approaches?  --- # How does it scale with increasing size of subject?  --- # Conclusion and Directions AIList is best-in-class for one-to-one interval comparisons --- ## Acknowledgments <div style="display: flex; justify-content: space-between;"> <div style="width: 30%; font-size: 0.6em;"> <img src="/shorts/ailist/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/shorts/ailist/University_of_Virginia_logo_white.svg" height="40"> **Sheffield lab** - John Lawson - Vince Reuter - Ognen Duzlevski - Jason Smith - **Jianglin Feng** - Michal Stolarczyk - Aaron Gu - Anant Tewari </div> <div style="width: 30%; font-size: 0.6em;"> **Funding:** <img src="/shorts/ailist/University_of_Virginia_logo_white.svg" height="40"> <img src="/shorts/ailist/NIH_logo_black.svg" height="80"> <img src="/shorts/ailist/hfsp_logo.svg" height="60"> </div> </div> --- ## Integrated Genome Database (IGD) A high-performance search engine for large-scale genomic interval datasets. <div class="small"> <a href="https://github.com/databio/IGD">https://github.com/databio/IGD</a> </div> <span class="small bullet"><img src="/_modules/igd/icons/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btaa1062">Feng et al. (2021). 
<i>Bioinformatics</i>.</a></span>

<div style="padding:12px; font-size: 16pt; display:inline-block;">
<span style="border: 0px solid grey; float:right; margin: 0px 4px; padding: 0px 4px">
<img src="/_modules/igd/Feng.jpg" width="100" style="margin:0px;">
<br>Jianglin Feng
</span>
</div>

---

## Expanding the search space

<img src="/_modules/igd/subject-query-one-vs-one.svg">
<!-- .element: class="fragment" --><img src="/_modules/igd/subject-query-one-vs-many.svg">

---

## An integrated data structure

<img src="/_modules/igd/subject-query-one-vs-many-loop.svg">
<!-- .element: class="fragment" --><img src="/_modules/igd/subject-query-one-vs-many-integrate.svg">

---

## GIGGLE

GIGGLE indexes many interval sets with a B+ tree.

<img src="/_modules/igd/layer2018_fig1.png" height="450">

<span class="small bullet"><img src="/_modules/igd/icons/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1038/nmeth.4556">Layer et al. (2018). <i>Nature Methods</i>.</a></span>

---

## IGD uses linear binning

- The genome is divided into equal-size bins
- Database intervals are placed in every bin they overlap
- Intervals are sorted by start coordinate within a bin

<!-- .element: class="fragment" --><img src="/_modules/igd/igd/igd_database.svg" height="400" style="background: white">

---

## Advantages

- Single-layer data structure has less overhead
- Bins are independent

## Challenges

- Duplication = bigger database
- Duplication = possible double-counting

---

## Challenge 1: Database size

- Adjustable with bin size
- In practice: 5-20% bigger than raw, unduplicated data
- Can be 2x or more if bins are smaller than regions
- Default bin size: 16,384 (2<sup>14</sup>)

---

## Challenge 2: Double-counting

Occurs only when both the query and the subject interval cross the same bin boundary.
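
A small Python sketch of linear binning and of how one hit can be counted more than once, assuming half-open `(start, end)` intervals on a single chromosome. The names `build_bins` and `count_overlaps`, the tiny bin size, and the `dedupe` switch are all hypothetical and are not IGD's actual code. The skip check follows the left-boundary rule given on this slide.

```python
from collections import defaultdict

BIN_SIZE = 16  # tiny, for illustration only; the slides give 16,384 as the default


def build_bins(intervals, bin_size=BIN_SIZE):
    """Place each half-open (start, end) interval in every bin it overlaps."""
    bins = defaultdict(list)
    for start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[b].append((start, end))
    for members in bins.values():
        members.sort()  # sorted by start coordinate within each bin
    return bins


def count_overlaps(bins, q_start, q_end, bin_size=BIN_SIZE, dedupe=True):
    """Count stored intervals overlapping [q_start, q_end), visiting bin by bin."""
    first_bin = q_start // bin_size
    last_bin = (q_end - 1) // bin_size
    count = 0
    for b in range(first_bin, last_bin + 1):
        left_edge = b * bin_size
        for start, end in bins.get(b, ()):
            if start >= q_end or end <= q_start:
                continue  # no overlap with the query
            if dedupe and b > first_bin and start < left_edge:
                # Query and subject both cross this bin's left boundary,
                # so this pair was already counted in an earlier bin.
                continue
            count += 1
    return count


bins = build_bins([(10, 20), (14, 40), (30, 35)])
print(count_overlaps(bins, 12, 34, dedupe=False))  # 7: duplicated entries counted again
print(count_overlaps(bins, 12, 34))                # 3: each overlapping interval once
```

The three intervals occupy seven bin slots in total, which also illustrates why duplication makes the database bigger.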
**Rule:** If the query crosses the left boundary of the bin, then any region in the bin that also crosses the left boundary will be skipped <!-- .element: class="fragment" --><img src="/_modules/igd/igd/igd_database.svg" height="300" style="background: white"> --- ## Question: Within a bin, how are overlaps calculated? <img src="/_modules/igd/igd/igd_database.svg" height="300" style="background: white"> Can we use the AIList search algorithm? <!-- .element: class="fragment" -->Yes, but it doesn't help much because the bin size restricts the excess comparisons. --- ## Performance <img src="/_modules/igd/igd/igd_performance.svg" height="400" style="background: white"> --- ## Conclusion - IGD computes overlaps between a query and database of indexed interval sets - IGD uses linear binning to index collections of region sets - Because bins are independent, IGD uses little memory, and could be parallelized - IGD reduces database size and increases performance --- <div> <h3>Region-set 2 Vec</h3> Embeddings of genomic region sets <br> in lower dimensions. <div class="small"> <a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br> </div> </div> <span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). <i>Bioinformatics</i>.</a></span> --- ### Word embeddings <img src="/_modules/regionset2vec/word-vector-space-similar-words.jpg" width="680"> <div class="small">http://suriyadeepan.github.io</div> --- ### Word2vec model <img src="/_modules/regionset2vec/mikolov2013_fig1.png" width="680"> <br><span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://arxiv.org/abs/1301.3781">Mikolov et al. (2013). 
<i>arXiv:1301.3781v3</i>.</a></span> --- ### Word context <img src="/_modules/regionset2vec/word-context.png" width="640" style="background:white"> <div class="well"> You shall know a word by the company it keeps. (Firth 1957)<br> Words that occur in similar contexts tend to have similar meanings. </div> <div class="small">Image credit: Shubham Agarwal</div> --- ### Genomic context <div class="well"> A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function. </div> --- <img src="/_modules/regionset2vec/complexity-scale.svg" width="1040"> --- ### Genomic Interval Embeddings <img src="/_modules/regionset2vec/method_detail_v3.svg" width="1040" style="background:white"> --- ### Evaluation We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.<br> Do relationships among vectors reflect biology? <div class="fragment"> <img src="/_modules/regionset2vec/method_overview_v3.svg" width="1040" style="background:white"> </div> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result.svg" width="740"> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result-2.svg" width="740"> --- ### Evaluation 1: Classification performance <img src="/_modules/regionset2vec/umap_classification.svg" width="740" style="background:white"> <div class="fragment"> <img src="/_modules/regionset2vec/umap_classification2.svg" width="740" style="background:white"> </div> --- ### Conclusion <ul> <li>Regionset2vec adapts word2vec to learn genomic region embeddings</li> <li>Regionset2vec embeddings capture biological information</li> <li>NLP approaches can be adapted for applications in genomic interval analysis</li> </ul> --- <style> .wrap { width: 1550px; height: 1000px; overflow: hidden; } iframe { width: 97% !important; height: 85% !important; -webkit-transform: scale(0.65); transform: scale(0.65); 
-webkit-transform-origin: 0 0;
transform-origin: 0 0;
}
</style>

<div>
<h3>BEDbase</h3>
A high-performance server and API <br>
for genomic interval data.
<br><span class="small bullet"><img src="/_modules/bedbase/web.svg" height="30" class="bullet"><a href="https://bedbase.org">bedbase.org</a></span>
</div>

---

<div>
<img src="/_modules/bedbase/bedbase_logo.svg" width="275">
<br>
A high-performance server and API <br>
for genomic interval data.
<div class="small">
<a href="http://bedbase.org">http://bedbase.org</a><br>
</div>
</div>

<ul>
<li>Data spans projects (<i>e.g.</i> all data on GEO; 40,000 accessions, 100,000+ BED files)</li>
<li>Programmatic API for metadata, statistics, and data chunks</li>
<li>Human browsing of statistical and biological attributes</li>
<li>Aware of similarities among BED files</li>
<li>Human-friendly search</li>
<li>Shaped into 'non-redundant' sets for analysis</li>
</ul>

---

### BEDbase goals

- Human browsing of statistical and biological attributes
- Human-friendly, *intelligent* search
- Programmatic API for metadata, statistics, and data chunks
- Integrative analytical results
- Data spans projects (all data on GEO)

---

### BEDbase architecture

<img src="/_modules/bedbase/architecture_v2.svg" width="1040" style="background:white">

---

<a href="https://doi.org/10.1038/s41597-022-01619-5"><img src="/_modules/bedbase/sheffield2022.png" style="background:white"></a>
<br>BEDbase is a microservice for data interoperability,<br>
not another cloud platform

---

<ol>
<li style="color:yellow">Web interface (front-end)</li>
<li style="">Clients (front-end)</li>
<li style="">API (back-end)</li>
<li style="">Database and files (back-end/infrastructure)</li>
<li style="">Processing pipelines (infrastructure)</li>
<li style="">Data served (content)</li>
</ol>

---

### Human browsing of BED file splash pages

<span class="small"><a 
href="https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1">https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1</a></span> <div class="wrap"> <iframe src="https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen ></iframe> </div> --- ### Reference genome compatibility <img src="/_modules/bedbase/ref_genome_scores_1.svg" height="530" style="background:white"> --- ### Reference genome compatibility <img src="/_modules/bedbase/ref_genome_scores.svg" height="530" style="background:white"> --- ### Reference genome compatibility <div class="col2"> <img src="/_modules/bedbase/ref_genome_scores.svg" height="400" style="background:white"> </div> <div class="col2"> <img src="/_modules/bedbase/bedbase-ref-genome-compat.png" height="530" style="background:white"> </div> --- ### BEDsets allow comparison of BED files <span class="small"><a href="https://bedbase.org/bedset/gse246900">https://bedbase.org/bedset/gse246900</a></span> <div class="wrap"> <iframe src="https://bedbase.org/bedset/gse246900" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen ></iframe> </div> --- ### Human-friendly search <span class="small">Co-embedded metadata and region sets: <a href="https://bedbase.org/search?q=brain">https://bedbase.org/search?q=brain</a></span> <div class="wrap"> <iframe src="https://bedbase.org/search?q=brain" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div> --- ### Human-friendly search <img src="/_modules/bedbase/text_search_1.svg" height="530" style="background:white"> --- ### Human-friendly search <img src="/_modules/bedbase/text_search.svg" width="1040" style="background:white"> --- ### Search by BED file <span class="small">Co-embedded metadata and region sets: <a href="https://bedbase.org/search?view=b2b">https://bedbase.org/search?view=b2b</a></span> <div class="wrap"> <iframe src="https://bedbase.org/search?view=b2b" 
frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
</div>

---

<ol>
<li style="">Web interface (front-end)</li>
<li style="color:yellow">Clients (front-end)</li>
<li style="">API (back-end)</li>
<li style="">Database and files (back-end/infrastructure)</li>
<li style="">Processing pipelines (infrastructure)</li>
<li style="">Data served (content)</li>
</ol>

---

# BEDbase R client

<img src="/_modules/bedbase/bedbaser.png" width="100%"/>

---

```
library("bedbaser")

# Create a BEDbase client that caches downloads in a temporary directory
bedbase <- BEDbase(tempdir())

# Fetch a BED file by its identifier and import it as a GRanges object
bb_to_granges(bedbase, "ab446df9a043222067863cfd536ee8e0")
```

```
GRanges object with 37 ranges and 5 metadata columns:
       seqnames            ranges strand |                   name     score
          [Rle]         [IRanges]  [Rle] |            [character] [integer]
   [1]    chr17 38083473-38083801      * | O-8A-H3K27ac_peak_11..        21
   [2]    chr17 38108871-38110066      * | O-8A-H3K27ac_peak_11..        25
   [3]    chr17 38137142-38137795      * | O-8A-H3K27ac_peak_11..        33
   [4]    chr17 38210828-38211063      * | O-8A-H3K27ac_peak_11..        17
   [5]    chr17 38218030-38220186      * | O-8A-H3K27ac_peak_11..        19
   ...      ...               ...    ... .                    ...       ...
  [33]    chr17 38603620-38604355      * | O-8A-H3K27ac_peak_11..        15
  [34]    chr17 38647047-38648053      * | O-8A-H3K27ac_peak_11..        28
  [35]    chr17 38708445-38710424      * | O-8A-H3K27ac_peak_11..        20
  [36]    chr17 38716283-38717201      * | O-8A-H3K27ac_peak_11..        23
  [37]    chr17 38803702-38804538      * | O-8A-H3K27ac_peak_11..        46
            field8      field9     field10
       [character] [character] [character]
   [1]     3.91024     4.82245     2.12195
   [2]     4.44410     5.31600     2.51785
   [3]     4.30183     6.25865     3.33754
   [4]     3.94862     4.34253     1.75554
   [5]     3.74929     4.57115     1.93712
   ...         ...         ...         ...
  [33]     3.80116     4.07741     1.55623
  [34]     4.35182     5.72967     2.89524
  [35]     3.88836     4.73101     2.06784
  [36]     4.24210     5.08399     2.33889
  [37]     5.09106     7.93609     4.68279
  -------
```

---

# BEDbase Python client

<img src="/_modules/bedbase/bbclient.png" width="100%"/>

---

<ol>
<li style="">Web interface (front-end)</li>
<li style="">Clients (front-end)</li>
<li style="color:yellow">API (back-end)</li>
<li style="">Database and files (back-end/infrastructure)</li>
<li style="">Processing pipelines (infrastructure)</li>
<li style="">Data served (content)</li>
</ol>

---

## bedhost

A FastAPI application following JAMstack philosophy.<br>
<img src="/_modules/bedbase/jamstack.svg" width="100%">

<span class="fragment">JAMstack forces you to build a comprehensive API. </span>

---

### OpenAPI interface

<span class="small"><a href="https://api.bedbase.org/v1/docs">https://api.bedbase.org/v1/docs</a></span>

<div class="wrap">
<iframe src="https://api.bedbase.org/v1/docs" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="background: #FFFFFF;"></iframe>
</div>

---

### BED info via API

<span class="small"><a href="https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true">https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true</a></span>

<div class="wrap">
<iframe src="https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="background: #FFFFFF;"></iframe>
</div>

---

<ol>
<li style="">Web interface (front-end)</li>
<li style="">Clients (front-end)</li>
<li style="">API (back-end)</li>
<li style="color:yellow">Database and files (back-end/infrastructure)</li>
<li style="">Processing pipelines (infrastructure)</li>
<li style="">Data served (content)</li>
</ol>

---

## BEDbase data layer

1. 
BED files stored in Backblaze B2 (S3-compatible object store)
   - BED files n=21,438 (stats from 2025-05)
   - 346,071 total objects, 186.4 GB ($6/TB/month)
2. B2 interface is routed through the Cloudflare CDN (free egress!)
3. File metadata stored in a PostgreSQL database on the AWS-managed Relational Database Service

## → Minimal maintenance cost

---

<ol>
<li style="">Web interface (front-end)</li>
<li style="">Clients (front-end)</li>
<li style="">API (back-end)</li>
<li style="">Database and files (back-end/infrastructure)</li>
<li style="color:yellow">Processing pipelines (infrastructure)</li>
<li style="">Data served (content)</li>
</ol>

---

- `bbconf`: BEDbase configuration object, connection to the database
- `bedqc`: a pipeline for QC of BED files
- `bedmaker`: a pipeline to convert non-BED files into BED files
- `bedstat`: a pipeline to calculate statistics for a BED file
- `bedbuncher`: a pipeline to create BEDsets
- `bedembed`: a pipeline to create BED file embeddings

---

<img src="/_modules/genomicdistributions/genomic_distributions_dark.svg" width="275" style="padding-top:5px; padding-bottom:5px">

<div class="small">
Docs: <a href="http://code.databio.org/GenomicDistributions/">http://code.databio.org/GenomicDistributions/</a><br>
Code: <a href="http://github.com/databio/GenomicDistributions/">http://github.com/databio/GenomicDistributions/</a><br>
<div class="bullet">
<a href="https://bioconductor.org/packages/GenomicDistributions"><img src="/_modules/genomicdistributions/bioconductor_logo_grey.svg" height="22" style="padding-top:5px; padding-bottom:5px"> bioconductor.org/packages/GenomicDistributions</a><br>
</div>
</div>

<span class="small bullet"><img src="/_modules/genomicdistributions/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1186/s12864-022-08467-y">Kupkova et al. 
(2022).</a> <i>BMC Genomics</i>.</span><br/>

---

<img src="/_modules/genomicdistributions/genomicdistributions_summary.svg" width="100%">

---

<ol>
<li style="">Web interface (front-end)</li>
<li style="">Clients (front-end)</li>
<li style="">API (back-end)</li>
<li style="">Database and files (back-end/infrastructure)</li>
<li style="">Processing pipelines (infrastructure)</li>
<li style="color:yellow">Data served (content)</li>
</ol>

---

<div>
<img src="/_modules/geofetch/geofetch_logo.svg" width="275"><br>
Connects the Gene Expression Omnibus (GEO) <br>
and Sequence Read Archive (SRA) <br>
with PEP format<br>
</div>

<br><span class="small bullet"><img src="/_modules/geofetch/web.svg" height="25" class="bullet"><a href="https://geofetch.databio.org">geofetch.databio.org</a></span>

---

```
geofetch --filter="bed|bigBed|narrowPeak|broadPeak"
```

---

### Conclusion

<ul>
<li>BEDbase provides BED data for humans and machines</li>
<li>Output includes statistical and biological visualization</li>
<li>Upcoming human-friendly search is powerful</li>
<li>Programmatic access to data chunks improves interoperability</li>
</ul>

---

<section id="acknowledgements" data-background="/images/bg.svg.png">

## Thank You

<div class="col3" style="font-size:.6em">
<b>Collaborators</b>
<br>Aakrosh Ratan
<br>Aidong Zhang
<br>Guangtao Zheng
<br>Don Brown
<br>Hyun Jae Cho
<br>Vince Carey
<br>Mikhail Dozmorov
<br><br>
<b>Alumni</b>
<br>Aaron Gu
<br>Jianglin Feng
<br>Ognen Duzlevski
<br>Tessa Danehy
</div>

<div class="col3" style="font-size:.6em">
<b>Sheffield lab</b>
<br>Erfaneh Gharavi
<br>Michal Stolarczyk
<br>John Lawson
<br>Jason Smith
<br>Kristyna Kupkova
<br>John Stubbs
<br>Bingjie Xue
<br>Jose Verdezoto
<br>Nathan LeRoy
<br>Oleksandr Khoroshevskyi
</div>

<div class="col3" style="font-size:.6em">
<b>Funding:</b><br>
<br><img src="/slides/genomic-intervals/logo/University_of_Virginia_Rotunda_logo.svg" height="40"><img 
src="/slides/genomic-intervals/logo/University_of_Virginia_logo_white.svg" height="40"> <br><img src="/slides/genomic-intervals/logo/NIH_logo_black.svg" height="80"><br>NIGMS R35-GM128636 </div> <br clear="all"/> <span class="small bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span><br> <br> <div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9"><img src="/slides/genomic-intervals/logo/uva_dgs_logo.svg" height="65"><img src="/slides/genomic-intervals/logo/logo_databio_long.svg" height="45"></div> </section>