<style> #title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow"> # Representation learning of the epigenome Nathan Sheffield, PhD <div class="bullet"> <img src="/images/external/uva_dgs_logo.svg" height="85"> <img src="/images/logo/logo_databio_long.svg" height="65"> </div> <span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span> </section> --- <!-- .slide: data-background="/images/presentations/bg.svg.png" data-transition-speed="slow" --> ### Outline <style> .previewblock { float: left; width: 20px; height: 45px; margin: 0; border: none; white-space: nowrap; box-sizing: border-box; } .questionblock { float: left; width: 100%; margin: 5px 0; border: 1px solid rgba(255, 255, 255, .2); } </style> <div class="previewblock" style="width:10%">Lab intro</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:60%">Genomic interval embeddings</div> <div class="previewblock" style="width:20%"></div> <div class="previewblock" style="width:10%">|</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:60%">|</div> <div class="previewblock" style="width:20%"></div> <br clear="all"> <div class="previewblock" style="width:10%; background:#883388">10%</div> <div class="previewblock" style="width:10%; background:#333388">10%</div> <div class="previewblock" style="width:60%; background:#338833">60%</div> <div class="previewblock" style="width:20%; background:#338888">20%</div> <div class="previewblock" style="width:10%"></div> <div class="previewblock" style="width:10%">|</div> <div class="previewblock" style="width:60%"></div> <div class="previewblock" style="width:20%">|</div> <br clear="all"> <div class="previewblock" style="width:10%"></div> <div class="previewblock" 
style="width:10%">Background</div> <div class="previewblock" style="width:60%"></div> <div class="previewblock" style="width:20%">BEDbase</div> <div class="questionblock" style="background:#222; color:#eee; font-size: 0.6em; margin-top: 35px">◁ Questions ▷</div> --- ## Full-stack bioinformatics <img src="/_modules/full-stack-bioinformatics-teaser/pyramid_blue.svg" width="500"> --- <div class="col2"> <img src="/_modules/refgenie-teaser/refgenie_logo_light.svg" style="padding-top:25px; padding-bottom:25px; width: 350px"> <br> Reference genome manager <div class="small"> <a href="http://refgenie.databio.org">http://refgenie.databio.org</a><br> </div> <span class="small bullet"><img src="/_modules/refgenie-teaser/paper.svg" height="25" class="bullet"><a href="https://www.biorxiv.org/content/10.1093/gigascience/giz149">Stolarczyk et al. (2020).</a> <i>GigaScience</i>.</span><br/> <span class="small bullet"><img src="/_modules/refgenie-teaser/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021).</a> <i>NAR Genomics and Bioinformatics</i>.</span><br/> </div> <div class="col2"> <img src="/_modules/refgenie-teaser/refgenie_interfaces.svg" style="background:white" width="550"> <pre><code>refgenie pull hg38/bowtie2_index</code></pre> </div> --- ## GA4GH refget sequence collections standard <img src="/slides/representation-learning-epigenome/SeqCol-graphic.png" width="800"><br/> --- <span class="small bullet"> <img src="/_modules/pephub-teaser/pephub_logo_white.svg" style="width: 250px" class="bullet"> Sample metadata API</span> <img src="/_modules/pephub-teaser/pephub-center.svg" width="750"> <div class="small"> <a href="https://pephub.databio.org">https://pephub.databio.org</a><br> </div> --- <img src="/_modules/bio-motivation-genomic-intervals/genomic-intervals.svg" width="100%"><br/> <span class="fragment">A universal language of computational genomics</span> --- ## There are many sources of genomic 
interval data  --- ## Genomic interval data is growing  --- <img src="/slides/representation-learning-epigenome/self-organizing-map.svg" width="100%"> <br><span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://dx.doi.org/10.1101/gr.152140.112">Sheffield et al. (2013). <i>Genome Research</i>.</a></span> --- ## Collections of similar genomic intervals ## can be very powerful --- <img src="/_modules/lola-intro/LOLA-logo-white.svg" width="275" style="padding-top:25px; padding-bottom:25px"> <br> ### Locus Overlap Analysis <div class="small"> <a href="http://code.databio.org/LOLA/">http://code.databio.org/LOLA/</a><br> </div> <span class="small bullet"><a href="https://doi.org/10.1093/bioinformatics/btv612"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet">Sheffield and Bock (2016). <i>Bioinformatics</i>.</a></span><br/> <span class="small bullet"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nar/gky464">Nagraj, Magee, and Sheffield (2018). <i>Nucleic Acids Research</i>.</a></span> --- <img src="/_modules/lola-intro/short-01-challenge.svg" /> --- <img src="/_modules/lola-intro/08-lola2.svg" /> --- <img src="/_modules/lola-intro/09-test.svg" /> --- ### LOLA requires comparing sets of intervals <img src="/slides/representation-learning-epigenome/subject-query.svg"> <img src="/slides/representation-learning-epigenome/geo-count-total-v2.svg" height="250"> <br>Can we improve the efficiency to enable faster,<br>larger-scale analysis? 
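--- ### Aside: why overlap counting gets expensive A minimal sketch with hypothetical toy intervals (this is an illustration, not AIList's or IGD's algorithm): brute-force counting compares every query interval against every database interval, so work grows as n × m — untenable across hundreds of thousands of BED files.

```python
# Brute-force overlap counting: O(n * m) comparisons.
# Two half-open intervals (start, end) overlap iff each starts before the other ends.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def count_overlaps(query, database):
    return sum(1 for q in query for d in database if overlaps(q, d))

query = [(10, 20), (30, 40)]
database = [(15, 25), (35, 45), (50, 60)]
print(count_overlaps(query, database))  # 2
```

Structures like AIList and IGD replace the inner scan with indexed representations, so each query inspects only candidate intervals.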
--- <div> <h3>Augmented Interval List (AIList)</h3> A novel data structure for efficiently computing overlaps <br> across genomic interval data.<br> <div class="small"> <a href="http://ailist.databio.org/">http://ailist.databio.org/</a><br> </div> </div> <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btz407">Feng et al. (2020). <i>Bioinformatics</i>.</a></span> <br> <div> <h3>Integrated Genome Database (IGD)</h3> A high-performance search engine <br> for large-scale genomic interval datasets. <div class="small"> <a href="https://github.com/databio/IGD">https://github.com/databio/IGD</a><br> </div> </div> <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btaa1062">Feng and Sheffield (2021). <i>Bioinformatics</i>.</a></span> --- ## But is counting overlaps ## really the right approach? --- <img src="/_modules/bio-motivation-region-embeddings/interval-similarity-overlap.svg"> What does it mean for two region sets to be similar?<br> What am I really looking for in a region set query? 
<div class="fragment">Overlaps makes some sense...but what about: <br> degree of overlap?</div> <div class="fragment">biological function of each region?</div> <div class="fragment">weighting of specific regions?</div> <div class="fragment">relationships among regions?</div> <div class="fragment">background context?</div> --- ### The bag-of-words model for text classification <img src="/_modules/bio-motivation-region-embeddings/bag-of-words.svg" width="100%"> <span class="fragment">What about a bag-of-intervals model for genomic intervals?</span> --- ### The bag-of-intervals model for genomic intervals <img src="/_modules/bio-motivation-region-embeddings/interval-universe.svg" width="680"> <ul class="fragment">Advantages <li>Vector representation of a region set</li> <li>Similarity metrics among vectors</li> <li>Lower space and time complexity than interval sets</li> </ul> --- ### Limitations of the bag of words vector approach <ul> <li>Sparsity</li> <li>Curse of dimensionality</li> <li>Space and time complexity are still an issue</li> <li>No concept of relationships among words</li> <code><pre style="color:#AAAAFF; text-align:center"> hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0] motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0] </pre></code> </ul> --- <div> <h3>Region-set 2 Vec</h3> Embeddings of genomic region sets <br> in lower dimensions. <div class="small"> <a href="https://github.com/databio/regionset-embedding">https://github.com/databio/regionset-embedding</a><br> </div> </div> <span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab439">Gharavi et al. (2021). 
<i>Bioinformatics</i>.</a></span> --- ### Word embeddings <img src="/_modules/regionset2vec/word-vector-space-similar-words.jpg" width="680"> <div class="small">http://suriyadeepan.github.io</div> --- ### Word2vec model <img src="/_modules/regionset2vec/mikolov2013_fig1.png" width="680"> <br><span class="small bullet"><img src="/_modules/regionset2vec/paper.svg" height="25" class="bullet"><a href="https://arxiv.org/abs/1301.3781">Mikolov et al. (2013). <i>arXiv:1301.3781v3</i>.</a></span> --- ### Word context <img src="/_modules/regionset2vec/word-context.png" width="640" style="background:white"> <div class="well"> You shall know a word by the company it keeps. (Firth 1957)<br> Words that occur in similar contexts tend to have similar meanings. </div> <div class="small">Image credit: Shubham Agarwal</div> --- ### Genomic context <div class="well"> A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function. </div> --- <img src="/_modules/regionset2vec/complexity-scale.svg" width="1040"> --- ### Genomic Interval Embeddings <img src="/_modules/regionset2vec/method_detail_v3.svg" width="1040" style="background:white"> --- ### Evaluation We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.<br> Do relationships among vectors reflect biology? 
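One way to ask that, sketched with hypothetical toy vectors: cosine similarity between embeddings should be higher for biologically similar region sets.

```python
# Cosine similarity between embedding vectors (hypothetical toy 3-D examples).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

tcell_peaks = [0.9, 0.1, 0.0]
bcell_peaks = [0.8, 0.2, 0.1]
liver_peaks = [0.0, 0.1, 0.9]
print(cosine(tcell_peaks, bcell_peaks) > cosine(tcell_peaks, liver_peaks))  # True
```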
<div class="fragment"> <img src="/_modules/regionset2vec/method_overview_v3.svg" width="1040" style="background:white"> </div> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result.svg" width="740"> --- ## Evaluation 1: Classification performance <img src="/_modules/regionset2vec/evaluation-classification-result-2.svg" width="740"> --- ### Evaluation 1: Classification performance <img src="/_modules/regionset2vec/umap_classification.svg" width="740" style="background:white"> <div class="fragment"> <img src="/_modules/regionset2vec/umap_classification2.svg" width="740" style="background:white"> </div> --- ### Conclusion <ul> <li>Regionset2vec adapts word2vec to learn genomic region embeddings</li> <li>Regionset2vec embeddings capture biological information</li> <li>NLP approaches can be adapted for applications in genomic interval analysis</li> </ul> --- ## Region embeddings are highly tunable - Universe selection - Tokenization - Model architecture - Extent of training - Context window size - Learning rate ## This opens lots of new questions <br><span class="fragment">Can we do this with single-cell data?</span> <br><span class="fragment">How can we evaluate region embeddings?</span> <br><span class="fragment">How can we make the best embeddings?</span> <br><span class="fragment">How do we choose a good universe?</span> <br><span class="fragment">Are there better model architectures or training tasks?</span> <br><span class="fragment">Can we increase or change the data source?</span> --- Can we do this with single-cell data? 
--- <div> <h3>scEmbed</h3> Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings <div class="small"> <a href="https://github.com/databio/geniml">https://github.com/databio/geniml</a><br> </div> </div> <span class="small bullet"><img src="/_modules/scembed/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqae073">LeRoy et al. (2024). <i>NAR Genomics and Bioinformatics</i>.</a></span> --- ### scEmbed training pipeline <img src="/_modules/scembed/overview.svg" width="800"> --- ### Region embeddings can produce cell embeddings <img src="/_modules/scembed/region-to-cell-embeddings.svg" width="800"> --- ### scEmbed benchmarks competitively <img src="/_modules/scembed/benchmark.svg" width="800"> --- ### scEmbed is robust to severe data loss <img src="/_modules/scembed/benchmark-dropout.svg" width="800"> --- ### scEmbed uses a unique two-step approach <img src="/_modules/scembed/two-step.svg" width="800"> --- ### Projection embeds new data with a pre-trained model <img src="/_modules/scembed/projection-concept.svg" width="800"> --- ### Projected embeddings look similar <br>to trained embeddings <img src="/_modules/scembed/projection-example.svg" width="800"> --- ### Projection enables multiple data flows <img src="/_modules/scembed/projection-flows.svg" width="800"> --- ### Embedding-projection gives nice cell clusters <img src="/_modules/scembed/projection-leucken.svg" width="800"> --- ### EV-projection places new data <br>in a pre-trained latent space <img src="/_modules/scembed/projection-leucken2.svg" width="800"> --- ### EV-projection allows for high-accuracy cell annotation <img src="/_modules/scembed/projection-annotation.svg" width="800"> --- ## Where we're going with scEmbed <div class="col2"> <h3>Atlas-scale trained model</h3> <img src="/slides/representation-learning-epigenome/atlas_umap.png" width="550"> </div> <div class="col2 fragment"> <h3>Cross-cell-type accessibility</h3> <img
src="/slides/representation-learning-epigenome/atlas_cell_type_accessibility.png" width="550"> </div> --- How can we evaluate region embeddings? --- ## Methods for evaluating unsupervised <br/> vector representations of genomic regions - Cluster Tendancy Score (CTS) - Reconstruction Score (RCS) - Neighborhood preserving score (NPS) - Genome distance scaling score (GDSS) <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqae086">Zheng et al. (2024). <i>NAR Genomics and Bioinformatics</i>.</a></span> --- ## Cluster tendancy score (CTS) <div class="well"> Assumption:<br> Embeddings that form clusters are more likely to be useful than embeddings that are diffuse. </div> <img src="/slides/representation-learning-epigenome/cluster-tendancy-score.png" width="800"> --- ## Reconstruction score (RCS) <div class="well"> Measures how well the embeddings can be used to reconstruct the full-dimensional input data. 
</div> <img src="/slides/representation-learning-epigenome/reconstruction-score.png" width="800"> --- ## Genome distance scaling score (GDSS) <div class="well"> Assumption:<br> Proximity in linear genome space <br>increases probability of similar function </div> <div class="col2"> <img src="/slides/representation-learning-epigenome/grouped-average-distances-schematic.svg" width="400"> </div> <div class="col2 fragment"> Results <img src="/slides/representation-learning-epigenome/grouped-average-distances-results.svg" width="400"> </div> --- ## Neighborhood preserving score (NPS) <div class="well"> Assumption:<br> Proximity in linear genome space <br>increases probability of similar function </div> <div class="col2"> <img src="/slides/representation-learning-epigenome/neighborhood-preserving-schematic.svg" width="400"> </div> <div class="col2 fragment"> Results <img src="/slides/representation-learning-epigenome/neighborhood-preserving-results.svg" width="400"> </div> --- ## Embedding eval results <img src="/slides/representation-learning-epigenome/eval-results-1.jpeg" width="800"><br/> --- ## Embedding eval results <img src="/slides/representation-learning-epigenome/eval-results-2.jpeg" width="800"><br/> --- How do we choose a good universe? --- ## Methods for constructing and evaluating <br/> consensus genomic interval sets <img src="/slides/representation-learning-epigenome/abstract.jpeg" width="800"><br/> <span class="small bullet"><img src="/slides/representation-learning-epigenome/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nar/gkae685">Rymuza et al. (2024). <i>Nucleic Acids Research</i>.</a></span> --- Are there better model architectures or training tasks? 
--- Joint representation learning <br> of genomic interval sets and metadata <img src="/slides/representation-learning-epigenome/scenario-1.svg" width="800"> <div class="fragment"> <img src="/slides/representation-learning-epigenome/method-overview.svg" width="800"> </div> --- <img src="/slides/representation-learning-epigenome/starspace-embedding-distances.svg" width="100%"> --- ## Atacformer model <img src="/slides/representation-learning-epigenome/atacformer-training.png" height="550"> --- ## Atacformer preliminary results <img src="/slides/representation-learning-epigenome/atacformer-preliminary.png" height="600"> --- ## Neural networks are like pigs. ## If you want them to be useful, ## you have to feed them a lot. --- Can we increase or change the data source? --- <style> .wrap { width: 1550px; height: 1000px; overflow: hidden; } iframe { width: 97% !important; height: 85% !important; -webkit-transform: scale(0.65); transform: scale(0.65); -webkit-transform-origin: 0 0; transform-origin: 0 0; } </style> <div> <h3>BEDbase</h3> A high-performance server and API <br> for genomic interval data. <br><span class="small bullet"><img src="/_modules/bedbase/web.svg" height="30" class="bullet"><a href="https://bedbase.org">bedbase.org</a></span> </div> --- <div> <img src="/_modules/bedbase/bedbase_logo.svg" width="275"> <br> A high-performance server and API <br> for genomic interval data. 
<div class="small"> <a href="http://bedbase.org">http://bedbase.org</a><br> </div> </div> <ul> <li>Data spans projects (*e.g.* all data on GEO; 40,000 accessions, 100,000+ BED files)</li> <li>Programmatic API for metadata, statistics, and data chunks</li> <li>Human browsing of statistical and biological attributes</li> <li>Aware of similarities among BED files</li> <li>Human-friendly search</li> <li>Shaped into 'non-redundant' sets for analysis</li> </ul> --- ### BEDbase goals - Human browsing of statistical and biological attributes - Human-friendly, *intelligent* search - Programmatic API for metadata, statistics, and data chunks - Integrative analytical results - Data spans projects (all data on GEO) --- ### BEDbase architecture <img src="/_modules/bedbase/architecture_v2.svg" width="1040" style="background:white"> --- <a href="https://doi.org/10.1038/s41597-022-01619-5"><img src="/_modules/bedbase/sheffield2022.png" style="background:white"></a> <br>BEDbase is a microservice for data interoperability,<br> not another cloud platform --- <ol> <li style="color:yellow">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- ### Human browsing of BED file splash pages <span class="small"><a href="https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1">https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1</a></span> <div class="wrap"> <iframe src="https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen ></iframe> </div> --- ### Reference genome compatibility <img src="/_modules/bedbase/ref_genome_scores_1.svg" height="530" style="background:white"> --- ### Reference genome compatibility <img src="/_modules/bedbase/ref_genome_scores.svg" height="530" style="background:white"> 
--- ### Reference genome compatibility <div class="col2"> <img src="/_modules/bedbase/ref_genome_scores.svg" height="400" style="background:white"> </div> <div class="col2"> <img src="/_modules/bedbase/bedbase-ref-genome-compat.png" height="530" style="background:white"> </div> --- ### BEDsets allow comparison of BED files <span class="small"><a href="https://bedbase.org/bedset/gse246900">https://bedbase.org/bedset/gse246900</a></span> <div class="wrap"> <iframe src="https://bedbase.org/bedset/gse246900" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen ></iframe> </div> --- ### Human-friendly search <span class="small">Co-embedded metadata and region sets: <a href="https://bedbase.org/search?q=brain">https://bedbase.org/search?q=brain</a></span> <div class="wrap"> <iframe src="https://bedbase.org/search?q=brain" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div> --- ### Human-friendly search <img src="/_modules/bedbase/text_search_1.svg" height="530" style="background:white"> --- ### Human-friendly search <img src="/_modules/bedbase/text_search.svg" width="1040" style="background:white"> --- ### Search by BED file <span class="small">Co-embedded metadata and region sets: <a href="https://bedbase.org/search?view=b2b">https://bedbase.org/search?view=b2b</a></span> <div class="wrap"> <iframe src="https://bedbase.org/search?view=b2b" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div> --- <ol> <li style="">Web interface (front-end)</li> <li style="color:yellow">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- # BEDbase R client <img src="/_modules/bedbase/bedbaser.png" width="100%"/> --- ``` library("bedbaser") bedbase <- BEDbase(tempdir()) bb_to_granges(bedbase, 
"ab446df9a043222067863cfd536ee8e0") ``` ``` GRanges object with 37 ranges and 5 metadata columns: seqnames ranges strand | name score [Rle] [IRanges] [Rle] | [character] [integer] [1] chr17 38083473-38083801 * | O-8A-H3K27ac_peak_11.. 21 [2] chr17 38108871-38110066 * | O-8A-H3K27ac_peak_11.. 25 [3] chr17 38137142-38137795 * | O-8A-H3K27ac_peak_11.. 33 [4] chr17 38210828-38211063 * | O-8A-H3K27ac_peak_11.. 17 [5] chr17 38218030-38220186 * | O-8A-H3K27ac_peak_11.. 19 ... ... ... ... . ... ... [33] chr17 38603620-38604355 * | O-8A-H3K27ac_peak_11.. 15 [34] chr17 38647047-38648053 * | O-8A-H3K27ac_peak_11.. 28 [35] chr17 38708445-38710424 * | O-8A-H3K27ac_peak_11.. 20 [36] chr17 38716283-38717201 * | O-8A-H3K27ac_peak_11.. 23 [37] chr17 38803702-38804538 * | O-8A-H3K27ac_peak_11.. 46 field8 field9 field10 [character] [character] [character] [1] 3.91024 4.82245 2.12195 [2] 4.44410 5.31600 2.51785 [3] 4.30183 6.25865 3.33754 [4] 3.94862 4.34253 1.75554 [5] 3.74929 4.57115 1.93712 ... ... ... ... [33] 3.80116 4.07741 1.55623 [34] 4.35182 5.72967 2.89524 [35] 3.88836 4.73101 2.06784 [36] 4.24210 5.08399 2.33889 [37] 5.09106 7.93609 4.68279 ------- ``` --- # BEDbase Python client <img src="/_modules/bedbase/bbclient.png" width="100%"/> --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="color:yellow">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- ## bedhost A FastAPI application following JAMstack philosophy.<br> <img src="/_modules/bedbase/jamstack.svg" width="100%"> <span class="fragment">JAMstack forces you to build a comprehensive API. 
</span> --- ### OpenAPI interface <span class="small"><a href="https://api.bedbase.org/v1/docs">https://api.bedbase.org/v1/docs</a></span> <div class="wrap"> <iframe src="https://api.bedbase.org/v1/docs" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="background: #FFFFFF;"></iframe> </div> --- ### BED info via API <span class="small"><a href="https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true">https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true</a></span> <div class="wrap"> <iframe src="https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="background: #FFFFFF;"></iframe> </div> --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="color:yellow">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- ## BEDbase data layer 1. BED files stored in Backblaze B2 (S3-compatible object store) - BED files n=21,438 (stats from 2025-05) - 346,071 total objects, 186.4 GB ($6/TB/month) 2. The B2 interface is routed through the Cloudflare CDN (free egress!) 3. File metadata stored in a PostgreSQL database on AWS's managed Relational Database Service ## → Minimal maintenance cost --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="color:yellow">Processing pipelines (infrastructure)</li> <li style="">Data served (content)</li> </ol> --- - `bbconf`: BEDbase configuration object and database connection - `bedqc`: a pipeline for QC of BED files.
- `bedmaker`: a pipeline to convert non-BED files into BED files - `bedstat`: a pipeline to calculate stats for a BED file - `bedbuncher`: a pipeline to create BEDsets - `bedembed`: a pipeline to create BED file embeddings --- <img src="/_modules/genomicdistributions/genomic_distributions_dark.svg" width="275" style="padding-top:5px; padding-bottom:5px"> <div class="small"> Docs: <a href="http://code.databio.org/GenomicDistributions/">http://code.databio.org/GenomicDistributions/</a><br> Code: <a href="http://github.com/databio/GenomicDistributions/">http://github.com/databio/GenomicDistributions/</a><br> <div class="bullet"> <a href="https://bioconductor.org/packages/GenomicDistributions"><img src="/_modules/genomicdistributions/bioconductor_logo_grey.svg" height="22" style="padding-top:5px; padding-bottom:5px"> bioconductor.org/packages/GenomicDistributions</a><br> </div> </div> <span class="small bullet"><img src="/_modules/genomicdistributions/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1186/s12864-022-08467-y">Kupkova et al.
(2022).</a> <i>BMC Genomics</i>.</span><br/> --- <img src="/_modules/genomicdistributions/genomicdistributions_summary.svg" width="100%"> --- <ol> <li style="">Web interface (front-end)</li> <li style="">Clients (front-end)</li> <li style="">API (back-end)</li> <li style="">Database and files (back-end/infrastructure)</li> <li style="">Processing pipelines (infrastructure)</li> <li style="color:yellow">Data served (content)</li> </ol> --- <div> <img src="/_modules/geofetch/geofetch_logo.svg" width="275"><br> Connects the Gene Expression Omnibus (GEO) <br> and Sequence Read Archive (SRA) <br> with PEP format<br> </div> <br><span class="small bullet"><img src="/_modules/geofetch/web.svg" height="25" class="bullet"><a href="https://geofetch.databio.org">geofetch.databio.org</a></span> --- ``` geofetch --filter="bed|bigBed|narrowPeak|broadPeak" ``` --- ### Conclusion <ul> <li>BEDbase provides BED data for humans and machines</li> <li>Output includes statistical and biological visualizations</li> <li>Upcoming human-friendly search is powerful</li> <li>Programmatic access to data chunks improves interoperability</li> </ul> --- ## Take-home points - Analysis of genomic intervals is fruitful and interesting - NLP methods are the future of genomic interval analysis - BEDbase can become the 'one-stop-shop' for interval data - Neural networks are like pigs --- <!-- .slide: id="acknowledgements" data-background="/images/presentations/bg.svg.png" --> ## Thank You <img src="/images/people/group/2022-06-01.jpg" width="600"> <div class="small bullet"> Collaborators: Aidong Zhang, Don Brown<br> Funding: R01-HG012558; R35-GM128536 </div> <br clear="all"/> <span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><img
src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span> <div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9"> <img src="/images/external/uva_dgs_logo.svg" height="65"> <img src="/images/logo/logo_databio_long.svg" height="45"> </div>