<style>
#title {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# Refgenie and refget

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

<!-- .slide: data-background="/images/presentations/bg.svg.png" data-transition-speed="slow" -->
### Outline

<style>
.previewblock {
  float: left;
  width: 20px;
  height: 45px;
  margin: 0;
  border: none;
  white-space: nowrap;
  box-sizing: border-box;
}
.questionblock {
  float: left;
  width: 100%;
  margin: 5px 0;
  border: 1px solid rgba(255, 255, 255, .2);
}
</style>

<div class="previewblock" style="width:25%">Motivation</div>
<div class="previewblock" style="width:25%">Refgenie</div>
<div class="previewblock" style="width:30%">Refget & checksums</div>
<div class="previewblock" style="width:20%">Resources</div>
<div class="previewblock" style="width:25%">|</div>
<div class="previewblock" style="width:25%">|</div>
<div class="previewblock" style="width:30%">|</div>
<div class="previewblock" style="width:20%">|</div>
<br clear="all">
<div class="previewblock" style="width:25%; background:#883388">25%</div>
<div class="previewblock" style="width:25%; background:#338833">25%</div>
<div class="previewblock" style="width:30%; background:#338888">30%</div>
<div class="previewblock" style="width:20%; background:#883333">20%</div>
<div class="previewblock" style="width:25%"></div>
<div class="previewblock" style="width:25%"></div>
<div class="previewblock" style="width:30%"></div>
<div class="previewblock" style="width:20%"></div>
<br clear="all">
<div class="previewblock" style="width:25%"></div>
<div class="previewblock"
style="width:25%"></div>
<div class="previewblock" style="width:30%"></div>
<div class="previewblock" style="width:20%"></div>

<div class="questionblock" style="background:#222; color:#eee; font-size: 0.6em; margin-top: 35px">◁ Questions ▷</div>

---

# Refgenie and refget

---

## The problem

Many tools require genome-related assets (like indexes). How should we organize these on disk?

<img src="/shorts/short-refgenie/refgenie/folder_structures.svg" style="background:white" width="600">

---

## A standard organization simplifies the tool interface

With a standard organization:

```
pipeline.py --genome hg38
```

Without one:

```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
  --tss_annotation path/to/hg38/tss_annotation.bed \
  --ensembl_anno path/to/hg38/ensembl_v86.gtf
```

---

## Illumina's iGenomes is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.

You download a tarball with a standard structure for your genome of interest, then build tools around that structure.

---

## The 'central repository' approach is limited

- *Not scripted*. No iGenomes for an arbitrary genome/asset.
- *Not modular*. No access to individual assets.
- *Not programmatic*. Can't access data/metadata via API.

---

## Refgenie solves these limitations

- *Two ways to retrieve an asset*.
  - `build` any asset from a recipe
  - `pull` any individual asset from a server
- *Better discoverability*.
  - `list`/`listr` show local and remote assets
  - `refgenieserver` is a browsable web interface and API
- *Managed locations*.
  - `seek` returns the local path to assets
  - `add`/`remove` to manage your own assets

---

## Refgenie consists of 3 components

<img src="/shorts/short-refgenie/refgenie/refgenie_components.svg" style="background:white" height="300">
<img src="/shorts/short-refgenie/refgenie/refgenie_server.svg" style="background:white" height="300">

---

## Refgenie splits tasks between CLI and server

<img src="/shorts/short-refgenie/refgenie/refgenie_interfaces.svg" style="background:white" width="500">

---

## Refgenie CLI example

[http://refgenie.databio.org](http://refgenie.databio.org)

---

## Refgenie implements a refget-like (collection) lookup

You first need the `fasta` asset:

```
refgenie pull -g hg38 -a fasta
```

Then you can use `getseq`:

```
refgenie getseq -g hg38 -l chr1:50000-50400
AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT
```

---

## The build/pull method needs provenance checks

<div class="col2">

### Asset provenance:

<img src="/shorts/short-refgenie/refgenie/asset_provenance.svg" style="background:white" width="300">
</div>
<div class="col2">

### Genome provenance:

<img src="/shorts/short-refgenie/refgenie/genome_provenance.svg" style="background:white" width="300">
</div>

---

## Collection checksums solve genome provenance

<img src="/shorts/short-refgenie/refgenie/collection_checksum.svg" style="background:white" width="800">

---

## Recursive checksums have advantages

<div class="col2">

Allows getting content list only

Preserves chromosome order

Re-uses the checksum function

Duplicates are stored only once

Go one step further for...

</div>
<div class="col2">
<img src="/shorts/short-refgenie/refgenie/checksum_recursion.svg" style="background:white" width="400">
</div>

---

## It keeps going... and going...

<img src="/shorts/short-refgenie/refgenie/checksum_recursion_2.svg" style="background:white" width="800">

---

## Final thoughts

- Implementation of the lookup algorithm: [GitHub gist](https://gist.github.com/nsheff/3bbb96a6876234e758895e4a35c03dc7#file-refget-py-L51-L68)
- If refget hosted collection checksums, then given any genome checksum, I could rebuild the fasta asset for that genome automatically
- Refgenieserver could provide a limited refget database for the genomes it has archived
- Even without a central database, the genome checksums ensure that assets are built from the same base

---

## Resources

- These slides: [databio.org/slides](https://databio.org/slides)
- Refgenie documentation: [refgenie.databio.org](https://refgenie.databio.org)
- Refgenieserver instance: [refgenomes.databio.org](https://refgenomes.databio.org)
- GitHub: [github.com/databio/refgenie](https://github.com/databio/refgenie/)

---

<style>
#acknowledgements {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="acknowledgements" data-background="/images/presentations/bg.svg.png">

# Thank You

<br clear="all"/>

<span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> ·
<span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> ·
<span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

<div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9">
<img src="/images/external/uva_dgs_logo.svg" height="65">
<img
src="/images/logo/logo_databio_long.svg" height="45">
</div>

</section>
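
---

## Backup: recursive checksum sketch

A minimal sketch of the recursive checksum idea from the "Recursive checksums have advantages" slide. This is not the refget specification or Refgenie's implementation. The following are assumptions made for illustration only: MD5 as the checksum function, `name:digest` entries joined by commas, and a plain dict standing in for the database. All function names are hypothetical.

```python
import hashlib


def checksum(text):
    """Checksum function, re-used at every level (MD5 is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def store(value, db):
    """Store a value under its own checksum and return the checksum."""
    digest = checksum(value)
    db[digest] = value  # identical values land on the same key
    return digest


def collection_checksum(named_seqs, db):
    """Checksum an ordered list of (name, sequence) pairs.

    Level 1: each sequence is stored under its own checksum.
    Level 2: the ordered 'name:digest' list is stored under the
    collection checksum. Names must not contain ',' in this sketch.
    """
    entries = [name + ":" + store(seq.upper(), db) for name, seq in named_seqs]
    return store(",".join(entries), db)


def content_list(digest, db):
    """One lookup: names and sequence digests only, no sequences."""
    return [tuple(item.rsplit(":", 1)) for item in db[digest].split(",")]


def rebuild(digest, db):
    """Two lookups: recover the ordered (name, sequence) pairs."""
    return [(name, db[seq_digest]) for name, seq_digest in content_list(digest, db)]


if __name__ == "__main__":
    db = {}
    genome = collection_checksum([("chr1", "ACGT"), ("chr2", "TTGA")], db)
    print(genome)
    print(content_list(genome, db))
    print(rebuild(genome, db))
```

Because the collection string is ordered, reordering chromosomes changes the collection checksum, and because every value is keyed by its own checksum, a sequence shared by several collections is stored only once.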