<style>
#title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; }
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# Choose your own adventure through Sheffield lab research

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

<section id="mission_statement">

# Mission statement

We develop and apply computational methods<br/>
to organize, analyze, and understand large epigenomic data.<br/><br/>

<img src="/_modules/mission-statement/understand.svg" width="695">
<br>
<img src="/_modules/mission-statement/logo_databio.svg" width="195">

</section>

---

# Biological motivation

<div class="col2">
<br><br>
<img src="/_modules/bio-motivation-regulatory-dna/dna_folding_diversity.svg" width="400"><br/>
Cells alter phenotype by using DNA differently.
<br>
</div>
<div class="col2 fragment">
<img src="/_modules/bio-motivation-regulatory-dna/differentation_gone_awry.svg" width="500"><br/>
Breakdowns lead to disease
</div>

---

<!-- .slide: data-background="/images/presentations/bg.svg.png" -->

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_blue.svg" width="500"><br>
</div>

---

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_infrastructure.svg" width="500"><br>
</div>
<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pep_center_white.svg" width="600">
<img src="/_modules/full-stack-bioinformatics/looper_logo.svg" width="350">
<img src="/_modules/full-stack-bioinformatics/pypiper_logo_dark.svg" width="350">
</div>

---

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_databases.svg" width="500"><br>
</div>
<div class="col2">
<br><br>
<img src="/_modules/full-stack-bioinformatics/refgenie_logo_light.svg" width="350"><br>
<img src="/_modules/full-stack-bioinformatics/bedbase_logo.svg" width="550"><br>
</div>

---

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_algorithms.svg" width="500"><br>
</div>
<div class="col2">
Augmented Interval List
<img src="/_modules/full-stack-bioinformatics/ailist_summary.svg" width="450"><br>
</div>

---

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_methods.svg" width="500"><br>
</div>
<div class="col2">
<br>
<img src="/_modules/full-stack-bioinformatics/cocoa-summary.svg" width="600"><br>
<img src="/_modules/full-stack-bioinformatics/LOLA-logo-white.svg" width="400"><br>
</div>

---

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_analysis.svg" width="500"><br>
</div>
<div class="col2">
Analysis of DNA methylation in Ewing sarcoma<br/>
<img
src="/_modules/full-stack-bioinformatics/analysis_ews.svg" height="450"><br>
</div>

---

# Full-stack bioinformatics

<div class="col2">
<img src="/_modules/full-stack-bioinformatics/pyramid_yellow.svg" width="500"><br>
</div>
<div class="col2">
<img src="/_modules/full-stack-bioinformatics/integration.svg" width="500"><br>
</div>

---

<section data-markdown id="nav">

# Recent projects in the Sheffield lab

- Analysis methods for genomic region sets
  - [LOLA](/slides/adventure.html#/LOLA): Enrichment of genomic ranges
  - [MIRA](/slides/adventure.html#/MIRA): DNA methylation inference of regulatory activity
  - [COCOA](/slides/adventure.html#/COCOA): Covariation analysis of epigenetic heterogeneity
  - [PEPATAC](/slides/adventure.html#/PEPATAC): Serial alignments in ATAC-seq data processing
- Scientific computing and metadata management
  - [Refgenie](/slides/adventure.html#/refgenie): Standardizing reference assembly assets
  - [PEP](/slides/adventure.html#/pepkit): A management structure for sample metadata
  - [bulker](/slides/adventure.html#/bulker): Using Docker containers for desktop computing

</section>

---

<section data-markdown>

# Bonus (unpublished) topics

- [bedbase.org](http://bedbase.org)

</section>

---

<img src="/_modules/lola-intro/LOLA-logo-white.svg" width="275" style="padding-top:25px; padding-bottom:25px">
<br>

### Locus Overlap Analysis

<div class="small">
<a href="http://code.databio.org/LOLA/">http://code.databio.org/LOLA/</a><br>
</div>

<span class="small bullet"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet">Sheffield and Bock (2016). <i>Bioinformatics</i>.</span><br/>
<span class="small bullet"><img src="/_modules/lola-intro/paper.svg" height="25" class="bullet">Nagraj, Magee, and Sheffield (2018).
<i>Nucleic Acids Research</i>.</span>

---

## LOLAweb

<img src="/shorts/lola/LOLAweb-logo-white.svg" width="275" style="padding-top:25px; padding-bottom:25px">

A shiny app and server for interactive LOLA analysis.

Public server: [http://lolaweb.databio.org](http://lolaweb.databio.org)

GitHub: [https://github.com/databio/LOLAweb](https://github.com/databio/LOLAweb)

---

### DEMO

<video controls width="800">
<source src="lw.webm" type="video/webm">
Your browser does not support the video tag.
</video>

---

## Methylation-based Inference of Regulatory Activity (MIRA)

<div class="small">
<a href="http://code.databio.org/MIRA/">http://code.databio.org/MIRA/</a><br>
</div>

<span class="small bullet"><img src="/_modules/mira/icons/paper.svg" height="25" class="bullet">Lawson et al. (2018).
<i>Bioinformatics</i>.</span>

---

## MIRA concept

<img src="/_modules/mira/mira.svg" />

---

## MIRA workflow

<img src="/_modules/mira/mira2.svg" />

---

## MIRA analysis

<img src="/_modules/mira/mira3.svg" />

---

## MIRA results: Differential activity

<img src="/_modules/mira/ews_pat/MIRA_result_1.svg" width="700"/>

<span class="small bullet"><img src="/_modules/mira/icons/paper.svg" height="25" class="bullet">Sheffield et al. (2017). <i>Nature Medicine</i>.</span>

---

## MIRA results: Activity scores

<img src="/_modules/mira/ews_pat/MIRA_result_2.svg" />

---

## MIRA results: Enrichment analysis

<img src="/_modules/mira/ews_pat/MIRA_result_3.svg" width="800"/>

---

## Coordinate Covariation Analysis (COCOA)

<img src="/_modules/cocoa/cocoa_logo_light.svg" width="225" style="padding-top:25px; padding-bottom:25px">
<br>

<div class="small">
<a href="http://code.databio.org/COCOA/">http://code.databio.org/COCOA/</a><br>
</div>

<span class="small bullet"><img src="/_modules/genomicdistributions/paper.svg" height="25" class="bullet">Lawson et al. (2020).
<i>Genome Biology</i>.</span>

---

## Acknowledgments

<div class="col3" style="font-size:.6em">
<img src="/slides/cocoa/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/slides/cocoa/University_of_Virginia_logo_white.svg" height="40">

**Collaborators**

- Fran Garrett-Bakelman
- Stefan Bekiranov

</div>
<div class="col3" style="font-size:.6em">

**Sheffield lab**

- **John Lawson**
- **Jason Smith**
- Jianglin Feng
- Michal Stolarczyk
- Kristyna Kupkova
- Aaron Gu
- Jose Verdezoto
- Tessa Danehy

</div>
<div class="col3" style="font-size:.6em">

**Funding:**

<img src="/slides/cocoa/University_of_Virginia_logo_white.svg" height="40">
UVA Cancer Center
<img
src="/slides/cocoa/NIH_logo_black.svg" height="80">
</div>

---

<div>
<img src="/_modules/pepatac/logo_pepatac.svg" width="175" style="padding-top:25px; padding-bottom:25px">
<br>
A robust ATAC-seq pipeline <br>
built on the PEP toolkit
<div class="small">
<a href="http://code.databio.org/PEPATAC">http://code.databio.org/PEPATAC</a><br>
</div>
</div>

---

<img src="/_modules/pepatac/pepatac_workflow_white.svg" width="600">

---

### Comparison

<img src="/_modules/pepatac/computation_comparison.svg" width="800">

---

### Prealignments

Nuclear mitochondrial DNA segments (NuMTs) confuse aligners

---

<img src="/_modules/pepatac/numts.svg" width="800">
<img src="/_modules/pepatac/numts_simultaneous_alignment.svg" width="800" class="fragment">
<img src="/_modules/pepatac/numts_alignment.svg" width="800" class="fragment">

---

<img src="/_modules/pepatac/numts.svg" width="800">
<img src="/_modules/pepatac/numts_alignment_problems.svg" width="800">
<img src="/_modules/pepatac/numts_alignment.svg" width="800">

---

<img src="/_modules/pepatac/numts.svg" width="800">
<img src="/_modules/pepatac/numts_alignment_problems.svg" width="800">
<img src="/_modules/pepatac/numts_alignment_blacklist.svg" width="800">

---

<img src="/_modules/pepatac/numts_alignment_blacklist.svg" width="800">

<ul>
<li>Inaccurate alignment statistics</li>
<li>Requires pre-defined NuMT locations</li>
<li>Wastes compute power</li>
</ul>

---

<img src="/_modules/pepatac/prealignments.svg" width="800">

---

<img src="/_modules/pepatac/prealignments2.svg" width="800">

---

### Advantages of serial alignments

- Accuracy: better alignment rates, and no blacklist needed.
- Speed.
- Modular reference assemblies.
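
The serial idea can be sketched in a few lines of Python. This is a toy illustration, not PEPATAC's implementation: `toy_align` is a hypothetical stand-in for a real aligner (it only tests exact substring matches), and the reference names and sequences are made up. Reads are offered to each reference in order, and only the reads that fail to align are passed on to the next one.

```python
from typing import Dict, List, Tuple


def toy_align(read: str, reference: str) -> bool:
    """Hypothetical stand-in for an aligner: exact substring match only."""
    return read in reference


def serial_align(
    reads: List[str], references: List[Tuple[str, str]]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Align reads to each reference in order.

    Reads that align to an earlier reference are removed before the next
    reference is tried, so each read is assigned at most once.
    """
    assigned: Dict[str, List[str]] = {}
    remaining = list(reads)
    for name, sequence in references:
        hits = [r for r in remaining if toy_align(r, sequence)]
        remaining = [r for r in remaining if not toy_align(r, sequence)]
        assigned[name] = hits
    return assigned, remaining


# Made-up sequences: the "nuclear" genome contains a NuMT-like copy of chrM.
references = [
    ("chrM", "ACGTACGTTTGACCA"),
    ("nuclear", "GGGCCCACGTACGTAAATTT"),
]
reads = ["ACGTACGT", "CCCACG", "TTGACC", "NNNN"]

assigned, unaligned = serial_align(reads, references)
print(assigned)
print(unaligned)
```

In this toy data the read `ACGTACGT` matches both references. Because `chrM` is tried first, the read is removed before the nuclear reference is considered, so no list of NuMT locations is needed to resolve the ambiguity.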
---

<img src="/_modules/pepatac/prealignment_speed.svg" width="800">

---

### Output

<img src="/_modules/pepatac/pepatac_output_sample.svg" width="650">

---

<img src="/_modules/pepatac/pepatac_summary.png" width="800"><br>
<a href="http://code.databio.org/PEPATAC/files/examples/gold/summary.html">http://code.databio.org/PEPATAC/files/examples/gold/summary.html</a>

---

<iframe src="http://code.databio.org/PEPATAC/files/examples/gold/summary.html" width="100%" height="675"></iframe>

---

<img src="/_modules/refgenie/refgenie_logo_light.svg" style="padding-top:25px; padding-bottom:25px; width: 350px">
<br>

A full-service reference genome manager.

<div class="small">
<a href="http://refgenie.databio.org">http://refgenie.databio.org</a><br>
</div>

<span class="small bullet"><img src="/_modules/refgenie/paper.svg" height="25" class="bullet"><a href="https://www.biorxiv.org/content/10.1093/gigascience/giz149">Stolarczyk et al. (2020).</a> <i>GigaScience</i>.</span><br/>
<span class="small bullet"><img src="/_modules/refgenie/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021).</a> <i>NAR Genomics and Bioinformatics</i>.</span><br/>

---

## The problem

Many tools require genome-related assets (like indexes).
<br>How should we organize these on disk?<br>

<img src="/_modules/refgenie/folder_structures.svg" style="background:white" width="600">

---

## A standard organization simplifies the tool interface

Flexible paths must be passed individually:

```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
  --tss_annotation path/to/hg38/tss_annotation.bed \
  --ensembl_anno path/to/hg38/ensembl_v86.gtf
```

A standard establishes expectations:

```
pipeline.py --genome hg38
```

---

## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.
You download a tarball with a standard structure for your genome of interest, then write tools against that structure.

---

## The 'central repository' approach is limited

- *Not scripted*. No iGenomes for an arbitrary genome/asset.
- *Not modular*. No access to individual assets.
- *Not programmatic*. Can't access data/metadata via API.

---

## Refgenie solves these limitations

- *Two ways to retrieve an asset*.
  - `build` any asset from a recipe.
  - `pull` any individual asset from a server.
- *Better discoverability and modularity*.
  - `list`/`listr` show available assets.
  - `refgenieserver` is a browsable web interface and API.
- *Managed locations*.
  - `seek` returns the local path to assets.
  - `add`/`remove` manage your own assets.

---

## Refgenie consists of 3 components

<img src="/_modules/refgenie/refgenie_components.svg" style="background:white" height="350">
<img src="/_modules/refgenie/refgenie_server.svg" style="background:white" height="350" class="fragment">

---

## Refgenie splits tasks between CLI and server

<img src="/_modules/refgenie/refgenie_logo_light.svg" style="padding-top:25px; padding-bottom:25px; width: 350px">
<img src="/_modules/refgenie/refgenie_interfaces.svg" style="background:white" width="550">

---

## Refgenie CLI example

[http://refgenie.databio.org](http://refgenie.databio.org)

---

## The build/pull method needs provenance checks

<div class="col2">

### Asset provenance:

<img src="/_modules/refgenie/asset_provenance.svg" style="background:white" width="300">
</div>
<div class="col2">

### Genome provenance:

<img src="/_modules/refgenie/genome_provenance.svg" style="background:white" width="300">
</div>

---

<div>

### Refget

Refget enables access to reference sequences <br>
using an identifier derived from the sequence itself.
<div class="small">
<a href="http://samtools.github.io/hts-specs/refget.html">http://samtools.github.io/hts-specs/refget.html</a><br>
<br><br>From GA4GH
</div>
</div>

---

## How refget works

<img src="/_modules/refgenie/refget_concept.svg" style="background:white" width="500">

---

## Refgenie implements a refget-like (collection) interface

You first need the `fasta` asset:

```
refgenie pull -g hg38 -a fasta
```

Then you can use `getseq`:

```
refgenie getseq -g hg38 -l chr1:50000-50400
AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT
```

---

## Refget v2.0: Collections for genome provenance

<img src="/_modules/refgenie/collection_checksum.svg" style="background:white" width="800">

---

## Recursive checksums have advantages

<div class="col2">
Allows getting content list only<br><br>
Preserves chromosome order<br><br>
Re-uses the checksum function<br><br>
Duplicates are stored only once<br><br>
Go one step further for...<br><br>
</div>
<div class="col2">
<img src="/_modules/refgenie/checksum_recursion.svg" style="background:white" width="400">
</div>

---

## It keeps going... and going...

<img src="/_modules/refgenie/checksum_recursion_2.svg" style="background:white" width="800">

---

<div class="col2">

### Asset provenance:

<img src="/_modules/refgenie/asset_provenance.svg" style="background:white" width="300">
<br>Recipes + containers?
</div>
<div class="col2">

### Genome provenance:

<img src="/_modules/refgenie/genome_provenance.svg" style="background:white" width="300">
<br>Solved by refget v2.0?
</div>

---

## Tying human identifiers to a digest:

```
hg38:
  refget_digest: 32a37a52a377d95bfd4b3d66763e1396a4480f34ab5c318a
```

---

## PEP: Portable Encapsulated Projects

<img src="/_modules/pep-format/pep_center_white.svg" width="700">

---

<div class="bullet">
<h2><img src="/_modules/pep-format/pep_logo.svg" width="70">PEP format</h2>
</div>

Start with a simple CSV with tabular data.

<hr>
<div class="bullet">
<img src="/_modules/pep-format/file.svg" width="30">samples.csv
</div>

```
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
```

---

<div class="bullet">
<h2><img src="/_modules/pep-format/pep_logo.svg" width="70">PEP format</h2>
</div>

Add a YAML for project-level data.

<hr>
<div class="bullet">
<img src="/_modules/pep-format/file.svg" width="30">samples.csv
</div>

```
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
```

<hr>
<div class="bullet">
<img src="/_modules/pep-format/file.svg" width="30">project_config.yaml
</div>

```yaml
sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value
```

---

### Add programmatic sample and project modifiers.
<div style="text-align: left">
<span class="bullet"><img src="/_modules/pep-format/replace_white.svg" width="50" class="bullet">Derived attributes</span><br>
<span class="bullet"><img src="/_modules/pep-format/implies_white.svg" width="50" class="bullet">Implied attributes</span><br>
<span class="bullet"><img src="/_modules/pep-format/subproject_white.svg" width="50" class="bullet">Subprojects</span><br>
</div>

---

<span class="bullet"><img src="/_modules/pep-format/replace_white.svg" width="50" class="bullet">Derived attributes</span><br>

<div class="well">Automatically build new sample attributes from existing attributes.</div>

Without derived attribute:

| sample_name | t | protocol | organism | input_file |
|-------------|---|:--------:|----------|------------|
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |

Using derived attribute:

| sample_name | t | protocol | organism | input_file |
|-------------|---|:--------:|----------|------------|
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |

---

| sample_name | t | protocol | organism | input_file |
|-------------|---|:--------:|----------|------------|
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |

Project config file:

```yaml
sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
```
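
To illustrate how templates like these could be expanded, here is a minimal Python sketch. It is not the PEP toolkit's implementation: `derive_attributes` and the sample dictionaries are hypothetical names used only for illustration. Each value in a derived column is looked up in `sources`, and the matching template is filled in from the sample's other attributes.

```python
from typing import Dict, List

Sample = Dict[str, str]


def derive_attributes(
    samples: List[Sample], attributes: List[str], sources: Dict[str, str]
) -> List[Sample]:
    """Return copies of the samples with derived attributes expanded."""
    derived = []
    for sample in samples:
        new_sample = dict(sample)  # leave the input sample untouched
        for attr in attributes:
            key = sample.get(attr)
            if key in sources:
                # Fill {organism}, {t}, ... from the sample's own columns.
                new_sample[attr] = sources[key].format(**sample)
        derived.append(new_sample)
    return derived


samples = [
    {"sample_name": "frog_0h", "t": "0", "organism": "frog", "input_file": "my_samples"},
    {"sample_name": "crab_3h", "t": "3", "organism": "crab", "input_file": "your_samples"},
]
sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

result = derive_attributes(samples, ["input_file"], sources)
for s in result:
    print(s["sample_name"], s["input_file"])
```

Values that do not match a key in `sources` are left unchanged in this sketch, so plain file paths and source keys can share a column.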
`{variable}` placeholders refer to sample annotation columns.

<div class="well">Benefit: Enables distributed files, portability</div>

---

<span class="bullet"><img src="/_modules/pep-format/implies_white.svg" width="50" class="bullet">Implied attributes</span><br>

<div class="well">Add new sample attributes conditioned on values of existing attributes</div>

<div class="col2">
Before:<br>

| sample_name | protocol | organism |
|-------------|:--------:|----------|
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |

</div>
<div class="col2">
After:<br>

| sample_name | protocol | organism | genome |
|-------------|:--------:|----------|--------|
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |

</div>

---

| sample_name | protocol | organism |
|-------------|:--------:|----------|
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |

Project config file:

```yaml
sample_modifiers:
  imply:
    - if:
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10
```

<div class="well">Benefit: Divides project from sample metadata</div>

---

<span class="bullet"><img src="/_modules/pep-format/subproject.svg" width="50" class="bullet">Subprojects</span><br>

<div class="well">Define activatable project attributes.</div>

```yaml
project_modifiers:
  amend:
    diverse:
      sample_table: psa_rrbs_diverse.csv
    cancer:
      sample_table: psa_rrbs_intracancer.csv
```

<div class="well">Benefit: Defines multiple similar projects in a single file</div>

---

<style>
#acknowledgements { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; }
</style>

<section id="acknowledgements" data-background="/images/presentations/bg.svg.png">

# Thank You

<br clear="all"/>
<span class="small
bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> ·
<span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> ·
<span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

<div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9">
<img src="/images/external/uva_dgs_logo.svg" height="65">
<img src="/images/logo/logo_databio_long.svg" height="45">
</div>

</section>

---

We are now in the

# Era of Large Biomedical Data

<br><br>
<span class="fragment">Hypothesis: <br><br>

# The most important advances of the future will come from studies that can integrate data from many sources

</span>

<div class="fragment">Integrating data introduces 2 major challenges: <br/>
<ol>
<li><span class="fragment">Data scale</span></li>
<li><span class="fragment">Data harmonization</span></li>
</ol>
</div>

---

# Why is data harmonization hard?

<div class="fragment">
Because the number of comparisons grows quadratically.<br>
With N datasets already integrated, each new dataset adds N additional pairwise comparisons.
<img src="/shorts/pepkit/stars.gif">
</div>

---

# The conundrum

We stand to benefit immensely <br/>
from integrating broader and broader data sources.<br><br>
BUT...the wider our integration effort,<br/>
the more challenging the integration.

---

<div>
<img src="/shorts/pepkit/pep_logo_white.svg" width="150" >
<h3>Pepkit</h3>
A structure and toolkit for organizing large-scale, <br>
sample-intensive biological research projects<br>
<div class="small">
<a href="http://pepkit.github.io/">http://pepkit.github.io/</a><br>
</div>
</div>

<span class="small bullet"><img src="/shorts/pepkit/paper.svg" height="25" class="bullet"><a href="http://dx.doi.org/10.1093/gigascience/giab077">Sheffield et al.
(2021).</a> <i>GigaScience</i>.</span>
<br/>
<ul class="fragment">
<li>1. Metadata management</li>
<li>2. Pipeline development</li>
<li>3. Reproducible computing environments</li>
</ul>

---

<img src="/shorts/pepkit/pepkit_subway_map.png">

---

## <img src="/_modules/bulker/bulker_logo.svg" width="250" style="vertical-align: middle;">

A multi-container environment manager.

<span class="small bullet"><img src="/_modules/bulker/icons/paper.svg" height="25" class="bullet"> Sheffield (2019).
<i>OSF Preprints</i>.</span>

<div class="small">
<a href="https://bulker.io">https://bulker.io</a>
</div>

---

## Reproducibility

data + code <span class="fragment">+ environment</span>

---

## Containers

A promising solution, but how should we use them?

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; text-align: center;">
Combined
<img src="/_modules/bulker/monolithic_containers.svg" style="padding:15px; background:white; width:300px">
</div>
<div style="width: 45%; text-align: center;">
Individual
<img src="/_modules/bulker/modular_containers.svg" style="padding:15px; background:white; width:300px">
</div>
</div>

---

## Trade-offs

| | Combined | Individual | Bulker |
|---|:---:|:---:|:---:|
| easy to deploy | ✓ | ✗ | ✓ |
| easy to use | ✓ | ✗ | ✓ |
| reusable | ✗ | ✓ | ✓ |
| combinable | ✗ | ✓ | ✓ |
| subsettable | ✗ | ✓ | ✓ |
| space efficient | ✗ | ✓ | ✓ |

---

## How bulker does it

Two conceptual advances:

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; text-align: center;">
Containerized executables
<img src="/_modules/bulker/bulker_executables.svg" style="padding:10px; background:white; width:350px">
</div>
<div style="width: 45%; text-align: center;">
Distribute containers in sets
<img src="/_modules/bulker/modular_containers_to_crate.svg" style="padding:10px; background:white; width:325px">
</div>
</div>

---

<section id="acknowledgements" data-background="bg.svg.png">

## Thank You

<div class="col3" style="font-size:.6em">
<b>Collaborators</b>
<br>Vince Reuter
<br>Andre Rendeiro
<br>Levi Waldron
<br><br>
<b>Alumni</b>
<br>Aaron Gu
<br>Jianglin Feng
<br>Ognen Duzlevski
<br>Tessa Danehy
</div>
<div class="col3" style="font-size:.6em">
<b>Sheffield lab</b>
<br>Erfaneh Gharavi
<br>Michal Stolarczyk
<br>John Lawson
<br>Jason Smith
<br>Kristyna Kupkova
<br>John Stubbs
<br>Bingjie Xue
<br>Jose Verdezoto
<br>Nathan LeRoy
<br>Oleksandr Khoroshevskyi
</div>
<div class="col3" style="font-size:.6em">
<b>Funding:</b><br> <br><img src="/shorts/adventure/logo/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/shorts/adventure/logo/University_of_Virginia_logo_white.svg" height="40"> <br><img src="/shorts/adventure/logo/NIH_logo_black.svg" height="80"><br>NIGMS R35-GM128636 </div> <br clear="all"/> <span class="small bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span><br> <br> <div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9"><img src="/shorts/adventure/logo/uva_dgs_logo.svg" height="65"><img src="/shorts/adventure/logo/logo_databio_long.svg" height="45"></div> </section>