<style> #title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow"> # ATAC-seq pipeline processing Jason Smith, Nathan Sheffield <div class="bullet"> <img src="/images/external/uva_dgs_logo.svg" height="85"> <img src="/images/logo/logo_databio_long.svg" height="65"> </div> <span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span> </section> --- ## <img src="/_modules/pepatac/logo_pepatac.svg" width="175" style="padding-top:25px; padding-bottom:25px; vertical-align: middle;"> PEPATAC An optimized ATAC-seq pipeline with serial alignments <div class="small"> <a href="http://pepatac.databio.org">http://pepatac.databio.org</a> </div> <span class="small bullet"><img src="/_modules/pepatac/icons/paper.svg" height="25" class="bullet">Smith et al. (2021, in press). <i>NAR Genomics and Bioinformatics</i>.</span> --- ## PEPATAC workflow <img src="/_modules/pepatac/pepatac_workflow_white.svg" width="600"> --- ## PEPATAC strengths <div style="display: flex; justify-content: space-between;"> <div style="width: 45%;"> <span style="color:goldenrod">Modular system</span> <br><br> <span>Prealignments</span> </div> <div style="width: 45%;"> <span>Flexibility and portability</span> <br><br> <span>Outputs</span> </div> </div> --- ## Command-line interface with only 3 required arguments ``` $ /pipelines/pepatac.py -h ``` ``` usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE] [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT] [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES [INPUT_FILES ...] 
[-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE] [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME] [--anno-name ANNO_NAME] [--keep] [--peak-caller {fseq,macs2}] [--trimmer {trimmomatic,skewer}] [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V] PEPATAC version 0.7.3 optional arguments: -h, --help show this help message and exit -R, --recover Overwrite locks to recover from previous failed run -N, --new-start Overwrite all results to start a fresh run -D, --dirty Don't auto-delete intermediate files -F, --force-follow Always run 'follow' commands -C CONFIG_FILE, --config CONFIG_FILE Pipeline configuration file (YAML). Relative paths are with respect to the pipeline script. -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER Parent output directory of project -M MEMORY_LIMIT, --mem MEMORY_LIMIT Memory limit (in Mb) for processes accepting such -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES Number of cores for parallelized processes -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]] Secondary input files, such as read2 -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED Single- or paired-end sequencing protocol -gs GENOME_SIZE, --genome-size GENOME_SIZE genome size for MACS2 --frip-ref-peaks FRIP_REF_PEAKS Reference peak set for calculating FRiP --TSS-name TSS_NAME Name of TSS annotation file --anno-name ANNO_NAME Name of reference bed file for calculating FRiF --keep Keep prealignment BAM files --peak-caller {fseq,macs2} Name of peak caller --trimmer {trimmomatic,pyadapt,skewer} Name of read trimming program --prealignments PREALIGNMENTS [PREALIGNMENTS ...] Space-delimited list of reference genomes to align to before primary alignment. -V, --version show program's version number and exit required named arguments: -S SAMPLE_NAME, --sample-name SAMPLE_NAME Name for sample to run -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...] 
One or more primary input files -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY Identifier for genome assembly ``` --- ## Portable Encapsulated Projects (PEP) provide interoperability <img src="/_modules/pepatac/pepatac_modularity_1.svg" width="600"> --- ## Portable Encapsulated Projects (PEP) provide interoperability <img src="/_modules/pepatac/pepatac_modularity_2.svg" width="600"> --- ## PEP specification for sample metadata 1. Configuration file: `config.yaml` ```yaml pep_version: 2.0.0 sample_table: "path/to/sample_table.csv" ``` 2. Sample annotation table: `sample_table.csv` ```csv "sample_name","protocol","file" "frog_1","ATAC-seq","frog1.fq.gz" "frog_2","ATAC-seq","frog2.fq.gz" "frog_3","ATAC-seq","frog3.fq.gz" "frog_4","ATAC-seq","frog4.fq.gz" ``` <a href="http://pep.databio.org">pep.databio.org</a> --- ## MapReduce or Scatter/Gather 1. Map/Scatter PEPATAC across individual samples ```bash looper run config.yaml ``` 2. Gather results and do cross-sample analysis ```bash looper runp config.yaml ``` --- ## PEPATAC strengths <div style="display: flex; justify-content: space-between;"> <div style="width: 45%;"> <span>Modular system</span> <br><br> <span style="color:goldenrod">Prealignments</span> </div> <div style="width: 45%;"> <span>Flexibility and portability</span> <br><br> <span>Outputs</span> </div> </div> <div class="fragment"> Nuclear-mitochondrial DNA segments (NuMts) confuse aligners </div> --- ## Nuclear-mitochondrial DNA (NuMts) <img src="/_modules/pepatac/numts.svg" width="800"> <img src="/_modules/pepatac/numts_simultaneous_alignment.svg" width="800" class="fragment"> <img src="/_modules/pepatac/numts_alignment.svg" width="800" class="fragment"> --- ## NuMts alignment problems <img src="/_modules/pepatac/numts.svg" width="800"> <img src="/_modules/pepatac/numts_alignment_problems.svg" width="800"> <img src="/_modules/pepatac/numts_alignment.svg" width="800"> --- ## NuMts with blacklist approach <img src="/_modules/pepatac/numts.svg" 
width="800"> <img src="/_modules/pepatac/numts_alignment_problems.svg" width="800"> <img src="/_modules/pepatac/numts_alignment_blacklist.svg" width="800"> --- ## Problems with region masking <img src="/_modules/pepatac/numts_alignment_blacklist.svg" width="800"> - Inaccurate alignment statistics - Requires pre-defined NuMt locations - Wastes compute power --- ## Serial prealignments <img src="/_modules/pepatac/prealignments.svg" width="800"> --- ## Serial prealignments process <img src="/_modules/pepatac/prealignments2.svg" width="800"> --- ## Advantages of serial alignments - Accuracy (better rates plus no blacklist needed). - Speed. - Modular reference assemblies. --- ## Prealignment mapping rates <img src="/_modules/pepatac/prealignment_mapping.svg" width="800"> --- ## Prealignment speed improvements <img src="/_modules/pepatac/prealignment_speed.svg" width="800"> --- ## PEPATAC strengths <div style="display: flex; justify-content: space-between;"> <div style="width: 45%;"> <span>Modular system</span> <br><br> <span>Prealignments</span> </div> <div style="width: 45%;"> <span style="color:goldenrod">Flexibility and portability</span> <br><br> <span>Outputs</span> </div> </div> --- ## Flexibility and Portability - trimmer options: `skewer` and `trimmomatic` - peak caller options: `macs2` and `fseq` - aligner options: `bowtie2` and `bwa` ```bash ./pepatac.py --trimmer trimmomatic --peak-caller fseq ``` --- ## Flexibility and Portability Parameterization via config file `pepatac.yaml`: ```yaml # basic tools tools: # absolute paths to required tools java: java python: python samtools: samtools bedtools: bedtools bowtie2: bowtie2 fastqc: fastqc macs2: macs2 picard: ${PICARD} skewer: skewer perl: perl # ucsc tools bedGraphToBigWig: bedGraphToBigWig wigToBigWig: wigToBigWig bigWigCat: bigWigCat bedSort: bedSort bedToBigBed: bedToBigBed # optional tools fseq: fseq trimmo: ${TRIMMOMATIC} Rscript: Rscript # user configure resources: genomes: ${GENOMES} adapters: null # 
Set to null to use default adapters parameters: # parameters passed to bioinformatic tools samtools: q: 10 macs2: f: BED q: 0.01 shift: 0 fseq: of: npf # narrowPeak as output format l: 600 # feature length t: 4.0 # "threshold" (standard deviations) s: 1 # wiggle track step ``` --- ## Flexibility and Portability Running options: - natively - conda - containers using `docker` or `singularity` - `bulker` to manage containers (http://bulker.io) ```bash git clone https://github.com/databio/pepatac docker pull databio/pepatac docker run --rm -it databio/pepatac pipelines/pepatac.py ``` --- ## PEPATAC strengths <div style="display: flex; justify-content: space-between;"> <div style="width: 45%;"> <span>Modular system</span> <br><br> <span>Prealignments</span> </div> <div style="width: 45%;"> <span>Flexibility and portability</span> <br><br> <span style="color:goldenrod">Outputs</span> </div> </div> --- ## Output <img src="/_modules/pepatac/pepatac_output_sample.svg" width="650"> --- ## Summary report <img src="/_modules/pepatac/pepatac_summary.png" width="800"> <a href="http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html">http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html</a> --- ## PEPATAC in practice <div style="font-size:0.5em"> - **O'Connor et al. (2021).** *bioRxiv*. DOI: [10.1101/2021.07.15.452570](http://dx.doi.org/10.1101/2021.07.15.452570) - **Ram-Mohan et al. (2021).** *Life Science Alliance*. DOI: [10.26508/lsa.202000976](http://dx.doi.org/10.26508/lsa.202000976) - **Robertson et al. (2021).** *Nature Genetics*. DOI: [10.1038/s41588-021-00880-5](http://dx.doi.org/10.1038/s41588-021-00880-5) - **Cheung et al. (2021).** DOI: [10.1038/s41590-021-00928-y](http://dx.doi.org/10.1038/s41590-021-00928-y) - **Hasegawa et al. (2021).** *bioRxiv*. DOI: [10.1101/2021.04.28.441728](http://dx.doi.org/10.1101/2021.04.28.441728) - **Weber et al. (2021).** *Science*. 
DOI: [10.1126/science.aba1786](http://dx.doi.org/10.1126/science.aba1786) - **Tovar et al. (2021).** *bioRxiv*. DOI: [10.1101/2021.01.29.428733](http://dx.doi.org/10.1101/2021.01.29.428733) - **Granja et al. (2021).** *Nature Genetics*. DOI: [10.1038/s41588-021-00790-6](http://dx.doi.org/10.1038/s41588-021-00790-6) - **Fan et al. (2020).** *Cell Reports*. DOI: [10.1016/j.celrep.2020.108473](http://dx.doi.org/10.1016/j.celrep.2020.108473) - **Smith and Sheffield (2020).** *Current Protocols in Human Genetics*. DOI: [10.1002/cphg.101](http://dx.doi.org/10.1002/cphg.101) - **Liu (2020).** DOI: [10.18632/oncotarget.27584](http://dx.doi.org/10.18632/oncotarget.27584) - **Zhou et al. (2020).** *bioRxiv*. DOI: [10.1101/2020.05.16.099325](http://dx.doi.org/10.1101/2020.05.16.099325) - **Cai et al. (2020).** DOI: [10.1186/s12920-020-0695-0](http://dx.doi.org/10.1186/s12920-020-0695-0) - **Li et al. (2020).** DOI: [10.1038/s41419-020-2303-9](http://dx.doi.org/10.1038/s41419-020-2303-9) - **Liang et al. (2019).** DOI: [10.1002/1873-3468.13549](http://dx.doi.org/10.1002/1873-3468.13549) - **Corces et al. (2018).** *Science*. DOI: [10.1126/science.aav1898](http://dx.doi.org/10.1126/science.aav1898) </div> --- ## Conclusion <div class="col2"> <div style="padding:7px"> If you're doing ATAC-seq analysis Try pepatac! <img src="/shorts/short-pepatac/pepatac/pepatac_logo.svg" width="185" style="padding-top:35px; padding-bottom:35px"> [code.databio.org/PEPATAC](http://code.databio.org/PEPATAC) </div></div> <div class="col2"> <div style="padding:7px"> If you're developing pipelines Try looper! <img src="/shorts/short-pepatac/logo/logo_looper.svg" width="125"> [looper.readthedocs.io](http://looper.readthedocs.io) </div> </div> --- Everyone else Eat chicken nuggets! 
<img src="/shorts/short-pepatac/icons/white_microwave.svg" width="125"> --- ## Acknowledgments <div class="col3" style="font-size:.6em"> <img src="/shorts/short-pepatac/logo/University_of_Virginia_Rotunda_logo.svg" height="40"><img src="/shorts/short-pepatac/logo/University_of_Virginia_logo_white.svg" height="40"> **Sheffield lab** - John Lawson - Vince Reuter - Jason Smith - Jianglin Feng - Michal Stolarczyk - Aaron Gu </div> <div class="col3" style="font-size:.6em"> <img src="/shorts/short-pepatac/logo/logo_cemm.svg" height="30"> **Christoph Bock** - Andre Rendeiro - Johanna Klughammer <img src="/shorts/short-pepatac/logo/stanford.svg" height="30"> **Howard Chang** - Ryan Corces - Yuning Wei - Jin Xu </div> <div class="col3" style="font-size:.6em"> **Funding:** <img src="/shorts/short-pepatac/logo/University_of_Virginia_logo_white.svg" height="40"> <img src="/shorts/short-pepatac/logo/NIH_logo_black.svg" height="80"> <img src="/shorts/short-pepatac/logo/hfsp_logo.svg" height="60"> </div> --- <style> #acknowledgements { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style> <section id="acknowledgements" data-background="/images/presentations/bg.svg.png"> # Thank You <br clear="all"/> <span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span> <div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9"> <img src="/images/external/uva_dgs_logo.svg" height="65"> <img src="/images/logo/logo_databio_long.svg" height="45"> </div> </section> --- ### 
Parallelism Philosophy <img src="/_modules/parallelism/parallel_sequential.svg" width="100%"><br> <div class="fragment"> <div class="col3" style="background-color:#211">by process <img src="/_modules/parallelism/parallel_process.svg" width="300"> </div> <div class="col3" style="background-color:#112">by sample <img src="/_modules/parallelism/parallel_sample.svg" width="300"> </div> <div class="col3" style="background-color:#121">by dependence <img src="/_modules/parallelism/parallel_dependency.svg" width="300"> </div> </div> <br clear="all"> <div class="fragment"> <div class="col3" style="background-color:#211">Very easy</div> <div class="col3" style="background-color:#112">Easy</div> <div class="col3" style="background-color:#121">Hard</div> </div> <br clear="all"> <div class="fragment"> <div class="col3" style="background-color:#211"> <img src="/_modules/parallelism/parallel_process_benefit.svg" width="300"> </div> <div class="col3" style="background-color:#112"> <img src="/_modules/parallelism/parallel_sample_benefit.svg" width="300"> </div> <div class="col3" style="background-color:#121"> <img src="/_modules/parallelism/parallel_dependency_benefit.svg" width="300"> </div> </div>
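
---

### Parallelism by sample: a sketch

A minimal sketch of the "by sample" strategy above, assuming each sample can be processed independently. `process_sample` and `run_by_sample` are hypothetical stand-ins for illustration; this is not how looper or PEPATAC is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_name):
    """Hypothetical stand-in for running one pipeline on one sample."""
    return f"{sample_name}: done"


def run_by_sample(sample_names, max_workers=2):
    """Scatter independent samples across workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, sample_names))


# Sample names reused from the example sample table earlier in the deck.
results = run_by_sample(["frog_1", "frog_2", "frog_3", "frog_4"])
print(results)
```

Samples share no state, so no coordination between workers is needed; that independence is what makes this strategy "easy" compared with parallelizing by dependence.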