<style>
#title {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# ATAC-seq analysis via pipeline

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---
<!-- .slide: data-background="/images/presentations/bg.svg.png" data-transition-speed="slow" -->
### Outline

<style>
.previewblock {
  float: left;
  width: 20px;
  height: 45px;
  margin: 0;
  border: none;
  white-space: nowrap;
  box-sizing: border-box;
}
.questionblock {
  float: left;
  width: 100%;
  margin: 5px 0;
  border: 1px solid rgba(255, 255, 255, .2);
}
</style>

<div class="previewblock" style="width:30%">Pipelines</div>
<div class="previewblock" style="width:40%">ATAC-seq pipeline (PEPATAC)</div>
<div class="previewblock" style="width:30%">ATAC-seq QC</div>
<div class="previewblock" style="width:30%">|</div>
<div class="previewblock" style="width:40%">|</div>
<div class="previewblock" style="width:30%">|</div>
<br clear="all">
<div class="previewblock" style="width:30%; background:#883388">30%</div>
<div class="previewblock" style="width:40%; background:#338833">40%</div>
<div class="previewblock" style="width:30%; background:#338888">30%</div>
<div class="previewblock" style="width:30%"></div>
<div class="previewblock" style="width:40%"></div>
<div class="previewblock" style="width:30%"></div>
<br clear="all">
<div class="previewblock" style="width:30%"></div>
<div class="previewblock" style="width:40%"></div>
<div class="previewblock" style="width:30%"></div>
<div class="questionblock" style="background:#222; color:#eee; font-size: 0.6em; margin-top: 35px">◁ Questions ▷</div>

---

### Analysis spectrum

---

### Advantages: interactive analysis vs pipelines

<div class="col2">
<u>Interactive</u><br>
- More universal learning curve<br>
- Direct control<br>
- Quicker to get started<br>
- Easier for simple analysis<br>
</div>

<div class="col2">
<u>Pipeline</u><br>
- Easier for high volume<br>
- More robust error handling<br>
- Automatic logging<br>
- Restartable<br>
- Built-in monitoring<br>
- More repeatable<br>
- More reproducible<br>
</div>

---

### What should I use?

The combination that best fits your project requirements.

---

### ATAC-seq pipelines

<div class="well">
There is growing need for integrated pipelines to process ATAC-seq data. Several have been developed but have different focus for downstream analysis by stitching together previously discussed tools. (<a href="https://doi.org/10.1186/s13059-020-1929-3">Yan et al. 2020</a>)
</div>

- [ENCODE ATAC-seq pipeline](https://github.com/ENCODE-DCC/atac-seq-pipeline)
- [PEPATAC](http://pepatac.databio.org)
- [esATAC](https://www.bioconductor.org/packages/release/bioc/html/esATAC.html)
- [More pipelines](https://github.com/databio/awesome-atac-analysis)

---

## <img src="/_modules/pepatac/logo_pepatac.svg" width="175" style="padding-top:25px; padding-bottom:25px; vertical-align: middle;"> PEPATAC

An optimized ATAC-seq pipeline with serial alignments

<div class="small">
<a href="http://pepatac.databio.org">http://pepatac.databio.org</a>
</div>

<span class="small bullet"><img src="/_modules/pepatac/icons/paper.svg" height="25" class="bullet">Smith et al. (2021, In press).
<i>NAR Genomics and Bioinformatics</i>.</span>

---

## PEPATAC workflow

<img src="/_modules/pepatac/pepatac_workflow_white.svg" width="600">

---

## PEPATAC strengths

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">
<span style="color:goldenrod">Modular system</span>
<br><br>
<span>Prealignments</span>
</div>
<div style="width: 45%;">
<span>Flexibility and portability</span>
<br><br>
<span>Outputs</span>
</div>
</div>

---

## Command-line interface with only 3 required arguments

```
$ /pipelines/pepatac.py -h
```

```
usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
                  [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
                  [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
                  [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]]
                  -G GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                  [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                  [--anno-name ANNO_NAME] [--keep] [--peak-caller {fseq,macs2}]
                  [--trimmer {trimmomatic,skewer}]
                  [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

PEPATAC version 0.7.3

optional arguments:
  -h, --help            show this help message and exit
  -R, --recover         Overwrite locks to recover from previous failed run
  -N, --new-start       Overwrite all results to start a fresh run
  -D, --dirty           Don't auto-delete intermediate files
  -F, --force-follow    Always run 'follow' commands
  -C CONFIG_FILE, --config CONFIG_FILE
                        Pipeline configuration file (YAML). Relative paths are
                        with respect to the pipeline script.
  -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                        Parent output directory of project
  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                        Memory limit (in Mb) for processes accepting such
  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                        Number of cores for parallelized processes
  -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                        Secondary input files, such as read2
  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                        Single- or paired-end sequencing protocol
  -gs GENOME_SIZE, --genome-size GENOME_SIZE
                        genome size for MACS2
  --frip-ref-peaks FRIP_REF_PEAKS
                        Reference peak set for calculating FRiP
  --TSS-name TSS_NAME   Name of TSS annotation file
  --anno-name ANNO_NAME
                        Name of reference bed file for calculating FRiF
  --keep                Keep prealignment BAM files
  --peak-caller {fseq,macs2}
                        Name of peak caller
  --trimmer {trimmomatic,pyadapt,skewer}
                        Name of read trimming program
  --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                        Space-delimited list of reference genomes to align to
                        before primary alignment.
  -V, --version         show program's version number and exit

required named arguments:
  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name for sample to run
  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        One or more primary input files
  -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                        Identifier for genome assembly
```

---

## Portable Encapsulated Projects (PEP) provide interoperability

<img src="/_modules/pepatac/pepatac_modularity_1.svg" width="600">

---

## Portable Encapsulated Projects (PEP) provide interoperability

<img src="/_modules/pepatac/pepatac_modularity_2.svg" width="600">

---

## PEP specification for sample metadata

1. Configuration file: `config.yaml`

```yaml
pep_version: 2.0.0
sample_table: "path/to/sample_table.csv"
```

2. Tabular sample annotation table: `sample_table.csv`:

```csv
"sample_name", "protocol", "file"
"frog_1", "ATAC-seq", "frog1.fq.gz"
"frog_2", "ATAC-seq", "frog2.fq.gz"
"frog_3", "ATAC-seq", "frog3.fq.gz"
"frog_4", "ATAC-seq", "frog4.fq.gz"
```

<a href="http://pep.databio.org">pep.databio.org</a>

---

## MapReduce or Scatter/Gather

1. Map/Scatter PEPATAC across individual samples

```bash
looper run config.yaml
```

2. Gather results and do cross-sample analysis

```bash
looper runp config.yaml
```

---

## PEPATAC strengths

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">
<span>Modular system</span>
<br><br>
<span style="color:goldenrod">Prealignments</span>
</div>
<div style="width: 45%;">
<span>Flexibility and portability</span>
<br><br>
<span>Outputs</span>
</div>
</div>

<div class="fragment">
Nuclear-mitochondrial DNA (NuMts) confuse aligners
</div>

---

## Nuclear-mitochondrial DNA (NuMts)

<img src="/_modules/pepatac/numts.svg" width="800">
<img src="/_modules/pepatac/numts_simultaneous_alignment.svg" width="800" class="fragment">
<img src="/_modules/pepatac/numts_alignment.svg" width="800" class="fragment">

---

## NuMts alignment problems

<img src="/_modules/pepatac/numts.svg" width="800">
<img src="/_modules/pepatac/numts_alignment_problems.svg" width="800">
<img src="/_modules/pepatac/numts_alignment.svg" width="800">

---

## NuMts with blacklist approach

<img src="/_modules/pepatac/numts.svg" width="800">
<img src="/_modules/pepatac/numts_alignment_problems.svg" width="800">
<img src="/_modules/pepatac/numts_alignment_blacklist.svg" width="800">

---

## Problems with region masking

<img src="/_modules/pepatac/numts_alignment_blacklist.svg" width="800">

- Inaccurate alignment statistics
- Requires pre-defined NuMt locations
- Wastes compute power

---

## Serial prealignments

<img src="/_modules/pepatac/prealignments.svg" width="800">

---

## Serial prealignments process

<img src="/_modules/pepatac/prealignments2.svg" width="800">
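
The serial idea can be sketched in a few lines of Python. This is a toy illustration only, not PEPATAC's implementation: exact substring matching stands in for a real aligner such as `bowtie2`, and the names `toy_aligns`, `serial_align`, and the sequences are hypothetical. Reads are offered to each prealignment reference in order, and only the unaligned reads move on to the next reference.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy type: a reference is (name, sequence).
Reference = Tuple[str, str]


def toy_aligns(read: str, sequence: str) -> bool:
    """Toy stand-in for a real aligner: exact substring match."""
    return read in sequence


def serial_align(
    reads: Sequence[str],
    prealignments: Sequence[Reference],
    primary: Reference,
) -> Tuple[Dict[str, int], List[str]]:
    """Align to each reference in order; pass only unaligned reads onward."""
    aligned_counts: Dict[str, int] = {}
    remaining = list(reads)
    for name, sequence in [*prealignments, primary]:
        unaligned = [r for r in remaining if not toy_aligns(r, sequence)]
        aligned_counts[name] = len(remaining) - len(unaligned)
        remaining = unaligned  # the next reference sees fewer reads
    return aligned_counts, remaining


# Toy data: the "genome" carries a NuMt-like copy of part of "chrM".
mito = ("chrM", "ACGTACGTTT")
genome = ("genome", "GGGACGTACGCCC")
reads = ["ACGTACG", "GGGA", "TTTT"]

counts, leftover = serial_align(reads, [mito], genome)
print(counts)
print(leftover)
```

In this toy run, the read that matches both references is assigned to `chrM` during prealignment and never reaches the primary genome, so no blacklist of NuMt locations is needed.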
---

## Advantages of serial alignments

- Accuracy (better rates plus no blacklist needed).
- Speed.
- Modular reference assemblies.

---

## Prealignment mapping rates

<img src="/_modules/pepatac/prealignment_mapping.svg" width="800">

---

## Prealignment speed improvements

<img src="/_modules/pepatac/prealignment_speed.svg" width="800">

---

## PEPATAC strengths

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">
<span>Modular system</span>
<br><br>
<span>Prealignments</span>
</div>
<div style="width: 45%;">
<span style="color:goldenrod">Flexibility and portability</span>
<br><br>
<span>Outputs</span>
</div>
</div>

---

## Flexibility and Portability

- trimmer options: `skewer` and `trimmomatic`
- peak caller options: `macs2` and `fseq`
- aligner options: `bowtie2` and `bwa`

```bash
./pepatac.py --trimmer trimmomatic --peak-caller fseq
```

---

## Flexibility and Portability

Parameterization via config file `pepatac.yaml`:

```yaml
# basic tools
tools:  # absolute paths to required tools
  java: java
  python: python
  samtools: samtools
  bedtools: bedtools
  bowtie2: bowtie2
  fastqc: fastqc
  macs2: macs2
  picard: ${PICARD}
  skewer: skewer
  perl: perl
  # ucsc tools
  bedGraphToBigWig: bedGraphToBigWig
  wigToBigWig: wigToBigWig
  bigWigCat: bigWigCat
  bedSort: bedSort
  bedToBigBed: bedToBigBed
  # optional tools
  fseq: fseq
  trimmo: ${TRIMMOMATIC}
  Rscript: Rscript

# user configure
resources:
  genomes: ${GENOMES}
  adapters: null  # Set to null to use default adapters

parameters:  # parameters passed to bioinformatic tools
  samtools:
    q: 10
  macs2:
    f: BED
    q: 0.01
    shift: 0
  fseq:
    of: npf  # narrowPeak as output format
    l: 600   # feature length
    t: 4.0   # "threshold" (standard deviations)
    s: 1     # wiggle track step
```

---

## Flexibility and Portability

Running options:

- natively
- conda
- containers using `docker` or `singularity`.
- use bulker to manage containers (http://bulker.io)

```bash
git clone https://github.com/databio/pepatac
docker pull databio/pepatac
docker run --rm -it databio/pepatac pipelines/pepatac.py
```

---

## PEPATAC strengths

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">
<span>Modular system</span>
<br><br>
<span>Prealignments</span>
</div>
<div style="width: 45%;">
<span>Flexibility and portability</span>
<br><br>
<span style="color:goldenrod">Outputs</span>
</div>
</div>

---

## Output

<img src="/_modules/pepatac/pepatac_output_sample.svg" width="650">

---

## Summary report

<img src="/_modules/pepatac/pepatac_summary.png" width="800">

<a href="http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html">http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html</a>

---

## PEPATAC in practice

<div style="font-size:0.5em">

- **O'Connor et al. (2021).** *bioRxiv*. DOI: [10.1101/2021.07.15.452570](http://dx.doi.org/10.1101/2021.07.15.452570)
- **Ram-Mohan et al. (2021).** *Life Science Alliance*. DOI: [10.26508/lsa.202000976](http://dx.doi.org/10.26508/lsa.202000976)
- **Robertson et al. (2021).** *Nature Genetics*. DOI: [10.1038/s41588-021-00880-5](http://dx.doi.org/10.1038/s41588-021-00880-5)
- **Cheung et al. (2021).** DOI: [10.1038/s41590-021-00928-y](http://dx.doi.org/10.1038/s41590-021-00928-y)
- **Hasegawa et al. (2021).** *bioRxiv*. DOI: [10.1101/2021.04.28.441728](http://dx.doi.org/10.1101/2021.04.28.441728)
- **Weber et al. (2021).** *Science*. DOI: [10.1126/science.aba1786](http://dx.doi.org/10.1126/science.aba1786)
- **Tovar et al. (2021).** *bioRxiv*. DOI: [10.1101/2021.01.29.428733](http://dx.doi.org/10.1101/2021.01.29.428733)
- **Granja et al. (2021).** *Nature Genetics*. DOI: [10.1038/s41588-021-00790-6](http://dx.doi.org/10.1038/s41588-021-00790-6)
- **Fan et al. (2020).** *Cell Reports*.
DOI: [10.1016/j.celrep.2020.108473](http://dx.doi.org/10.1016/j.celrep.2020.108473)
- **Smith and Sheffield (2020).** *Current Protocols in Human Genetics*. DOI: [10.1002/cphg.101](http://dx.doi.org/10.1002/cphg.101)
- **Liu (2020).** DOI: [10.18632/oncotarget.27584](http://dx.doi.org/10.18632/oncotarget.27584)
- **Zhou et al. (2020).** *bioRxiv*. DOI: [10.1101/2020.05.16.099325](http://dx.doi.org/10.1101/2020.05.16.099325)
- **Cai et al. (2020).** DOI: [10.1186/s12920-020-0695-0](http://dx.doi.org/10.1186/s12920-020-0695-0)
- **Li et al. (2020).** DOI: [10.1038/s41419-020-2303-9](http://dx.doi.org/10.1038/s41419-020-2303-9)
- **Liang et al. (2019).** DOI: [10.1002/1873-3468.13549](http://dx.doi.org/10.1002/1873-3468.13549)
- **Corces et al. (2018).** *Science*. DOI: [10.1126/science.aav1898](http://dx.doi.org/10.1126/science.aav1898)

</div>

---

<style>
#acknowledgements {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="acknowledgements" data-background="/images/presentations/bg.svg.png">

# Thank You

<br clear="all"/>

<span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> ·
<span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> ·
<span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

<div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9">
<img src="/images/external/uva_dgs_logo.svg" height="65">
<img src="/images/logo/logo_databio_long.svg" height="45">
</div>

</section>