<style>
#title {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# Bioinformatics pipeline development and deployment with Pypiper and Looper

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

## Jordan's story

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_smile.svg" align="left" width="125" /></td>
<td>Jordan got data from the sequencer today.</td>
</tr>
</table>

<table class="bullets fragment" style="border:0px solid black; border-collapse: collapse;">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/console.svg" align="left" width="125" /></td>
<td>He sits down at the terminal to process it.
<span>

```bash
bowtie2 sample.fastq ...
trimmomatic ...
macs2 output.fastq ...
```

</span>
</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/coin-thumbs--up.svg" width="125" /></td>
<td>Nice!</td>
</tr>
</table>

---

## Two weeks later...

---

## Jordan got data again

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_smile.svg" width="125" /></td>
<td>Jordan got data from the sequencer today.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/console.svg" align="left" width="125" /></td>
<td>He sits down at the terminal to process it.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_confused.svg" align="left" width="125" /></td>
<td>Hmm...
what did I do last time?</td>
</tr>
</table>

---

## Jordan has an idea

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_idea.svg" width="125" /></td>
<td>Jordan has an idea.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/app.svg" width="125" /></td>
<td>He sits down at the terminal to <span style="text-decoration: line-through;">process</span> <b>script</b> it.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/coin-thumbs--up.svg" width="125" /></td>
<td>Nice!</td>
</tr>
</table>

---

## Two weeks later...

---

## Server crashes

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_smile.svg" width="125" /></td>
<td>Jordan got data from the sequencer today.</td>
</tr>
</table>

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/file_shell_script.svg" width="125" /></td>
<td>He gets the script running...</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/server-crash.svg" width="125" /></td>
<td>The server crashes.
It would be nice if the script could pick up where it left off...</td>
</tr>
</table>

---

## Scaling up

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_smile.svg" width="125" /></td>
<td>Now Jordan has 500 samples for a time series experiment.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/refresh.svg" width="125" /></td>
<td>He starts writing some looping functions to handle cluster submission.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/clock.svg" width="125" /></td>
<td>This is going to take a while...</td>
</tr>
</table>

---

## Different parameters

<table class="bullets">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/jordan_smile.svg" width="125" /></td>
<td>In the meantime, Jordan generates other samples requiring slightly different parameters.</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/icons/file_shell_script2.svg" width="125" /></td>
<td>No problem, I'll just duplicate this script...</td>
</tr>
</table>

<table class="bullets fragment">
<tr>
<td width="135"><img src="/slides/pypiper-looper/characters/lily_open.svg" width="125" /></td>
<td>Stop!
There is a better way...</td>
</tr>
</table>

---

## Challenges with shell pipelines

<table class="bullets">
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>No record of the output of the tools</td>
</tr>
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>Failed steps do not halt the entire pipeline</td>
</tr>
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>Difficult to scale to 500 samples</td>
</tr>
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>Two pipelines running simultaneously may interfere</td>
</tr>
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>Tracking which version was used with which samples</td>
</tr>
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>Memory use is left unmonitored and unchecked</td>
</tr>
<tr>
<td><img src="/slides/pypiper-looper/icons/warning.svg" align="left" width="45" /></td>
<td>Requires custom parsers to extract results</td>
</tr>
</table>

---

## Python modules

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; text-align: center;">
<img src="/slides/pypiper-looper/logo/logo_pypiper.svg" width="150">

### Pypiper

Builds a pipeline for a single sample.

</div>
<div style="width: 45%; text-align: center;">
<img src="/slides/pypiper-looper/logo/logo_looper.svg" width="150">

### Looper

Deploys pipelines across samples.

</div>
</div>

<img src="/slides/pypiper-looper/icons/merge2.svg" width="150">

<div class="well">
Comprehensive pipeline management system
</div>

---

## <img src="/_modules/pypiper/logo_pypiper.svg" width="250" style="vertical-align: middle;"> Pypiper

Builds a pipeline for a single sample.

<div class="small">
GitHub: <a href="http://github.com/databio/pypiper">http://github.com/databio/pypiper</a>
Documentation: <a href="http://pypiper.readthedocs.io">http://pypiper.readthedocs.io</a>
</div>

---

## Pypiper features

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; text-align: left;">
<span class="bullet"><img src="/_modules/pypiper/icons/yinyang.svg" width="50" class="bullet"> Simplicity</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/refresh.svg" width="50" class="bullet"> Restartability</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/lock.svg" width="50" class="bullet"> File integrity lock</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/increase.svg" width="50" class="bullet"> Memory monitoring</span><br>
</div>
<div style="width: 45%; text-align: left;">
<span class="bullet"><img src="/_modules/pypiper/icons/magnify.svg" width="50" class="bullet"> Job monitoring</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/redx.svg" width="50" class="bullet"> Robust error handling</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/file_log.svg" width="50" class="bullet"> Automatic logging</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/printer.svg" width="50" class="bullet"> Easy result reports</span><br>
<span class="bullet"><img src="/_modules/pypiper/icons/collate.svg" width="50" class="bullet"> Collate input files</span><br>
</div>
</div>

---

## <span class="bullet"><img src="/_modules/pypiper/icons/yinyang.svg" width="50" class="bullet"> Simplicity</span>

Bash script:

```bash
shuf -i 1-500000000 -n 10000000 > outfile.txt
```

Pypiper script:

```python
pm.run("shuf -i 1-500000000 -n 10000000 > outfile.txt")
```

Using pypiper is as easy as writing a shell script. Additional *options* provide power *on demand*.


---

## <span class="bullet"><img src="/_modules/pypiper/icons/refresh.svg" width="50" class="bullet"> Restartability</span>

```python
target = os.path.join(outfolder, "outfile.txt")  # output file
command = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)  # skipped if target already exists
```

Commands (optionally) only run if the target does not already exist. The pipeline will thus pick up where it left off.

---

## <span class="bullet"><img src="/_modules/pypiper/icons/lock.svg" width="50" class="bullet"> File integrity lock</span>

**Lock files** ensure commands only run if the target is unlocked:

- pipelines will not proceed with incomplete files
- multiple pipelines can create/use the same files

---

## <span class="bullet"><img src="/_modules/pypiper/icons/magnify.svg" width="50" class="bullet"> Job monitoring</span>

Pypiper uses a flag system to track status

<span class="bullet"><img src="/_modules/pypiper/icons/flag_green.svg" width="50" class="bullet"> Job running
<img src="/_modules/pypiper/icons/flag_checker.svg" width="50" class="bullet"> Job completed
<img src="/_modules/pypiper/icons/flag_red.svg" width="50" class="bullet"> Job failed</span>

Summarizing jobs is easy: just count the flags

---

## <span class="bullet"><img src="/_modules/pypiper/icons/redx.svg" width="50" class="bullet"> Robust error handling</span>

<img src="/_modules/pypiper/fail_pipeline.svg" height="400">

If a process fails, the pipeline fails.

---

## <span class="bullet"><img src="/_modules/pypiper/icons/file_log.svg" width="50" class="bullet"> Automatic logging</span>

<img src="/_modules/pypiper/log_split.svg" height="400">

Output is automatically split to screen and file.

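
---

## Sketch: restart and lock logic

The restartability and lock-file ideas from the preceding slides can be illustrated in a few lines of plain Python. This is only a sketch under stated assumptions, not Pypiper's implementation: `run_once` and the `lock.<name>` file naming are hypothetical, and the real behavior is more configurable.

```python
import os
import subprocess


def run_once(command, target):
    """Run a shell command unless its target already exists.

    Hypothetical helper for illustration only.
    Returns True if the command ran, False if it was skipped.
    """
    folder, name = os.path.split(target)
    lock = os.path.join(folder, "lock." + name)  # assumed lock-file naming

    if os.path.exists(lock):
        # A lock means the target may be incomplete (another run is still
        # writing it, or an earlier run died midway), so refuse to use it.
        raise RuntimeError("target is locked: " + target)
    if os.path.exists(target):
        return False  # produced by an earlier run: skip this step

    # Create the lock atomically; this fails if another process won the race.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    # check=True raises CalledProcessError on a nonzero exit status,
    # so a failed step stops the caller instead of being ignored.
    subprocess.run(command, shell=True, check=True)
    os.remove(lock)  # only reached when the command succeeded
    return True
```

Calling `run_once` twice with the same target runs the command the first time and skips it the second, which is what lets a restarted script resume at the first unfinished step.
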

---

## <span class="bullet"><img src="/_modules/pypiper/icons/printer.svg" width="50" class="bullet"> Easy result reports</span>

```python
reads = count_reads(unaligned_file)
aligned = count_reads(aligned_file)
pm.report_result("aligned_reads", aligned)
pm.report_result("alignment_rate", aligned / reads)
```

Output:

```tsv
aligned_reads 2526232
alignment_rate 0.64234
```

---

## Example pipeline

```python
import os
import pypiper

outfolder = "pipeline_output/"  # folder for results
pm = pypiper.PipelineManager(name="shuf", outfolder=outfolder)
target = os.path.join(outfolder, "outfile.txt")  # output file
command = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)
pm.stop_pipeline()
```

---

## <img src="/_modules/looper/logo_looper.svg" width="150" style="vertical-align: middle;"> Looper

Deploys pipelines across samples by connecting samples to any command-line tool

<div class="small">
<a href="https://looper.databio.org">https://looper.databio.org</a>
</div>

---

<img src="/_modules/looper/looper_role_white_v2.svg" width="100%">

---

## pipeline_interface.yaml

```yaml
protocol_mappings:
  RNA-seq: rna-seq

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
```

- maps protocols to pipelines <!-- .element: class="fragment" -->
- maps sample attributes (columns) to pipeline arguments <!-- .element: class="fragment" -->

---

## Looper features

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; text-align: left;">
<span class="bullet"><img src="/_modules/looper/icons/input-mouse.svg" width="50" class="bullet"> Single-input runs</span><br>
<span class="bullet"><img src="/_modules/looper/icons/flexible.svg" width="50" class="bullet"> Flexible pipelines</span><br>
<span class="bullet"><img src="/_modules/looper/icons/piechart.svg" width="50" class="bullet"> Flexible resources</span><br>
</div>
<div style="width: 45%; text-align: left;">
<span
class="bullet"><img src="/_modules/looper/icons/computer.svg" width="50" class="bullet"> Flexible compute</span><br>
<span class="bullet"><img src="/_modules/looper/icons/flag_checker.svg" width="50" class="bullet"> Job status-aware</span><br>
</div>
</div>

---

## <span class="bullet"><img src="/_modules/looper/icons/input-mouse.svg" width="50" class="bullet"> Single-input runs</span>

Run your entire project with one line:

```bash
looper run project_config.yaml
```

---

## <span class="bullet"><img src="/_modules/looper/icons/flexible.svg" width="50" class="bullet"> Flexible pipelines</span>

```yaml
protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq
```

Many-to-many mappings

---

## <span class="bullet"><img src="/_modules/looper/icons/piechart.svg" width="50" class="bullet"> Flexible resources</span>

```yaml
pipeline_key:
  name: pipeline_name
  arguments:
    "--option" : value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"
```

Resources can vary by input file size

---

## <span class="bullet"><img src="/_modules/looper/icons/computer.svg" width="50" class="bullet"> Flexible compute</span>

```yaml
compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh
```

---

Adjust compute package on-the-fly:

```bash
looper run project_config.yaml --compute localhost
```

---

## <span class="bullet"><img src="/_modules/looper/icons/flag_checker.svg" width="50" class="bullet"> Job status-aware</span>

Looper only submits jobs for samples not already flagged as running, completed, or failed.

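
A minimal sketch of that filter, assuming each sample has its own output folder and that a status flag is a file named `<pipeline>_<status>.flag` inside it. The helper names and the flag naming are assumptions made for illustration, not Looper's code:

```python
import glob
import os

FLAG_STATUSES = ("running", "completed", "failed")


def flagged_status(sample_folder):
    """Return the first status whose flag file exists in the folder, else None.

    Assumes flags are files named '<pipeline>_<status>.flag' (hypothetical).
    """
    for status in FLAG_STATUSES:
        pattern = os.path.join(sample_folder, "*_" + status + ".flag")
        if glob.glob(pattern):
            return status
    return None


def samples_to_submit(sample_folders):
    """Keep only the sample folders that carry no status flag yet."""
    return [f for f in sample_folders if flagged_status(f) is None]
```

Counting the values returned by `flagged_status` over all sample folders gives the kind of per-status summary that the commands below report.
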
```bash
looper check project_config.yaml
```

```bash
looper summarize project_config.yaml
```

---

## Combine for a complete pipelining system

<img src="/slides/pypiper-looper/logo/logo_looper_pypiper.svg" width="250">

---

## Looper's role

<img src="/slides/pypiper-looper/looper_role_white.svg" width="100%">

---

## How is this better than _____ ?

- low barrier to entry (i.e., language)
- single-sample processing (Pypiper) decoupled from deployment (Looper)
- simplified parallelism

---

### Parallelism Philosophy

<img src="/_modules/parallelism/parallel_sequential.svg" width="100%"><br>

<div class="fragment">
<div class="col3" style="background-color:#211">by process <img src="/_modules/parallelism/parallel_process.svg" width="300"> </div>
<div class="col3" style="background-color:#112">by sample <img src="/_modules/parallelism/parallel_sample.svg" width="300"> </div>
<div class="col3" style="background-color:#121">by dependence <img src="/_modules/parallelism/parallel_dependency.svg" width="300"> </div>
</div>
<br clear="all">

<div class="fragment">
<div class="col3" style="background-color:#211">Very easy</div>
<div class="col3" style="background-color:#112">Easy</div>
<div class="col3" style="background-color:#121">Hard</div>
</div>
<br clear="all">

<div class="fragment">
<div class="col3" style="background-color:#211"> <img src="/_modules/parallelism/parallel_process_benefit.svg" width="300"> </div>
<div class="col3" style="background-color:#112"> <img src="/_modules/parallelism/parallel_sample_benefit.svg" width="300"> </div>
<div class="col3" style="background-color:#121"> <img src="/_modules/parallelism/parallel_dependency_benefit.svg" width="300"> </div>
</div>

---

## Getting started

Read the docs!

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; background-color:#011; padding: 20px;">

### Using a pipeline

Create a sample_annotation.csv

Create a project_config.yaml

Templates exist for both; follow the Looper tutorials.

</div>
<div style="width: 45%; background-color:#110; padding: 20px;">

### Building a pipeline

Follow the Pypiper tutorials.

Write a Pypiper pipeline to handle a single sample.

Connect it to Looper with a protocol mapping and a pipeline interface.

</div>
</div>

---

## Thanks for listening!