# Introduction to PEPkit

Nathan Sheffield

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>
---

## PEP: Portable Encapsulated Projects

---

<div class="bullet">
  <h2><img src="/_modules/pep-format/pep_logo.svg" width="70">PEP format</h2>
</div>

Start with a simple CSV file of sample metadata, one row per sample.

<hr>
<div class="bullet">
<img src="/_modules/pep-format/file.svg" width="30">samples.csv
</div>

```csv
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
```
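
As a rough illustration of why a plain CSV is a convenient starting point, the sketch below reads the table with Python's standard library and treats each row as one sample. The `read_samples` helper is hypothetical and is not part of PEPkit; the CSV text is inlined so the example runs on its own.

```python
import csv
import io

# Same content as samples.csv above, inlined so the sketch is self-contained.
SAMPLES_CSV = """\
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
"""


def read_samples(text):
    """Return one dict per sample, keyed by the CSV header row."""
    return list(csv.DictReader(io.StringIO(text)))


samples = read_samples(SAMPLES_CSV)
for sample in samples:
    print(sample["sample_name"], sample["input_file"])
```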

---

<div class="bullet">
	<h2><img src="/_modules/pep-format/pep_logo.svg" width="70">PEP format</h2>
</div>

Add a YAML file for project-level data.

<hr>
<div class="bullet">
<img src="/_modules/pep-format/file.svg" width="30">samples.csv
</div>

<hr>
<div class="bullet">
<img src="/_modules/pep-format/file.svg" width="30">project_config.yaml
</div>

```yaml
sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value
```

---

### Add programmatic sample and project modifiers.

<div style="text-align: left">
<span class="bullet"><img src="/_modules/pep-format/replace_white.svg" width="50" class="bullet">Derived attributes</span><br>
<span class="bullet"><img src="/_modules/pep-format/implies_white.svg" width="50" class="bullet">Implied attributes</span><br>
<span class="bullet"><img src="/_modules/pep-format/subproject_white.svg" width="50" class="bullet">Subprojects</span><br>
</div>

---

<span class="bullet"><img src="/_modules/pep-format/replace_white.svg" width="50" class="bullet">Derived attributes</span><br>
<div class="well">Automatically build new sample attributes from existing attributes.</div>

Without derived attributes:

| sample_name | t | protocol | organism | input_file |
|-------------|---|:--------:|----------|------------|
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |

Using derived attributes:

| sample_name | t | protocol | organism | input_file |
|-------------|---|:--------:|----------|------------|
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |

---

Project config file:

```yaml
sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
```

Each `{variable}` names a column in the sample table and is replaced with that sample's value.
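
To make the substitution concrete, here is a minimal sketch of the idea in plain Python. It is an illustration under simplifying assumptions, not the implementation used by any PEP tool: the `derive` function is hypothetical, and it only handles simple `{column}` placeholders.

```python
def derive(sample, attributes, sources):
    """Return a copy of `sample` with derived attributes expanded.

    For each attribute listed in `attributes`, the current value is looked up
    as a key in `sources`; if found, the matching template is filled in with
    the sample's own column values.
    """
    out = dict(sample)
    for attr in attributes:
        key = out.get(attr)
        if key in sources:
            out[attr] = sources[key].format(**out)
    return out


# Mirrors the `derive` section of the project config above.
attributes = ["input_file"]
sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

frog = {"sample_name": "frog_0h", "t": "0", "protocol": "RNA-seq",
        "organism": "frog", "input_file": "my_samples"}
crab = {"sample_name": "crab_3h", "t": "3", "protocol": "RNA-seq",
        "organism": "crab", "input_file": "your_samples"}

print(derive(frog, attributes, sources)["input_file"])  # /path/to/my/samples/frog_0h.gz
print(derive(crab, attributes, sources)["input_file"])  # /path/to/your/samples/crab_3h.gz
```

Moving the data then only requires editing the template in `sources`, not every row of the table.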

<div class="well">Benefit: Files can live in different locations, and the project stays portable</div>

---

<span class="bullet"><img src="/_modules/pep-format/implies_white.svg" width="50" class="bullet">Implied attributes</span><br>

<div class="well">Add new sample attributes based on the values of existing attributes.</div>

<div class="col2">
Before:<br>

| sample_name | protocol | organism |
|-------------|:--------:|----------|
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |

</div>

<div class="col2">
After:<br>

| sample_name | protocol | organism | genome |
|-------------|:--------:|----------|--------|
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |

</div>

---

| sample_name | protocol | organism |
|-------------|:--------:|----------|
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |

Project config file:

```yaml
sample_modifiers:
  imply:
    - if: 
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10
```
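
The rule logic can be sketched in a few lines of Python. This is a simplified illustration, not the code any PEP tool runs: the `imply` function is hypothetical and assumes each `if` condition compares a column against a single value.

```python
def imply(sample, rules):
    """Return a copy of `sample` with attributes added by matching rules.

    A rule matches when every column named under "if" has the given value;
    the attributes under "then" are then added to the sample.
    """
    out = dict(sample)
    for rule in rules:
        if all(out.get(col) == val for col, val in rule["if"].items()):
            out.update(rule["then"])
    return out


# Mirrors the `imply` section of the project config above.
rules = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
]

human = {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"}
mouse = {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"}

print(imply(human, rules)["genome"])  # hg38
print(imply(mouse, rules)["genome"])  # mm10
```

A sample that matches no rule is returned unchanged, so the `genome` column is simply absent for it.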

<div class="well">Benefit: Separates project-level metadata from sample-level metadata</div>

---

<span class="bullet"><img src="/_modules/pep-format/subproject.svg" width="50" class="bullet">Subprojects</span><br>

<div class="well">Define alternative project settings that can be activated on demand.</div>

```yaml
project_modifiers:
  amend:
    diverse:
      sample_table: psa_rrbs_diverse.csv
    cancer:
      sample_table: psa_rrbs_intracancer.csv
```
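
Conceptually, activating one of these named variants overlays its settings on the base project. The sketch below shows that idea in plain Python under simplifying assumptions: the `activate` function, the base `sample_table` value, and the flat one-level override layout are all hypothetical and chosen for illustration only.

```python
def activate(base_config, name, variants):
    """Return a new config with the named variant's settings overlaid.

    Keys present in the variant replace the base values; everything else
    is inherited unchanged. The base config itself is not modified.
    """
    merged = dict(base_config)
    merged.update(variants[name])
    return merged


# Hypothetical base settings shared by every variant.
base_config = {
    "sample_table": "samples.csv",
    "output_dir": "/path/to/output/folder",
}

# One entry per named variant, holding only the settings that differ.
variants = {
    "diverse": {"sample_table": "psa_rrbs_diverse.csv"},
    "cancer": {"sample_table": "psa_rrbs_intracancer.csv"},
}

print(activate(base_config, "diverse", variants)["sample_table"])  # psa_rrbs_diverse.csv
print(activate(base_config, "cancer", variants)["output_dir"])     # /path/to/output/folder
```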

<div class="well">Benefit: Defines multiple similar projects in a single file</div>
---

# Thank You

<span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> &middot;
<span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> &middot;
<span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

</section>
---

We are now in the
  
# Era of Large Biomedical Data

<span class="fragment">Hypothesis: <br><br>

# The most important advances of the future will come from studies that can integrate data from many sources

</span>

<div class="fragment">Integrating data introduces two major challenges:
<br/>

<ol>
	<li><span class="fragment">Data scale</span></li>
	<li><span class="fragment">Data harmonization</span></li>
</ol>
</div>

---

# Why is data harmonization hard?

<div class="fragment">
Because the work grows quadratically.<br>
Adding a dataset to N existing ones adds N new pairwise comparisons.
<img src="/shorts/pepkit/stars.gif">
</div>
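
A quick way to check the arithmetic is to count the pairs directly. The sketch below is only an illustration; `pairwise_count` is a hypothetical helper that enumerates every pair of datasets, and the result matches the closed form N(N-1)/2.

```python
from itertools import combinations


def pairwise_count(n):
    """Count the distinct pairs among n datasets by enumerating them."""
    return sum(1 for _ in combinations(range(n), 2))


for n in (2, 3, 4, 5, 10):
    # Enumerated count next to the closed form n(n-1)/2.
    print(n, pairwise_count(n), n * (n - 1) // 2)

# Going from 5 to 6 datasets adds 5 new comparisons.
print(pairwise_count(6) - pairwise_count(5))
```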

---

# The conundrum

We stand to benefit immensely <br/> from integrating broader and broader data sources.<br><br>

BUT...the wider our integration effort,<br/>  the more challenging the integration.

---

<div>
<img 
	src="/shorts/pepkit/pep_logo_white.svg" 
	width="150"
>
<h3>PEPkit</h3>

A structure and toolkit for organizing large-scale, <br>
sample-intensive biological research projects<br>

<div class="small">
<a href="http://pepkit.github.io/">http://pepkit.github.io/</a><br>
</div>
</div>
<span class="small bullet"><img src="/shorts/pepkit/paper.svg" height="25" class="bullet"><a href="http://dx.doi.org/10.1093/gigascience/giab077">Sheffield et al. (2021).</a> <i>GigaScience</i>.</span>

<br/>
<ul class="fragment">
<li>1. Metadata management</li>
<li>2. Pipeline development</li>
<li>3. Reproducible computing environments</li>
</ul>

---