# A GA4GH standard for unique identifiers and compatibility of reference (pan)genomes

Nathan Sheffield

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>
---

# Talk outline

1. Motivation (3 problems with reference genomes)
2. A proposed solution:
    1. GA4GH Refget protocol
    2. GA4GH Sequence collections
    3. Refgenie
3. Ideas for extension to pangenomes

---

<style>
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:18px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
</style>

## Motivation

Many tools require genome-related assets (like indexes).

- Where should we get them?
- How should we identify them (publication/analysis)?
- How should we organize them on disk?

---

## Problem 1

Who is the authoritative provider of the reference genome?

- NCBI?
- UCSC?
- Ensembl?

---

Variation includes:

- hard, soft, or no repeat masking?
- are alternative scaffolds included?
- are haplotypes included?
- how are chromosomes named (chr1, 1, or NC_000001.11)?
- how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
- Are any decoy sequences included (like EBV)?

---

Andy Yates' "Genome provider analysis"

<table class="tg" style="margin-bottom:60px">
<thead>
  <tr>
    <th class="tg">Provider</th>
    <th class="tg">Chr1 name</th>
    <th class="tg">Chr1 length</th>
    <th class="tg">Chr1 md5</th>
    <th class="tg">Num chroms</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg">Ensembl primary</td>
    <td class="tg">1</td>
    <td class="tg">248956422</td>
    <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td>
    <td class="tg">195</td>
  </tr>
  <tr>
    <td class="tg">Ensembl toplevel</td>
    <td class="tg">1</td>
    <td class="tg">248956422</td>
    <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td>
    <td class="tg">649</td>
  </tr>
  <tr>
    <td class="tg">NCBI</td>
    <td class="tg">NC_000001.11</td>
    <td class="tg">248956422</td>
    <td class="tg">6aef897c3d6ff0c78aff06ac189178dd</td>
    <td class="tg">640</td>
  </tr>
  <tr>
    <td class="tg">UCSC</td>
    <td class="tg">chr1</td>
    <td class="tg">248956422</td>
    <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td>
    <td class="tg">456</td>
  </tr>
</tbody>
</table>

<div class="small">https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6</div>
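
To make the table concrete, here is a minimal Python sketch that groups the four providers by their chr1 md5. The rows are copied from the table; the function name `group_by_md5` is only illustrative.

```python
from collections import defaultdict

# Rows copied from the table: provider, chr1 name, chr1 length, chr1 md5, number of sequences.
PROVIDERS = [
    ("Ensembl primary", "1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816", 195),
    ("Ensembl toplevel", "1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816", 649),
    ("NCBI", "NC_000001.11", 248956422, "6aef897c3d6ff0c78aff06ac189178dd", 640),
    ("UCSC", "chr1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816", 456),
]


def group_by_md5(rows):
    """Map each chr1 md5 to the list of providers that report it."""
    groups = defaultdict(list)
    for provider, _name, _length, md5, _num_chroms in rows:
        groups[md5].append(provider)
    return dict(groups)


groups = group_by_md5(PROVIDERS)
for md5, providers in groups.items():
    print(md5, providers)
```

Three providers share one chr1 digest under two different names, while the sequence counts differ in every row: neither the name nor the assembly label tells you whether two references are compatible.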

---

## Problem 2

How should we identify what we used
(in analysis or publication)?

"hg38"? "GRCh38"?

---

## Problem 3

How should we organize reference assets on disk?<br>

---

These issues:

1. Subtle differences in reference assembly
2. Differences in how they are identified
3. Differences in how they are organized on disk

Lead to analysis challenges:

1. Lack of reproducibility of analysis
2. Lack of reusability of results
3. Lack of reusability of tools

<span class="fragment">*What are some solutions?*</span>

---

## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.

You download a tarball with a standard structure for your genome of interest, then build your tools around that structure.

---

## The 'central repository' approach is limited

- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular*. No access to individual assets.
- *Not programmatic*. Can't access data/metadata via API.
- *Identifiers by central authority*. Who put Illumina in charge?

---

## An alternative solution

Refget → Sequence collections → Refgenie

---

## Refget

Refget enables access to reference sequences <br>
using an identifier derived from the sequence itself.

<div class="small">
<a href="http://samtools.github.io/hts-specs/refget.html">http://samtools.github.io/hts-specs/refget.html</a><br>
<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab524">Yates et al. (2022)</a>. <i>Bioinformatics</i>.</span><br/>
</div>
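
A minimal Python sketch of a sequence-derived identifier, assuming the sequence is normalized to uppercase with whitespace removed before hashing. It computes an md5 hex digest and the GA4GH-style digest (SHA-512, truncated to 24 bytes, base64url-encoded). The function names are illustrative, not part of any refget library.

```python
import base64
import hashlib


def normalize(sequence: str) -> bytes:
    """Remove whitespace and uppercase, so masking case and line wrapping do not matter."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_digest(sequence: str) -> str:
    """Hex md5 of the normalized sequence."""
    return hashlib.md5(normalize(sequence)).hexdigest()


def sha512t24u(sequence: str) -> str:
    """SHA-512 of the normalized sequence, truncated to 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(normalize(sequence)).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


print(md5_digest("ACGT"))
print(sha512t24u("ACGT"))
```

The identifier depends only on the sequence content, so a soft-masked copy (`acgt`) or a differently named copy gets the same digest.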

---

## How refget works

## Limitations

- only handles a single sequence
- excludes chromosome names
- no capacity for annotation

</div>

---

## Extending to sequence collections

We need:

1. An algorithm to create a deterministic, unique digest from a collection of sequences
2. A server capable of retrieving sequences given an identifier

---

## First pass: Refgenie approach

<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021)</a>. <i>NAR Genomics and Bioinformatics</i>.</span><br/>

---

## Limitations and discussion

- Should we include sequence topology in the digest?
- What other attributes could we include?
- Are there better delimiters?
- How do we construct the 'string-to-digest'?
- How do we handle order of sequences?
- How should the API respond to requests?

---

## Project goal

- to standardize unique identifiers for collections of sequences
- can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences

## The project specifies

- an algorithm for computing identifiers from sequence collections
- a lookup API to retrieve a collection given an identifier
- a comparison API to assess compatibility of two collections

---

## How do we digest a sequence collection?

JSON object: each sequence collection attribute is a property

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "6aef897c3d6ff0c78aff06ac189178dd",
    "f98db672eb0993dcfdabafe2a882905c",
    "76635a41ea913a405ded820447d067b0"
  ]
}
```

</div>
<div class="col2">

<br>← length of the sequences
<br>← names of the sequences
<br>← refget digests

</div>

---

## How do we digest a sequence collection?

You can drop the sequences attribute:

</div>
<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ]
}
```

</div>

---

## How do we digest a sequence collection?

Or add a topology attribute:

</div>
<div class="col2">

</div>

---

## Digest algorithm

1. Canonicalize each attribute following [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785)
2. Digest each string (GA4GH digest: SHA512 truncated to 24 bytes, then base64url-encoded)
3. Canonicalize the entire object with RFC-8785
4. Digest the canonicalized string
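
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: `json.dumps` with sorted keys and compact separators stands in for RFC-8785 and only matches it for the ASCII strings and integers used here. The collection reuses the values from the earlier slide.

```python
import base64
import hashlib
import json


def canonical_json(value) -> bytes:
    """Compact JSON with sorted keys; approximates RFC-8785 for ASCII strings and integers."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def ga4gh_digest(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def seqcol_digest(collection: dict):
    """Return (top-level digest, per-attribute digests) for a sequence collection."""
    # Steps 1-2: canonicalize and digest each attribute separately.
    attribute_digests = {
        name: ga4gh_digest(canonical_json(values)) for name, values in collection.items()
    }
    # Steps 3-4: canonicalize the object of attribute digests, then digest it.
    return ga4gh_digest(canonical_json(attribute_digests)), attribute_digests


collection = {
    "lengths": [248956422, 242193529, 198295559],
    "names": ["chr1", "chr2", "chr3"],
    "sequences": [
        "6aef897c3d6ff0c78aff06ac189178dd",
        "f98db672eb0993dcfdabafe2a882905c",
        "76635a41ea913a405ded820447d067b0",
    ],
}

top_level, per_attribute = seqcol_digest(collection)
print(top_level)
print(per_attribute)
```

Because each attribute is digested on its own before the final digest, two collections can be compared attribute by attribute without fetching the underlying arrays.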

---

Example

<span class="small">Slide by Tim Cezard</span>

---

## Advantages

&#10004; Accommodates new attributes with backwards-compatibility<br>
&#10004; Additional layer of recursion to assess individual attributes<br>
&#10004; Relies on existing JCS standard for string encoding

---

## Comparison function

- seqcol 1: 047c6e1eda552b50c5add59ff0995
- seqcol 2: 2230c535660fb4774114bfa966a62

### How compatible are they?

Comparison endpoint

`GET /compare/:digest1/:digest2`

---

```
GET /compare/59319772d1bcf2e0dd4b8a296f2d9682/2e7bc302a54ecec62d8155e19fbf2748
```

Response:

```json
{
  "digests": {
    "a": "59319772d1bcf2e0dd4b8a296f2d9682",
    "b": "2e7bc302a54ecec62d8155e19fbf2748"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": [
      "lengths",
      "names",
      "sequences",
      "names_lengths"
    ]
  },
  "elements": {
    "total": {
      "a": 3,
      "b": 3
    },
    "a-and-b": {
      "lengths": 3,
      "names": 3,
      "sequences": 3,
      "names_lengths": 3
    },
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false,
      "names_lengths": true
    }
  }
}
```
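
One way such a response could be computed locally is sketched below in Python. The key names follow the response shown above, but the exact semantics (how totals, overlaps, and order are defined) are assumptions made for illustration, and the `digests` block is omitted.

```python
from collections import Counter


def compare(a: dict, b: dict) -> dict:
    """Compare two sequence collections given as attribute -> list mappings."""
    shared = [attr for attr in a if attr in b]
    result = {
        "arrays": {
            "a-only": [attr for attr in a if attr not in b],
            "b-only": [attr for attr in b if attr not in a],
            "a-and-b": shared,
        },
        "elements": {
            # Assumption: the total is the length of the longest array in each collection.
            "total": {
                "a": max((len(v) for v in a.values()), default=0),
                "b": max((len(v) for v in b.values()), default=0),
            },
            "a-and-b": {},
            "a-and-b-same-order": {},
        },
    }
    for attr in shared:
        overlap = Counter(a[attr]) & Counter(b[attr])  # multiset intersection
        result["elements"]["a-and-b"][attr] = sum(overlap.values())
        # Keep only shared elements, then check whether both lists agree in order.
        in_a = [x for x in a[attr] if x in overlap]
        in_b = [x for x in b[attr] if x in overlap]
        result["elements"]["a-and-b-same-order"][attr] = in_a == in_b
    return result


a = {
    "lengths": [248956422, 242193529, 198295559],
    "names": ["chr1", "chr2", "chr3"],
    "sequences": [
        "6aef897c3d6ff0c78aff06ac189178dd",
        "f98db672eb0993dcfdabafe2a882905c",
        "76635a41ea913a405ded820447d067b0",
    ],
}
# Same chromosomes in a different order, without the sequences attribute.
b = {
    "lengths": [242193529, 248956422, 198295559],
    "names": ["chr2", "chr1", "chr3"],
}

report = compare(a, b)
print(report)
```

In this toy case every name and length is shared but the order differs, which is exactly the kind of near-compatibility that a plain equality check on identifiers cannot express.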

---

Refgenie: a full-service reference genome manager.

<div class="small">
<a href="http://refgenie.databio.org">http://refgenie.databio.org</a><br>
</div>

<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/gigascience/giz149">Stolarczyk et al. (2020).</a> <i>GigaScience</i>.</span><br/>
<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021).</a> <i>NAR Genomics and Bioinformatics</i>.</span><br/>

---

## Refgenie provides:

- *Two ways to retrieve an asset.*
  - `build` any asset from a recipe.
  - `pull` any individual asset from a server
- *Better discoverability.*
  - `list/listr` shows assets
  - `refgenieserver` is a browseable web interface and API
- *Managed locations.*
  - `seek` returns the local path to assets
  - `add/remove` to manage your own assets

---

## Refgenie splits tasks between CLI and server

---

## Refgenie CLI example

[http://refgenie.databio.org](http://refgenie.databio.org)

```console
$ pip install --user refgenie
```

List available remote assets with `listr`:

```console
$ refgenie listr

Remote refgenie assets
                 Server URL: http://refgenomes.databio.org
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome              ┃ assets                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mouse_chrM2x        │ fasta, bwa_index, bowtie2_index              │
│ hg38                │ fasta, bowtie2_index                         │
│ rCRSd               │ fasta, bowtie2_index                         │
│ human_repeats       │ fasta, hisat2_index, bwa_index               │
└─────────────────────┴──────────────────────────────────────────────┘
```

Retrieve a remote asset path with `seekr`:

```console
$ refgenie seekr hg38/fasta
http://awspds.refgenie.databio.org/refgenomes.databio.org/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta__default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.fa

```

---

Download a remote asset with `pull`:

```console
$ refgenie pull hg38/bowtie2_index

Downloading URL: http://rg.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index
94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index:default ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 128.0/117.0 KB • 1.8 MB/s • 0:00:00
Download complete: /Users/mstolarczyk/Desktop/testing/refgenie/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/bowtie2_index__default.tgz
Extracting asset tarball: /Users/mstolarczyk/Desktop/testing/refgenie/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/bowtie2_index__default.tgz
Default tag for '94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index' set to: default
Created alias directories:
 - /Users/mstolarczyk/Desktop/testing/refgenie/alias/hg38/bowtie2_index/default
```

Retrieve a local asset path with `seek`:

```console
$ refgenie seek hg38/bowtie2_index

/project/shefflab/genomes_v04_210301/alias/hg38/bowtie2_index/default/hg38
```
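
A pipeline can ask refgenie for the path instead of hard-coding it. The Python sketch below simply shells out to the `refgenie seek` command shown above; it assumes `refgenie` is on the `PATH` and already configured, and the helper names `seek_command` and `seek` are hypothetical.

```python
import subprocess


def seek_command(genome: str, asset: str) -> list:
    """Build the argument list for `refgenie seek <genome>/<asset>`."""
    return ["refgenie", "seek", f"{genome}/{asset}"]


def seek(genome: str, asset: str) -> str:
    """Run refgenie and return the local asset path it prints."""
    completed = subprocess.run(
        seek_command(genome, asset),
        capture_output=True,
        text=True,
        check=True,  # raise if refgenie exits with an error
    )
    return completed.stdout.strip()


print(seek_command("hg38", "bowtie2_index"))
```

Calling `seek("hg38", "bowtie2_index")` on a configured machine would return the path printed by the CLI, so the same pipeline code runs unchanged wherever the assets happen to live.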

---

Build a new local asset with `build`:

```console
$ refgenie build hg38/bwa_index

Saving outputs to:
- content: /project/shefflab/genomes_v04_210301/data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4
- logs: /project/shefflab/genomes_v04_210301/data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bwa_index/default/_refgenie_build

### Pipeline run code and environment:

*              Command:  `/home/ns5bc/.local/bin/refgenie build hg38/bwa_index`
*         Compute host:  udc-ba34-36
*          Working dir:  /sfs/qumulo/qhome/ns5bc
*            Outfolder:  /project/shefflab/genomes_v04_210301/data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bwa_index/default/_refgenie_build/
*  Pipeline started at:   (06-14 07:43:16) elapsed: 1.0 _TIME_
...
### Pipeline completed. Epilogue
*        Elapsed time (this run):  1:06:42
*  Total elapsed time (all runs):  1:10:32
*         Peak memory (this run):  10.8044 GB
*        Pipeline completed time: 2023-06-14 07:43:28
Finished building 'bwa_index' asset

```

---

## How can we extend this to pangenomes?

<div class="fragment">
We can recurse one layer further
</div>

---

Recall the sequence collection structure

</div>
<div class="col2">

<br>← length of the sequences
<br>← names of the sequences
<br>← refget digests

</div>

---

A pangenome is a collection of sequence collections

```json
{
  "lengths": [
    247,
    215,
    168,
    129,
    127
  ],
  "names": [
    "HG00099_pat",
    "HG00140_pat",
    "HG00280_pat",
    "HG00323_pat",
    "HG00408_pat"
  ],
  "collections": [
    "31fc6ca291a32fb9df82b85e5f077e31",
    "92c6a56c9e9459d8a42b96f7884710bc",
    "5f63cfaa3ef61f88c9635fb9d18ec945",
    "71981d019c54defbccd8c6d00858f97e",
    "4fd60ab00ce73271e4c729ecba284fe6c"
  ]
}
```

</div>
<div class="col2">

<br>← number of elements
<br><br>← names of the haplotypes
<br><br>← seqcol digests

</div>
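
A minimal Python sketch of that extra layer of recursion, assuming the same two-level digest used for sequence collections is applied to the pangenome object. The haplotype names come from the example above; the tiny per-haplotype collections are made-up toy values, and `json.dumps` only approximates RFC-8785 for this simple data.

```python
import base64
import hashlib
import json


def canonical_json(value) -> bytes:
    """Compact JSON with sorted keys; approximates RFC-8785 for simple ASCII data."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def ga4gh_digest(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def digest_object(obj: dict) -> str:
    """Digest each attribute, then digest the object of attribute digests."""
    level1 = {name: ga4gh_digest(canonical_json(values)) for name, values in obj.items()}
    return ga4gh_digest(canonical_json(level1))


# Toy sequence collections, one per haplotype (values are illustrative only).
haplotypes = {
    "HG00099_pat": {"lengths": [4, 4], "names": ["chr1", "chr2"]},
    "HG00140_pat": {"lengths": [4, 8], "names": ["chr1", "chrX"]},
}

# The pangenome object has the same shape, with seqcol digests as its elements.
pangenome = {
    "lengths": [len(c["names"]) for c in haplotypes.values()],
    "names": list(haplotypes),
    "collections": [digest_object(c) for c in haplotypes.values()],
}

pangenome_digest = digest_object(pangenome)
print(pangenome_digest)
```

Changing any sequence name or length in any haplotype changes that haplotype's digest and therefore the pangenome digest, while unchanged haplotypes keep their identifiers.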

---

1. Computable pangenome identifiers

```json
{
  "lengths": [ ... ],
  "names": [ ... ],
  "collections": [ ... ]
}
```

*→* `<pangenome_digest>`

<hr>

2. Retrievable pangenomes

`GET /pangenome/<pangenome_digest>`
*→* retrieve pangenome structure

<hr>

3. Comparison of pangenome compatibility

`GET /compare/<pg_digest1>/<pg_digest2>`
*→* compare pangenome contents

---

## 4. Refgenie for pangenomes

```console
$ refgenie pull hprc-yr1/vg_index
```

Pangenome-derived assets could live alongside linear genome assets, easing users' transition to pangenome analysis.

---

## Summary

- Sequence collections provide universal, deterministic identifiers and a comparison function for reference genomes
- Refgenie is one tool that will benefit, by simplifying distribution of reference genome assets
- Extension to pangenomes provides a robust ecosystem for identifying and distributing pangenomes and related assets

---

## Sequence Collections

Unique identifiers and API for sequence collections.

<div class="small">
<a href="https://seqcol.readthedocs.io">https://seqcol.readthedocs.io</a>
</div>

<span class="small bullet"><img src="/_modules/seqcol/icons/paper.svg" height="25" class="bullet"> <a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021)</a>. <i>NAR Genomics and Bioinformatics</i>.</span>

---

## Problem

### Who is the authoritative provider of the reference genome?

- NCBI?
- UCSC?
- Ensembl?

---

## Variation includes:

---

## Genome provider analysis (Andy Yates)

| Provider | Chr1 name | Chr1 length | Chr1 md5 | Num chroms |
|----------|-----------|-------------|----------|------------|
| Ensembl primary | 1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 195 |
| Ensembl toplevel | 1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 649 |
| NCBI | NC_000001.11 | 248956422 | 6aef897c3d6ff0c78aff06ac189178dd | 640 |
| UCSC | chr1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 456 |

<div class="small">https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6</div>

---

## Subtle differences lead to:

1. Lack of reproducibility of analysis
2. Lack of reusability of results

### Solution

Refget → Sequence collections
</div>

---

## Refget

Refget enables access to reference sequences
using an identifier derived from the sequence itself.

<div class="small">
<a href="http://samtools.github.io/hts-specs/refget.html">http://samtools.github.io/hts-specs/refget.html</a>
</div>

---

## How refget works

### Limitations

- only handles a single sequence
- excludes chromosome names
- no capacity for annotation
</div>
</div>

---

## Extending to sequence collections

We need:

1. An algorithm to create a deterministic, unique digest from a collection of sequences
2. A server capable of retrieving sequences given an identifier

---

## Refgenie approach

---

## refgenomes.databio.org

<a href="http://refgenomes.databio.org/">refgenomes.databio.org</a>

---

## Limitations and discussion

---

## Project goal

- to standardize unique identifiers for collections of sequences
- can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences

## The project specifies:

- an algorithm for computing identifiers from sequence collections
- a lookup API to retrieve a collection given an identifier
- a comparison API to assess compatibility of two collections

---

## How do we digest a sequence collection?

JSON object: each sequence collection attribute is a property

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"],
  "sequences": [
    "31fc6ca291a32fb9df82b85e5f077e31",
    "92c6a56c9e9459d8a42b96f7884710bc",
    "5f63cfaa3ef61f88c9635fb9d18ec945"
  ]
}
```

---

## You can drop the sequences attribute

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"],
  "sequences": [
    "31fc6ca291a32fb9df...",
    "92c6a56c9e9459d8a4...",
    "5f63cfaa3ef61f88c9..."
  ]
}
```

</div>
<div style="width: 45%; background:#222">

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"]
}
```

</div>
</div>

---

## Or add a topology attribute

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"],
  "sequences": [
    "31fc6ca291a32fb9df...",
    "92c6a56c9e9459d8a4...",
    "5f63cfaa3ef61f88c9..."
  ],
  "topologies": ["linear", "linear", "circular"]
}
```

---

## Digest algorithm

1. Canonicalize each attribute following RFC-8785 (JSON Canonicalization Scheme)
2. Digest each string (GA4GH digest: SHA512 truncated to 24 bytes, then base64url-encoded)
3. Canonicalize the entire object
4. Digest the canonicalized string

---

## Example

Tim Cezard

---

## Advantages

- Accommodates new attributes with backwards-compatibility
- Additional layer of recursion to assess individual attributes
- Relies on existing JCS standard for string encoding

---

## What gets digested?

- Inherent attributes are included in the calculation of the identifier
- Non-inherent attributes enable storing additional metadata, comparison helpers, etc.
- These are specified using a [schema](https://seqcol.readthedocs.io/en/latest/decision_record/#2022-06-15-we-will-define-the-elements-of-a-sequence-collections-using-a-schema)
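
A small Python sketch of the inherent/non-inherent split, assuming the schema exposes a plain list of inherent attribute names. The `SCHEMA` layout and the `description` attribute are hypothetical, chosen only to show that non-inherent attributes do not affect the identifier.

```python
import base64
import hashlib
import json

# Hypothetical schema fragment: only these attributes contribute to the identifier.
SCHEMA = {"inherent": ["lengths", "names", "sequences"]}


def canonical_json(value) -> bytes:
    """Compact JSON with sorted keys; approximates RFC-8785 for simple ASCII data."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def ga4gh_digest(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def digest_inherent(collection: dict, schema: dict = SCHEMA) -> str:
    """Digest only the attributes that the schema marks as inherent."""
    inherent = {k: v for k, v in collection.items() if k in schema["inherent"]}
    level1 = {k: ga4gh_digest(canonical_json(v)) for k, v in inherent.items()}
    return ga4gh_digest(canonical_json(level1))


base = {
    "lengths": [4, 4, 8],
    "names": ["chr1", "chr2", "chrX"],
    "sequences": [
        "31fc6ca291a32fb9df82b85e5f077e31",
        "92c6a56c9e9459d8a42b96f7884710bc",
        "5f63cfaa3ef61f88c9635fb9d18ec945",
    ],
}
# Same collection plus a non-inherent metadata attribute.
annotated = dict(base, description="toy example")

print(digest_inherent(base) == digest_inherent(annotated))
```

Metadata can then be added or revised later without invalidating identifiers that have already been published.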

---

## Comparison function

- seqcol 1: `047c6e1eda552b50c5add59ff0995`
- seqcol 2: `2230c535660fb4774114bfa966a62`

### How compatible are they?

Comparison endpoint

---

## Comparison result

```json
{
  "digests": {
    "a": "59319772d1bcf2e0dd4b8a296f2d9682",
    "b": "2e7bc302a54ecec62d8155e19fbf2748"
  },
  "arrays": {
    "a-only": [], "b-only": [],
    "a-and-b": ["lengths", "names", "sequences", "names_lengths"]
  },
  "elements": {
    "total": {"a": 3, "b": 3},
    "a-and-b": {"lengths": 3, "names": 3, "sequences": 3, "names_lengths": 3},
    "a-and-b-same-order": {
      "lengths": false, "names": false,
      "sequences": false, "names_lengths": true
    }
  }
}
```

---

## Seqcol API demonstration

<a href="https://seqcolapi.databio.org/">https://seqcolapi.databio.org/</a>

---

## API endpoints

- `GET /service-info`
- `GET /collection/:digest`
- `GET /comparison/:digest1/:digest2`
- `POST /comparison/:digest1`

---

## Conclusions

- Refget provides universal IDs for individual sequences
- Sequence collections extend this to reference genomes
- Because the algorithm is deterministic, anyone can compute the identifier locally
- A lookup service can retrieve the original collection from its identifier
- A comparison function allows fine-grained compatibility tests
- Please follow along: https://github.com/ga4gh/seqcol-spec

---

# Thank You

<span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> &middot;
<span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> &middot;
<span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

</section>