<style>
#title {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# A GA4GH standard for unique identifiers and compatibility of reference (pan)genomes

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

# Talk outline

1. Motivation (3 problems with reference genomes)
2. A proposed solution:
    1. GA4GH Refget protocol
    2. GA4GH Sequence collections
    3. Refgenie
3. Ideas for extension to pangenomes

---

<style type="text/css">
.wrap { width: 1400px; height: 1000px; overflow: hidden; }
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:18px; overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
</style>

## Motivation

Many tools require genome-related assets (like indexes).

- Where should we get them?
- How should we identify them (publication/analysis)?
- How should we organize them on disk?

---

## Problem 1

Who is the authoritative provider of the reference genome?

- NCBI?
- UCSC?
- Ensembl?

---

Variation includes:

- hard, soft, or no repeat masking?
- are alternative scaffolds included?
- are haplotypes included?
- how are chromosomes named (chr1, 1, or NC_000001.11)?
- how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
- are any decoy sequences included (like EBV)?

---

Andy Yates' "Genome provider analysis"

<table class="tg" style="margin-bottom:60px">
<thead>
<tr> <th class="tg">Provider</th> <th class="tg">Chr1 name</th> <th class="tg">Chr1 length</th> <th class="tg">Chr1 md5</th> <th class="tg">Num chroms</th> </tr>
</thead>
<tbody>
<tr> <td class="tg">Ensembl primary</td> <td class="tg">1</td> <td class="tg">248956422</td> <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td> <td class="tg">195</td> </tr>
<tr> <td class="tg">Ensembl toplevel</td> <td class="tg">1</td> <td class="tg">248956422</td> <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td> <td class="tg">649</td> </tr>
<tr> <td class="tg">NCBI</td> <td class="tg">NC_000001.11</td> <td class="tg">248956422</td> <td class="tg">6aef897c3d6ff0c78aff06ac189178dd</td> <td class="tg">640</td> </tr>
<tr> <td class="tg">UCSC</td> <td class="tg">chr1</td> <td class="tg">248956422</td> <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td> <td class="tg">456</td> </tr>
</tbody>
</table>

<div class="small">https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6</div>

---

## Problem 2

How should we identify what we used (in analysis or publication)?

"hg38"? "GRCh38"?

---

## Problem 3

How should we organize reference assets on disk?<br>

<img src="/_modules/refgenie-intro/folder_structures.svg" style="background:white" width="600">

---

These issues:

1. Subtle differences in reference assembly
2. Differences in how they are identified
3. Differences in how they are organized on disk

Lead to analysis challenges:

1. Lack of reproducibility of analysis
2. Lack of reusability of results
3. Lack of reusability of tools

<span class="fragment">*What are some solutions?*</span>

---

## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.

You download a tarball with a standard structure for your genome of interest, then write tools against that structure.

---

## The 'central repository' approach is limited

- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular.* No access to individual assets.
- *Not programmatic.* Can't access data/metadata via API.
- *Identifiers by central authority.* Who put Illumina in charge?

---

## An alternative solution

<img src="/_modules/refgenie-intro/ga4gh.png" style="background:white; padding:25px" width="400"><br/>

Refget → Sequence collections → Refgenie

---

## Refget

Refget enables access to reference sequences <br> using an identifier derived from the sequence itself.

<div class="small">
<a href="http://samtools.github.io/hts-specs/refget.html">http://samtools.github.io/hts-specs/refget.html</a><br>
<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/bioinformatics/btab524">Yates et al. (2022)</a>. <i>Bioinformatics</i>.</span><br/>
</div>

---

## How refget works

<div class="col2">
<img src="/_modules/refgenie-intro/refget_concept.svg" style="background:white" width="400">
</div>

<div class="col2 fragment">

## Limitations

- only handles a single sequence
- excludes chromosome names
- no capacity for annotation

</div>

---

## Extending to sequence collections

We need:

1. An algorithm to create a deterministic, unique digest from a collection of sequences
2. A server capable of retrieving sequences given an identifier

---

## First pass: Refgenie approach

<div class="col2">
<img src="/_modules/refgenie-intro/refgenie_seqcol_digest.svg" style="background:white; padding:15px" width="400">
</div>

<div class="col2">
<img src="/_modules/refgenie-intro/seqcol_lookup.svg" style="background:white; padding:15px" width="400">
</div>

<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021)</a>. <i>NAR Genomics and Bioinformatics</i>.</span><br/>

---

## Limitations and discussion

- Should we include sequence topology in the digest?
- What other attributes could we include?
- Are there better delimiters?
- How do we construct the 'string-to-digest'?
- How do we handle order of sequences?
- How should the API respond to requests?

---

## Project goal

- to standardize unique identifiers for collections of sequences
- can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences

## The project specifies

- an algorithm for computing sequence identifiers from collections
- a lookup API to retrieve a collection given an identifier
- a comparison API to assess compatibility of two collections

---

## How do we digest a sequence collection?

JSON object: each sequence collection attribute is a property

<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "6aef897c3d6ff0c78aff06ac189178dd",
    "f98db672eb0993dcfdabafe2a882905c",
    "76635a41ea913a405ded820447d067b0"
  ]
}
```

</div>

<div class="col2">
<br>← length of the sequences
<br>← names of the sequences
<br>← refget digests
</div>

---

## How do we digest a sequence collection?

You can drop the sequences attribute:

<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "6aef897c3d6ff0c78aff06ac189178dd",
    "f98db672eb0993dcfdabafe2a882905c",
    "76635a41ea913a405ded820447d067b0"
  ]
}
```

</div>

<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ]
}
```

</div>

---

## How do we digest a sequence collection?

Or add a topology attribute:

<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "6aef897c3d6ff0c78aff06ac189178dd",
    "f98db672eb0993dcfdabafe2a882905c",
    "76635a41ea913a405ded820447d067b0"
  ]
}
```

</div>

<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "6aef897c3d6ff0c78aff06ac189178dd",
    "f98db672eb0993dcfdabafe2a882905c",
    "76635a41ea913a405ded820447d067b0"
  ],
  "topologies": [
    "linear",
    "linear",
    "linear"
  ]
}
```

</div>

---

## Digest algorithm

1. Canonicalize each attribute following [RFC-8785 JSON Canonicalization Scheme](https://www.rfc-editor.org/rfc/rfc8785)
2. Digest each canonicalized string (GA4GH digest: SHA-512 truncated to 24 bytes, base64url-encoded)
3. Canonicalize the entire object with RFC-8785
4. Digest the canonicalized string

---

Example

<img src="/_modules/refgenie-intro/digest_algorithm_2.png">

<span class="small">Slide by Tim Cezard</span>

---

## Advantages

✔ Accommodates new attributes with backwards-compatibility

✔ Additional layer of recursion to assess individual attributes

✔ Relies on existing JCS standard for string encoding

---

## Comparison function

<table class="tg" style="margin-bottom:60px">
<thead>
<tr> <th class="tg">Provider</th> <th class="tg">Chr1 name</th> <th class="tg">Chr1 length</th> <th class="tg">Chr1 md5</th> <th class="tg">Num chroms</th> </tr>
</thead>
<tbody>
<tr> <td class="tg">Ensembl primary</td> <td class="tg">1</td> <td class="tg">248956422</td> <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td> <td class="tg">195</td> </tr>
<tr> <td class="tg">Ensembl toplevel</td> <td class="tg">1</td> <td class="tg">248956422</td> <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td> <td class="tg">649</td> </tr>
<tr> <td class="tg">NCBI</td> <td class="tg">NC_000001.11</td> <td class="tg">248956422</td> <td class="tg">6aef897c3d6ff0c78aff06ac189178dd</td> <td class="tg">640</td> </tr>
<tr> <td class="tg">UCSC</td> <td class="tg">chr1</td> <td class="tg">248956422</td> <td class="tg">2648ae1bacce4ec4b6cf337dcae37816</td> <td class="tg">456</td> </tr>
</tbody>
</table>

- seqcol 1: 047c6e1eda552b50c5add59ff0995
- seqcol 2: 2230c535660fb4774114bfa966a62

### How compatible are they?

Comparison endpoint `GET /compare/:digest1/:digest2`

---

```
GET /compare/59319772d1bcf2e0dd4b8a296f2d9682/2e7bc302a54ecec62d8155e19fbf2748
```

Response:

```json
{
  "digests": {
    "a": "59319772d1bcf2e0dd4b8a296f2d9682",
    "b": "2e7bc302a54ecec62d8155e19fbf2748"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": [
      "lengths",
      "names",
      "sequences",
      "names_lengths"
    ]
  },
  "elements": {
    "total": {
      "a": 3,
      "b": 3
    },
    "a-and-b": {
      "lengths": 3,
      "names": 3,
      "sequences": 3,
      "names_lengths": 3
    },
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false,
      "names_lengths": true
    }
  }
}
```

---

<img src="/_modules/refgenie-intro/refgenie_logo_light.svg" style="padding-top:25px; padding-bottom:25px; width: 350px">
<br>

A full-service reference genome manager.

<div class="small">
<a href="http://refgenie.databio.org">http://refgenie.databio.org</a><br>
</div>

<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/gigascience/giz149">Stolarczyk et al. (2020).</a> <i>GigaScience</i>.</span><br/>
<span class="small bullet"><img src="/_modules/refgenie-intro/paper.svg" height="25" class="bullet"><a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021).</a> <i>NAR Genomics and Bioinformatics</i>.</span><br/>

---

## Refgenie provides:

- *Two ways to retrieve an asset.*
    - `build` any asset from a recipe
    - `pull` any individual asset from a server
- *Better discoverability.*
    - `list/listr` shows assets
    - `refgenieserver` is a browseable web interface and API
- *Managed locations.*
    - `seek` returns the local path to assets
    - `add/remove` to manage your own assets

---

## Refgenie splits tasks between CLI and server

<img src="/_modules/refgenie-intro/refgenie_interfaces.svg" style="background:white" width="500">

---

## Refgenie CLI example

[http://refgenie.databio.org](http://refgenie.databio.org)

```console
$ pip install --user refgenie
```

List available remote assets with `listr`:

```console
$ refgenie listr
                        Remote refgenie assets
                Server URL: http://refgenomes.databio.org
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome              ┃ assets                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ mouse_chrM2x        │ fasta, bwa_index, bowtie2_index              │
│ hg38                │ fasta, bowtie2_index                         │
│ rCRSd               │ fasta, bowtie2_index                         │
│ human_repeats       │ fasta, hisat2_index, bwa_index               │
└─────────────────────┴──────────────────────────────────────────────┘
```

Retrieve a remote asset path with `seekr`:

```console
$ refgenie seekr hg38/fasta
http://awspds.refgenie.databio.org/refgenomes.databio.org/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta__default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.fa
```

---

Download a remote asset with `pull`:

```console
$ refgenie pull hg38/bowtie2_index
Downloading URL: http://rg.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index
94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index:default ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 128.0/117.0 KB • 1.8 MB/s • 0:00:00
Download complete: /Users/mstolarczyk/Desktop/testing/refgenie/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/bowtie2_index__default.tgz
Extracting asset tarball: /Users/mstolarczyk/Desktop/testing/refgenie/data/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index/bowtie2_index__default.tgz
Default tag for '94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index' set to: default
Created alias directories:
 - /Users/mstolarczyk/Desktop/testing/refgenie/alias/hg38/bowtie2_index/default
```

Retrieve a local asset path with `seek`:

```console
$ refgenie seek hg38/bowtie2_index
/project/shefflab/genomes_v04_210301/alias/hg38/bowtie2_index/default/hg38
```

---

Build a new local asset with `build`:

```console
$ refgenie build hg38/bwa_index
Saving outputs to:
- content: /project/shefflab/genomes_v04_210301/data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4
- logs: /project/shefflab/genomes_v04_210301/data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bwa_index/default/_refgenie_build
### Pipeline run code and environment:

* Command: `/home/ns5bc/.local/bin/refgenie build hg38/bwa_index`
* Compute host: udc-ba34-36
* Working dir: /sfs/qumulo/qhome/ns5bc
* Outfolder: /project/shefflab/genomes_v04_210301/data/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bwa_index/default/_refgenie_build/
* Pipeline started at: (06-14 07:43:16) elapsed: 1.0 _TIME_

...

### Pipeline completed. Epilogue
* Elapsed time (this run): 1:06:42
* Total elapsed time (all runs): 1:10:32
* Peak memory (this run): 10.8044 GB
* Pipeline completed time: 2023-06-14 07:43:28

Finished building 'bwa_index' asset
```

---

## How can we extend this to pangenomes?

<div class="fragment">

We can recurse one layer further

<img src="/_modules/refgenie-intro/checksum_recursion_2.svg" style="background-color:white; width:720px">

</div>

---

Recall the sequence collection structure

<div class="col2">

```json
{
  "lengths": [
    248956422,
    242193529,
    198295559
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "6aef897c3d6ff0c78aff06ac189178dd",
    "f98db672eb0993dcfdabafe2a882905c",
    "76635a41ea913a405ded820447d067b0"
  ]
}
```

</div>

<div class="col2">
<br>← length of the sequences
<br>← names of the sequences
<br>← refget digests
</div>

---

A pangenome is a collection of sequence collections

<div class="col2">

```json
{
  "lengths": [
    247,
    215,
    168,
    129,
    127
  ],
  "names": [
    "HG00099_pat",
    "HG00140_pat",
    "HG00280_pat",
    "HG00323_pat",
    "HG00408_pat"
  ],
  "collections": [
    "31fc6ca291a32fb9df82b85e5f077e31",
    "92c6a56c9e9459d8a42b96f7884710bc",
    "5f63cfaa3ef61f88c9635fb9d18ec945",
    "71981d019c54defbccd8c6d00858f97e",
    "4fd60ab00ce73271e4c729ecba284fe6c"
  ]
}
```

</div>

<div class="col2">
<br>← number of elements
<br><br>← names of the haplotypes
<br><br>← seqcol digests
</div>

---

1. Computable pangenome identifiers

```json
{
  "lengths": [ ... ],
  "names": [ ... ],
  "collections": [ ... ]
}
```

*→* `<pangenome_digest>`

<hr>

2. Retrievable pangenomes

`GET /pangenome/<pangenome_digest>` *→* retrieve pangenome structure

<hr>

3. Comparison of pangenome compatibility

`GET /compare/<pg_digest1>/<pg_digest2>` *→* compare pangenome contents

---

## 4. Refgenie for pangenomes

```console
$ refgenie pull hprc-yr1/vg_index
```

Pangenome-derived assets could live alongside linear genome assets, easing the transition of users to pangenome analysis.

---

## Summary

- Sequence collections provide universal, deterministic identifiers and a comparison function for reference genomes
- Refgenie is one tool that will benefit from this, simplifying distribution of reference genome assets
- Extending the approach to pangenomes provides a robust ecosystem for identifying and distributing pangenomes and related assets

---

## Sequence Collections

Unique identifiers and API for sequence collections.

<div class="small">
<a href="https://seqcol.readthedocs.io">https://seqcol.readthedocs.io</a>
</div>

<span class="small bullet"><img src="/_modules/seqcol/icons/paper.svg" height="25" class="bullet"> <a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021)</a>. <i>NAR Genomics and Bioinformatics</i>.</span>

---

## Problem

### Who is the authoritative provider of the reference genome?

- NCBI?
- UCSC?
- Ensembl?

---

## Variation includes:

- hard, soft, or no repeat masking?
- are alternative scaffolds included?
- are haplotypes included?
- how are chromosomes named (chr1, 1, or NC_000001.11)?
- how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
- are any decoy sequences included (like EBV)?

---

## Genome provider analysis (Andy Yates)

| Provider | Chr1 name | Chr1 length | Chr1 md5 | Num chroms |
|----------|-----------|-------------|----------|------------|
| Ensembl primary | 1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 195 |
| Ensembl toplevel | 1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 649 |
| NCBI | NC_000001.11 | 248956422 | 6aef897c3d6ff0c78aff06ac189178dd | 640 |
| UCSC | chr1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 456 |

<div class="small">https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6</div>

---

## Subtle differences lead to:

1. Lack of reproducibility of analysis
2. Lack of reusability of results

<div class="fragment">

### Solution

<img src="/_modules/seqcol/ga4gh.png" style="background:white; padding:25px" width="400">

Refget → Sequence collections

</div>

---

## Refget

Refget enables access to reference sequences using an identifier derived from the sequence itself.

<div class="small">
<a href="http://samtools.github.io/hts-specs/refget.html">http://samtools.github.io/hts-specs/refget.html</a>
</div>

---

## How refget works

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">
<img src="/_modules/seqcol/refget_concept.svg" style="background:white" width="100%">
</div>
<div class="fragment" style="width: 45%;">

### Limitations

- only handles a single sequence
- excludes chromosome names
- no capacity for annotation

</div>
</div>

---

## Extending to sequence collections

We need:

1. An algorithm to create a deterministic, unique digest from a collection of sequences
2. A server capable of retrieving sequences given an identifier

---

## Refgenie approach

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">
<img src="/_modules/seqcol/refgenie_seqcol_digest.svg" style="background:white; padding:15px" width="100%">
</div>
<div style="width: 45%;">
<img src="/_modules/seqcol/seqcol_lookup.svg" style="background:white; padding:15px" width="100%">
</div>
</div>

<span class="small bullet"><img src="/_modules/seqcol/icons/paper.svg" height="25" class="bullet"> <a href="https://doi.org/10.1093/nargab/lqab036">Stolarczyk, Xue, and Sheffield (2021)</a>. <i>NAR Genomics and Bioinformatics</i>.</span>

---

## refgenomes.databio.org

<img src="/_modules/seqcol/refgenomes_screenshot.png" style="background:white" width="800">

<a href="http://refgenomes.databio.org/">refgenomes.databio.org</a>

---

## Limitations and discussion

- Should we include sequence topology in the digest?
- What other attributes could we include?
- Are there better delimiters?
- How do we construct the 'string-to-digest'?
- How do we handle order of sequences?
- How should the API respond to requests?

---

## Project goal

- to standardize unique identifiers for collections of sequences
- can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences

## The project specifies:

- an algorithm for computing sequence identifiers from collections
- a lookup API to retrieve a collection given an identifier
- a comparison API to assess compatibility of two collections

---

## How do we digest a sequence collection?

JSON object: each sequence collection attribute is a property

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"],
  "sequences": [
    "31fc6ca291a32fb9df82b85e5f077e31",
    "92c6a56c9e9459d8a42b96f7884710bc",
    "5f63cfaa3ef61f88c9635fb9d18ec945"
  ]
}
```

---

## You can drop the sequences attribute

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%; background:#444">

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"],
  "sequences": [
    "31fc6ca291a32fb9df...",
    "92c6a56c9e9459d8a4...",
    "5f63cfaa3ef61f88c9..."
  ]
}
```

</div>
<div style="width: 45%; background:#222">

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"]
}
```

</div>
</div>

---

## Or add a topology attribute

```json
{
  "lengths": [4, 4, 8],
  "names": ["chr1", "chr2", "chrX"],
  "sequences": [
    "31fc6ca291a32fb9df...",
    "92c6a56c9e9459d8a4...",
    "5f63cfaa3ef61f88c9..."
  ],
  "topologies": ["linear", "linear", "circular"]
}
```

---

## Digest algorithm

1. Canonicalize each attribute following RFC-8785 (JSON Canonicalization Scheme)
2. Digest each canonicalized string (GA4GH digest: SHA-512 truncated to 24 bytes, base64url-encoded)
3. Canonicalize the entire object
4. Digest the canonicalized string

---

## Example

<img src="/_modules/seqcol/digest_algorithm_2.png">

Tim Cezard

---

## Advantages

- Accommodates new attributes with backwards-compatibility
- Additional layer of recursion to assess individual attributes
- Relies on existing JCS standard for string encoding

---

## What gets digested?

- Inherent attributes are included in the calculation of the identifier
- Non-inherent attributes enable storing additional metadata, comparison helpers, etc.
- These are specified using a [schema](https://seqcol.readthedocs.io/en/latest/decision_record/#2022-06-15-we-will-define-the-elements-of-a-sequence-collections-using-a-schema)

---

## Comparison function

| Provider | Chr1 name | Chr1 length | Chr1 md5 | Num chroms |
|----------|-----------|-------------|----------|------------|
| Ensembl primary | 1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 195 |
| Ensembl toplevel | 1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 649 |
| NCBI | NC_000001.11 | 248956422 | 6aef897c3d6ff0c78aff06ac189178dd | 640 |
| UCSC | chr1 | 248956422 | 2648ae1bacce4ec4b6cf337dcae37816 | 456 |

- seqcol 1: `047c6e1eda552b50c5add59ff0995`
- seqcol 2: `2230c535660fb4774114bfa966a62`

### How compatible are they?

Comparison endpoint `GET /comparison/:digest1/:digest2`

---

## Comparison result

```json
{
  "digests": {
    "a": "59319772d1bcf2e0dd4b8a296f2d9682",
    "b": "2e7bc302a54ecec62d8155e19fbf2748"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": ["lengths", "names", "sequences", "names_lengths"]
  },
  "elements": {
    "total": {"a": 3, "b": 3},
    "a-and-b": {"lengths": 3, "names": 3, "sequences": 3, "names_lengths": 3},
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false,
      "names_lengths": true
    }
  }
}
```

---

## Seqcol API demonstration

<a href="https://seqcolapi.databio.org/">https://seqcolapi.databio.org/</a>

---

## API endpoints

- `GET /service-info`
- `GET /collection/:digest`
- `GET /comparison/:digest1/:digest2`
- `POST /comparison/:digest1`

---

## Conclusions

- Refget provides universal IDs for individual sequences
- Sequence collections extends this to reference genomes
- Using a deterministic algorithm, you can find the identifier
- A lookup service can retrieve the original sequence
- A comparison function allows fine-grained compatibility tests
- Please follow along: https://github.com/ga4gh/seqcol-spec

---

<style>
#acknowledgements {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="acknowledgements" data-background="/images/presentations/bg.svg.png">

# Thank You

<br clear="all"/>

<span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> ·
<span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> ·
<span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

<div class="bullet" style="background-color:rgb(45,45,45,.65); border-radius: 25px; opacity:0.9">
<img src="/images/external/uva_dgs_logo.svg" height="65">
<img src="/images/logo/logo_databio_long.svg" height="45">
</div>

</section>
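
---

## Backup: sketch of the digest algorithm

The four steps on the "Digest algorithm" slide can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation, and it rests on several assumptions: the function names are made up for this slide; `canonicalize` uses `json.dumps` with compact separators and sorted keys, which only approximates RFC-8785 for simple values such as ASCII strings and small integers; and every attribute passed in is digested, with no distinction between inherent and non-inherent attributes. The GA4GH digest is taken to be SHA-512 truncated to 24 bytes and base64url-encoded. The example values are the toy collection from the earlier slide.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def canonicalize(value) -> bytes:
    """Approximate RFC-8785 canonical JSON (assumption: simple values only).

    A full JCS implementation also fixes number formatting and key ordering
    for arbitrary Unicode; this shortcut does not.
    """
    text = json.dumps(value, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


def seqcol_digest(seqcol: dict) -> str:
    """Two-level digest of a sequence collection (illustrative only)."""
    # Steps 1-2: canonicalize and digest each attribute array on its own.
    level1 = {name: sha512t24u(canonicalize(values)) for name, values in seqcol.items()}
    # Steps 3-4: canonicalize the object of attribute digests, then digest it.
    return sha512t24u(canonicalize(level1))


# Toy collection from the "How do we digest a sequence collection?" slide.
example = {
    "lengths": [4, 4, 8],
    "names": ["chr1", "chr2", "chrX"],
    "sequences": [
        "31fc6ca291a32fb9df82b85e5f077e31",
        "92c6a56c9e9459d8a42b96f7884710bc",
        "5f63cfaa3ef61f88c9635fb9d18ec945",
    ],
}

print(seqcol_digest(example))
```

Because each attribute is digested separately before the top-level digest, two collections that differ only in `names` still share their `lengths` and `sequences` digests, which is what makes attribute-level comparison cheap.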