<style>
#title {
  height: 100% !important;
  display: flex !important;
  flex-direction: column !important;
  justify-content: center !important;
}
</style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# GKS Schemas and the proposed GA4GH Schema Registry

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

## GKS: Genomic Knowledge Standards

- One composite YAML source per product with custom extensions (inheritance, maturity levels)
- `make` compiles to individual JSON Schema files (Draft 2020-12)
- Published via GitHub releases with permanent w3id.org URLs

---

## GKS: Build system

<pre class="mermaid">
graph LR
    A[vrs-source.yaml] --> M{{make}} --> B[JSON Schema files]
    B --> C[Allele.json]
    B --> D[SequenceLocation.json]
    B --> E[25+ classes...]
    C --> F[w3id.org URLs]
</pre>

<div style="display: flex; justify-content: space-between;">
<div style="width: 38%;">

```
ga4gh/vrs/
├── schema/
│   ├── vrs/
│   │   ├── vrs-source.yaml
│   │   ├── Makefile
│   │   └── json/
│   │       ├── Allele
│   │       ├── SequenceLocation
│   │       ├── CopyNumberCount
│   │       └── 25 more...
│   └── gks-core/
├── examples/
├── validation/
└── docs/
```

</div>
<div style="width: 58%;">

```yaml
# schema/vrs/vrs-source.yaml
$schema: ".../draft/2020-12/schema"
$id: ".../ga4gh/schema/vrs/2.0.1/..."
title: "GA4GH-VRS-Definitions"
imports:
  gks-core: ../gks-core/...-source.yaml
$defs:
  Ga4ghIdentifiableObject:
    maturity: trial use
    inherits: gks-core:Entity
  Variation:
    inherits: Ga4ghIdentifiableObject
  Allele:
    inherits: Variation
  ...
```

</div>
</div>

---

## Produced output: json/Allele

https://w3id.org/ga4gh/schema/vrs/2.0.1/json/Allele

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://w3id.org/ga4gh/schema/vrs/2.0.1/json/Allele",
  "title": "Allele",
  "type": "object",
  "maturity": "trial use",
  "description": "The state of a molecule at a Location.",
  "properties": {
    "type": { "const": "Allele" },
    "location": {
      "oneOf": [
        { "$ref": ".../vrs/2.0.1/json/SequenceLocation" },
        { "$ref": ".../gks-core/1.1.0/json/iriReference" }
      ]
    },
    "state": {
      "oneOf": [
        { "$ref": ".../vrs/2.0.1/json/LengthExpression" },
        { "$ref": ".../vrs/2.0.1/json/LiteralSequenceExpression" },
        { "$ref": ".../vrs/2.0.1/json/ReferenceLengthExpression" }
      ]
    },
    ...
  },
  "required": ["location", "state", "type"]
}
```

---

## The proposed GA4GH Schema Registry

A **read-only REST API specification** -- a thin interoperability layer

<pre class="mermaid">
graph LR
    subgraph Servers
        A1[EBI Registry]
        A2[GA4GH Registry]
        A3[Lab Registry on GitHub Pages]
    end
    A1 & A2 & A3 --> B[Schema Registry API spec] --> C1 & C2 & C3
    subgraph Clients
        C1[Web UI]
        C2[CLI tool]
        C3[SDK]
    end
    style B stroke:#f90,stroke-width:3px
</pre>

```
GET /namespaces                             # List organizations
GET /schemas/{namespace}                    # List schemas
GET /schemas/{ns}/{schema}/versions         # List versions
GET /schemas/{ns}/{schema}/versions/{ver}   # Get JSON Schema
```

**Not specified:** backend storage, build process, frontend UI, or a GA4GH-condoned registry instance -- only the API contract between them

---

## [ga4gh/schema-registry#7](https://github.com/ga4gh/schema-registry/issues/7)

<img src="/slides/gks-schema-registry/issue-7.png" width="700">

---

## **Benefits of a registry**

<pre class="mermaid">
graph TD
    A["<b>Interoperability of schema access</b><br/>Shared API: any client talks to any server"]
    A --> B["<b>Discovery</b><br/>List, search, and filter schemas across organizations"]
    B --> C["<b>Reuse</b><br/>Adopt existing schemas instead of reinventing"]
    C --> D["<b>Interoperability of analyses</b><br/>Compatible data formats, shared tooling"]
    style A fill:#2a4a2a,stroke:#4a8a4a
    style D fill:#2a4a2a,stroke:#4a8a4a
</pre>

---

## Possible benefits to GKS schemas

Current GKS distribution works for **static access** via w3id.org

But how do you...

- Programmatically list available schemas
- Search or filter by maturity level
- Discover available versions without navigating GitHub
- Find schemas across organizations
- Link GKS schemas to non-GKS schemas

The registry could fill the **discoverability gap**

The registry could encourage **interoperability** through a shared access mechanism

...but there are some tensions <!-- .element: class="fragment" -->

---

## Tensions

1. **Existing distribution system** -- GKS already distributes schemas via w3id.org permanent URLs; the registry specifies different URLs
2. **Layer mismatch** -- GKS has 3 levels (org / product / class), but the registry spec has 2 (namespace / schema)

---

## Tension 1: Existing build and distribution system

GKS schemas already have a distribution pipeline:

- Source YAML → `make` → individual JSON Schema files
- Published via GitHub releases
- Permanent URLs via **w3id.org** (e.g. `https://w3id.org/ga4gh/schema/vrs/2.0.1/json/Allele`)

The schema registry specifies **different URLs**:

- `GET /schemas/ga4gh/vrs/versions/2.0.1`

How do these coexist? Which is canonical? Does the registry replace w3id.org or layer on top? How does the build system feed into a schema registry?

---

## Tension 1: Possible solutions

1. **Overlay** -- Nothing changes; registry reads from existing GitHub releases as another access layer
2. **Redirect** -- w3id.org redirects point to a schema registry server instead of raw GitHub; authoring/build system unchanged
3. **Registry as primary** -- GA4GH runs a schema registry; GKS build system pushes final schemas into it on release

---

## Tension 2: Layer mismatch

The registry spec has **2 levels**: namespace / schema_name

But GKS has **3 levels**: organization / product / class

<pre class="mermaid">
graph TD
    A[ga4gh] --> B[vrs]
    A --> C[gks-core]
    A --> D[cat-vrs]
    A --> E[va-spec]
    B --> F[Allele]
    B --> G[SequenceLocation]
    B --> H[CopyNumberCount]
</pre>

---

## [ga4gh/schema-registry#6](https://github.com/ga4gh/schema-registry/issues/6)

<img src="/slides/gks-schema-registry/issue-6.png" width="700">

---

## Tension 2: The mapping options

GKS authors **one composite YAML** per product but distributes **25+ individual JSON files** -- register per-class or per-product?

| Approach | Namespace | Schema name | Problem |
|----------|-----------|-------------|---------|
| One NS per product | `vrs` | `Allele` | Loses org grouping |
| Flat namespace | `ga4gh` | `Allele`, `Entity`... | Loses product grouping; names collide |
| Convention NS prefix | `ga4gh-vrs` | `Allele` | Fragile, not machine-readable |
| Convention schema prefix | `ga4gh` | `vrs-Allele` | Fragile, not machine-readable |
| Bundled | `ga4gh` | `vrs` | Loses class-level access |

---

## Tension 2: Possible solutions

1. **Add a SchemaGroup concept** to the registry spec (inspired by Apicurio)
   - namespace = `ga4gh`, group = `vrs`, schema = `Allele`
2. **Bundled schemas + component endpoint** (lighter weight)
   - Register each product as one bundled schema
   - Add `/components/{name}` sub-endpoint for class-level access
3. Just broadcast the schema as **bundled with `$defs`**
   - Serve a compound JSON Schema document per product
   - All classes inside `$defs`, supporting both product-level and class-level views

---

## Proposal: Schema Registry Site

A **static site implementation** of the registry spec

<pre class="mermaid">
graph LR
    A[GitHub repos] -->|Python build| B[Static JSON API]
    B --> C[React frontend]
    C --> D[GitHub Pages]
</pre>

- Pulls schemas from GA4GH repos at build time
- Generates static JSON files matching the registry API
- React UI for browsing namespaces, schemas, versions, components

Benefits:

- discoverability?
- interoperability?

https://nsheff.github.io/schema-registry-site/

---

## Discussion

1. Point 1: Is the distribution/build-system workflow acceptable?
2. Point 2: How should GKS's 3-level hierarchy map into the registry? SchemaGroup, Components, distributing bundled schemas, or something else?
3. What other "tension points" have I missed?
4. Is there value here? If so, is there a path to adoption? What changes to the GKS distribution system or to the schema registry specification would enable this?
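
---

## Sketch: w3id.org URL → registry path

To make the two tensions concrete, here is a rough sketch of how an existing GKS w3id.org URL could map onto registry paths under the "bundled" options. The function name `w3id_to_registry` is hypothetical, the placement of `/components/{name}` under the version path is an assumption (the slides do not say where that sub-endpoint would sit), and the `#/$defs/...` form is simply a JSON Pointer into a bundled document.

```python
from urllib.parse import urlparse


def w3id_to_registry(url: str) -> dict:
    """Split a GKS w3id.org schema URL into candidate registry paths."""
    # Expected path shape: /{org}/schema/{product}/{version}/json/{class}
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 6 or parts[1] != "schema" or parts[4] != "json":
        raise ValueError(f"unrecognized GKS schema URL: {url}")
    org, _, product, version, _, cls = parts

    # Product registered as one bundled schema: namespace = org, schema = product
    bundled = f"/schemas/{org}/{product}/versions/{version}"
    return {
        "bundled": bundled,
        # Assumption: component sub-endpoint hangs off the version path
        "component": f"{bundled}/components/{cls}",
        # Class-level view as a JSON Pointer into the bundled $defs
        "defs_pointer": f"{bundled}#/$defs/{cls}",
    }


paths = w3id_to_registry("https://w3id.org/ga4gh/schema/vrs/2.0.1/json/Allele")
print(paths["bundled"])  # /schemas/ga4gh/vrs/versions/2.0.1
```

Under this mapping the class name is the only part of the w3id.org URL with no home in the 2-level registry path, which is the layer mismatch in miniature.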