<style> #title { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style>

<section id="title" data-background="/images/presentations/bg.svg.png" data-transition-speed="slow">

# MCP for BEDbase: Connecting AI to Genomic Data

Nathan Sheffield

<div class="bullet">
<img src="/images/external/uva_dgs_logo.svg" height="85">
<img src="/images/logo/logo_databio_long.svg" height="65">
</div>

<span style="font-size:0.6em"><a href="http://www.databio.org/slides">www.databio.org/slides</a></span>

</section>

---

<!-- .slide: data-background="/images/presentations/bg.svg.png" data-transition-speed="slow" -->

### Outline

<style> .previewblock { float: left; width: 20px; height: 45px; margin: 0; border: none; white-space: nowrap; box-sizing: border-box; } .questionblock { float: left; width: 100%; margin: 5px 0; border: 1px solid rgba(255, 255, 255, .2); } </style>

<div class="previewblock" style="width:50%">Model Context Protocol</div>
<div class="previewblock" style="width:50%">BEDbase MCP</div>
<div class="previewblock" style="width:50%"></div>
<div class="previewblock" style="width:50%"></div>
<br clear="all">
<div class="previewblock" style="width:50%; background:#333388"></div>
<div class="previewblock" style="width:50%; background:#338833"></div>
<div class="previewblock" style="width:50%"></div>
<div class="previewblock" style="width:50%"></div>
<br clear="all">
<div class="previewblock" style="width:50%"></div>
<div class="previewblock" style="width:50%"></div>
<div class="questionblock" style="background:#222; color:#eee; font-size: 0.6em; margin-top: 35px">◁ Questions ▷</div>

---

## What is MCP?
**Model Context Protocol** - an open standard for connecting LLMs to external data - Often described as "**USB-C for AI**" - Standardizes how AI assistants access tools, data, and services - Started as internal Anthropic project (July 2024) - Open-sourced November 25, 2024 - OpenAI announced full MCP support March 2025 - December 2025: donated to Linux Foundation's Agentic AI Foundation - https://modelcontextprotocol.io/specification/2025-06-18 Notes: - The USB-C analogy: one standard connector instead of proprietary cables --- ## The Problem MCP Solves <div style="display: flex; justify-content: space-between;"> <div style="width: 45%;"> ### Without MCP - Every AI app needs custom integrations - 5 apps × 10 tools = **50 implementations** - Duplicated effort, inconsistent behavior </div> <div style="width: 45%;" class="fragment"> ### With MCP - Build **one server** per tool - Works with Claude, ChatGPT, Cursor... - 5 apps × 10 tools = **10 implementations** </div> </div> Notes: - The key insight: AI apps already have MCP clients built in - You only build the server side - Each server works with ANY MCP-compatible AI application - Same principle that made USB successful - standardization reduces complexity --- ## How MCP Works <img src="/slides/mcp-intro/architecture.svg" style="max-height: 280px;"> **You build the Server. The AI app provides the rest.** - AI apps (Claude, ChatGPT) have MCP clients built in - Your server exposes tools the AI can call - The AI decides when to use your tools based on conversation Notes: - MCP terminology is confusing: "Host" = the AI application (not your server!) 
- "Client" = protocol handler within the AI app
- "Server" = what you build to expose your data/tools
- The AI app handles orchestration - your server just responds to tool calls

---

## MCP Primitives

(Things that can be exposed by an MCP server)

| Primitive | Purpose | Analogy |
|-----------|---------|---------|
| **Tools** | Execute actions, perform operations | POST requests |
| **Resources** | Read-only data access | GET requests |
| **Prompts** | Reusable templates for workflows | Macros |

Notes:
- These are the SERVER primitives (what servers offer)
- There are also CLIENT primitives: Sampling (server-initiated LLM calls), Roots (filesystem boundaries), Elicitation (request user input)
- Tools are model-controlled: the LLM decides when to call them based on docstrings
- Resources are application-controlled: explicitly requested by the host

---

## Tools Implementation Example

```python
from fastmcp import FastMCP

mcp = FastMCP("BEDbase")

@mcp.tool
def bed_search_text(query: str, limit: int = 10):
    """
    Search BEDbase for genomic region files using natural language.
    """
    # bbagent: BEDbase backend client, constructed at startup
    results = bbagent.bed.hybrid_search(query)
    return results[:limit]  # return at most `limit` results
```

The LLM reads the docstring to understand when and how to use each tool.
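On the wire, the host invokes this tool with a JSON-RPC 2.0 `tools/call` request. A minimal sketch of the message a client would send (the argument values here are illustrative):

```python
import json

# JSON-RPC 2.0 message an MCP client sends to invoke the tool above;
# the "arguments" object is validated against the schema derived from
# the function's type hints (query/limit values are made up).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "bed_search_text",
        "arguments": {"query": "K562 CTCF ChIP-seq peaks", "limit": 5},
    },
}
print(json.dumps(request, indent=2))
```

The server's reply carries the tool's return value in the JSON-RPC result; FastMCP handles that serialization for you.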
Notes:
- FastMCP is the leading Python framework for MCP servers
- The docstring is critical - it's how the AI learns what the tool does
- Type hints become the tool's parameter schema
- Keep tools focused: one purpose per tool

---

## MCP Transport Methods

<div style="display: flex; justify-content: space-between;">
<div style="width: 45%;">

### STDIO

- Client launches server as **subprocess**
- Communicates via stdin/stdout
- Best for **local** tools
- Desktop apps, CLI tools

</div>
<div style="width: 45%;">

### Streamable HTTP

- Server runs **independently**
- Multiple clients connect via HTTP
- Best for **remote** services
- Web APIs, microservices

</div>
</div>

Notes:
- STDIO: Local file access, desktop integration, CLI tools
- Streamable HTTP: Hosted services, multi-user, existing REST APIs
- "Streamable" refers to optional Server-Sent Events (SSE) support
- How it works: client sends POST requests; server can respond with immediate JSON OR open an SSE stream for long-running operations
- Why "streamable"? Some tool calls take time (search, analysis) - SSE lets server send progress updates
- MCP originally used pure SSE but deprecated it (March 2025) - security issue: one-time auth left connection open indefinitely
- Streamable HTTP allows per-request authentication while still supporting streaming when needed

---

## Keep Context Clean with Progressive Disclosure

A key design principle for MCP servers: **Don't dump everything at once**. Keeps context clean and focused.

1. Search returns **brief summaries** (id, name, type) <!-- .element: class="fragment" -->
2. User (or AI) can follow up on a specific result <!-- .element: class="fragment" -->
3. Get **detailed metadata** on demand <!-- .element: class="fragment" -->
4. 
Drill into **statistics** or **download** when needed <!-- .element: class="fragment" -->

Notes:
- LLM context windows are limited and expensive
- Returning full metadata for 100 results wastes tokens
- Let the user drive the depth of exploration
- Design tools to return just enough to enable follow-up questions

---

## Progressive Disclosure in Practice

<img src="/slides/mcp-intro/progressive-disclosure.svg" style="max-height: 500px;">

---

## MCP Tools vs API Endpoints

Design for **conversation**, not programmatic consumption.

**Keep as API endpoints** (not MCP):

- Paginated lists meant for iteration
- Small, granular tasks you'd call in a loop

**Expose as MCP tools**:

- Search that yields consolidated results
- Retrieval of specific knowledge
- Large or one-time tasks (trigger workflows, bulk updates)

If you'd call it in a for-loop, it's probably not an MCP tool.

Notes:
- The principle: MCP tools are for conversational use
- Programs iterate through pages; AI needs consolidated results
- Programs repeat small tasks; AI does one-shot larger operations
- Traditional APIs still have their place — MCP isn't a replacement

---

## MCP + Semantic Search

The real power: **MCP as an interface to RAG systems**

Traditional database query:

```sql
SELECT * FROM beds WHERE cell_type = 'K562' AND assay = 'ChIP-seq'
```

With semantic search MCP:

```
"enhancers active in K562 similar to CTCF binding"
```

MCP tools that leverage vector embeddings pair the power of semantic search with AI.

Notes:
- RAG = Retrieval-Augmented Generation
- MCP servers can wrap vector databases (Qdrant, Pinecone, etc.)
- The AI handles intent → the MCP server handles retrieval
- This is where domain-specific embeddings shine (Region2Vec, BioEmbeddings)

---

## What is BEDbase?

The problem: researchers can't easily discover relevant BED files.
**BEDbase** aggregates and serves genomic region data: - **130,000+ BED files** from GEO, ENCODE, and other sources - **16,000+ BEDsets** (curated collections) - **122 genomes** supported - **Dual search**: metadata queries OR semantic similarity Notes: - BED files define genomic regions (ChIP-seq peaks, ATAC-seq, enhancers, etc.) - Currently scattered across repositories with inconsistent metadata - BEDbase standardizes and indexes them for discovery - Region2Vec: like Word2Vec but for genomic intervals --- ## BEDbase + MCP <img src="/slides/mcp-intro/bedbase-arch.svg" style="max-height: 320px;"> An MCP server makes BEDbase conversationally accessible. Notes: - The MCP server wraps the existing bedhost REST API - Streamable HTTP transport - deployed alongside the API - Primary interface: semantic text search via Text2BedNN - Progressive disclosure: search → metadata → statistics → download --- ## BEDbase MCP Tools | Tool | Purpose | |------|---------| | `bed_search_text` | Natural language search (primary) | | `bed_get_metadata` | Detailed file metadata | | `bed_get_statistics` | Quantitative stats (regions, GC content) | | `bed_get_download_url` | Direct file access | | `bed_find_similar` | Find related files by genomic content | Notes: - Search is the entry point - matches conversational interaction - Metadata/stats enable drill-down without returning everything upfront - find_similar uses Region2Vec embeddings for content-based similarity --- ## Text2BedNN: Semantic Genomics Search <img src="/slides/mcp-intro/text2bed.svg" style="max-height: 250px;"> "Enhancers in neural progenitors" → relevant BED files ranked by semantic similarity. 
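Under the hood, the ranking step is nearest-neighbor search in a shared embedding space. A toy sketch with made-up 3-d vectors (real embeddings are high-dimensional and produced by the trained encoders):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (invented): in BEDbase these would come from the text
# encoder (query side) and Region2Vec (BED file side), projected into a
# shared space by Text2BedNN.
bed_embeddings = {
    "neural_enhancers.bed": [0.9, 0.1, 0.2],
    "k562_ctcf_peaks.bed":  [0.1, 0.8, 0.3],
    "liver_atac.bed":       [0.2, 0.2, 0.9],
}
query_vec = [0.85, 0.15, 0.25]  # embedding of "enhancers in neural progenitors"

ranked = sorted(bed_embeddings,
                key=lambda f: cosine(query_vec, bed_embeddings[f]),
                reverse=True)
print(ranked)  # most similar file first
```

In production the linear scan is replaced by an approximate nearest-neighbor index (a vector database) rather than a full pass over 130,000 files.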
Notes: - Text2BedNN maps text embeddings to BED file embeddings - Trained on BED file metadata and genomic content - Enables queries that would be impossible with keyword search - Pre-trained models available on HuggingFace --- ## Client Configuration ```json { "mcpServers": { "bedbase": { "url": "https://api.bedbase.org/mcp" } } } ``` One line to connect any MCP-compatible AI to 130k genomic datasets. --- ## Example: Exploratory Research > "Find ChIP-seq peaks for transcription factors in neural progenitor cells with high enrichment in promoter regions" AI assistant: 1. Searches BEDbase via `bed_search_text` 2. Shows top matches with key metadata 3. User asks about a specific result 4. Drills down with `bed_get_statistics` 5. Offers download link --- ## Example: Pipeline Construction > "I need 10 high-quality ATAC-seq datasets from hematopoietic cells for training a peak caller. Give me the download URLs." AI assistant: 1. Searches with cell type and assay filters 2. Checks statistics (region counts, quality metrics) 3. Filters to top 10 by quality 4. Returns download URLs for pipeline input 5. Could generate a manifest file or Snakemake config Notes: - MCP enables programmatic dataset assembly - The AI can apply quality filters based on statistics - Output can feed directly into bioinformatics workflows - Future: MCP tools that generate workflow configs --- ## The Role of Schemas <div style="display: flex; gap: 20px;"> <img src="/slides/mcp-intro/schema-different.svg" style="width: 48%; max-height: 500px;"> <img src="/slides/mcp-intro/schema-common.svg" style="width: 48%; max-height: 500px;"> </div> - **Different schemas** — data is siloed even with MCP access - **Common schemas** — tools can exchange data seamlessly MCP solves the *connection* problem; schemas solve the *interoperability* problem. 
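The difference can be made concrete with a toy sketch (all field names invented): two tools describing the same experiment only become comparable once both map onto a shared schema.

```python
# Two tools label the same experiment differently (invented field names):
tool_a = {"cellType": "K562", "assayName": "ChIP-seq"}
tool_b = {"cell_line": "K562", "experiment_type": "ChIP-seq"}

def to_common(record, mapping):
    """Rename source-specific fields to a shared schema's field names."""
    return {common: record[src] for src, common in mapping.items()}

# Map both onto one agreed vocabulary (FAIRtracks-style in spirit):
a = to_common(tool_a, {"cellType": "cell_type", "assayName": "assay"})
b = to_common(tool_b, {"cell_line": "cell_type", "experiment_type": "assay"})

assert a == b  # identical records: the tools can now exchange data
```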
Notes:
- MCP increases access to data, which makes schema alignment more valuable
- JSON Schema is the de facto standard for MCP tool definitions
- Shared schemas enable cross-tool workflows
- Example: FAIRtracks schema for genomic track files

---

## Summary

- **MCP** standardizes AI ↔ external data connections
- **Progressive disclosure** keeps context clean and focused
- **MCP + Semantic search** = powerful RAG interfaces
- **Schemas** are complementary: MCP solves *connection*, schemas solve *interoperability*
- **BEDbase MCP** opens 130k genomic datasets to AI assistants

---

<style> #acknowledgements { height: 100% !important; display: flex !important; flex-direction: column !important; justify-content: center !important; } </style>

<section id="acknowledgements" data-background="/images/presentations/bg.svg.png">

# Thank You

<br clear="all"/>

<span class="small bullet"><img src="/images/external/github_bug_black.svg" height="20" class="bullet"><a href="http://github.com/nsheff">nsheff</a></span> · <span class="small bullet"><img src="/images/icons/web.svg" height="25" class="bullet"><a href="http://databio.org">databio.org</a></span> · <span class="small bullet"><img src="/images/icons/letter.svg" height="25" class="bullet"><a href="mailto:nsheffield@virginia.edu">nsheffield@virginia.edu</a></span>

<div class="bullet" style="background-color:rgba(45,45,45,.65); border-radius: 25px; opacity:0.9">
<img src="/images/external/uva_dgs_logo.svg" height="65">
<img src="/images/logo/logo_databio_long.svg" height="45">
</div>

</section>