
Metadata Catalog Prototype Toy

CI · License: MIT · Python 3.12

A toy metadata-first platform for preclinical lab data: a DuckDB-backed catalog, lineage tracking, an MCP server, and AI-output provenance. Pipelines run bronze → silver → gold and emit OpenLineage events. A Streamlit UI has three tabs: Architecture, Context Layer, and Chat (with click-to-trace lineage).

Note on data: everything in here is synthetic or public-fixture data. There are no real proprietary assets, patient records, or company-internal compounds — the CMP-NNN IDs are placeholders, and the one real fixture (a SoftMax Pro export) is from Benchling's open-source allotropy test suite.

The point of the prototype is to show that the catalog is load-bearing: classification, lineage, and the tool surface belong in the foundation, not in a later phase. Sensitivity gating happens server-side in the catalog, before any data reaches the agent's context. It has plenty of limitations that obviously keep it from being a real product; the goal is just to define the concepts in a simple toy.

Screenshots: the Context Layer tab (what the agent reads before it reads any data) and the Chat tab (every answer carries provenance).

Quick start

Get the whole thing running locally — venv, deps, data, catalog, UI — in about a minute:

uv venv && source .venv/bin/activate
uv pip install --override constraints.txt -r requirements.txt

python -m source.pipelines.ingest_bronze
python -m source.pipelines.bronze_to_silver
python -m source.pipelines.silver_to_gold

streamlit run source/ui/app.py

That's it — open the URL Streamlit prints, poke around the Architecture, Context Layer, and Chat tabs, and try a question like "Is CMP-004 active in TOPFlash and clean in NHP tox?". Without an ANTHROPIC_API_KEY set, the chat falls back to a canned mock that exercises the same tool path. The rest of this README explains what's going on under the hood.

What's in it

Two open-standard preclinical datasets, each landed through three layers:

| Dataset | Layers | Open standard | Classification |
| --- | --- | --- | --- |
| TOPFlash Wnt/β-catenin reporter assay | bronze + silver + gold | Allotrope ASM (`allotropy.parse`) | internal |
| NHP 28-day repeat-dose toxicology study | bronze + silver + gold | CDISC SEND | restricted, GLP |

The bronze TOPFlash data is a real SoftMax Pro fixture from Benchling's allotropy test data. The NHP SEND data and the TOPFlash plate map are synthetic (scripts/generate_synthetic_bronze.py). One investigational compound (CMP-004) appears in both datasets, so the agent has a reason to query both and synthesize an activity-vs-safety answer that no single source could produce alone.
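To make the cross-dataset idea concrete, here is a minimal pure-Python sketch of the kind of join the agent performs for a question like the CMP-004 one. This is not the repo's actual code; the field names (`pct_inhibition`, `adverse_findings`) and values are hypothetical stand-ins for the two gold tables.

```python
# Hypothetical slices of the two gold tables, keyed by the shared compound ID.
# Field names and values are illustrative, not the repo's actual schema.
topflash_gold = {"CMP-004": {"pct_inhibition": 87.5}}
nhp_gold = {"CMP-004": {"adverse_findings": 0, "cohort": "high-dose"}}

def activity_vs_safety(compound_id: str) -> dict:
    """Combine the activity view and the safety view that no single
    dataset holds alone."""
    return {
        "compound": compound_id,
        "activity": topflash_gold.get(compound_id),   # None if absent
        "safety": nhp_gold.get(compound_id),          # None if absent
    }

print(activity_vs_safety("CMP-004"))
```

The shared `CMP-004` key is what gives the agent a reason to call both tools in one turn.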

Setup

Uses uv and a local .venv.

uv venv
source .venv/bin/activate
uv pip install --override constraints.txt -r requirements.txt

Materialize the data and the catalog

The bronze data is committed. Pipelines populate the catalog and silver/gold parquets:

# (only first time, or to regenerate synthetic bronze)
python scripts/generate_synthetic_bronze.py

# Land the 2 datasets in the catalog
python -m source.pipelines.ingest_bronze

# Parse silver: allotropy for TOPFlash, SEND-domain reader for NHP tox
python -m source.pipelines.bronze_to_silver

# Aggregate gold: per-compound % inhibition (TOPFlash) + per-cohort summary (NHP)
python -m source.pipelines.silver_to_gold
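Each of these pipeline runs emits an OpenLineage event. For orientation, this is roughly the minimal RunEvent shape the OpenLineage spec defines; the namespace, job name, and dataset names below are illustrative guesses, not the events this repo actually emits.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage-style RunEvent (shape per the OpenLineage spec).
# Namespace, job, and dataset names here are illustrative, not the repo's.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/metadata-catalog-prototype",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "catalog-prototype", "name": "bronze_to_silver"},
    "inputs": [{"namespace": "catalog-prototype", "name": "topflash_bronze"}],
    "outputs": [{"namespace": "catalog-prototype", "name": "topflash_silver"}],
}
print(json.dumps(event, indent=2))
```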

After this, source/catalog/catalog.duckdb holds 6 dataset entries (2 datasets × 3 layers) and several hundred lineage edges. You can poke it directly:

duckdb source/catalog/catalog.duckdb \
  "SELECT dataset_id, layer, classification FROM datasets ORDER BY layer, dataset_id;"
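The lineage edges are what power click-to-trace in the UI. As a sketch of what "tracing" means, here is a toy upstream walk over a hypothetical edge list (the real edges live in the DuckDB catalog, whatever its internal table layout):

```python
# Toy (child, parent) lineage edges — hypothetical, mirroring the
# bronze → silver → gold layering; the real edges live in catalog.duckdb.
edges = [
    ("topflash_silver", "topflash_bronze"),
    ("topflash_gold", "topflash_silver"),
    ("nhp_silver", "nhp_bronze"),
    ("nhp_gold", "nhp_silver"),
]

def upstream(dataset_id: str) -> list[str]:
    """Return every ancestor dataset of dataset_id, nearest first."""
    parents = [src for (child, src) in edges if child == dataset_id]
    result = []
    for p in parents:
        result.append(p)
        result.extend(upstream(p))
    return result

print(upstream("topflash_gold"))  # → ['topflash_silver', 'topflash_bronze']
```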

Run the agent and UI

The agent uses claude-sonnet-4-6. Set ANTHROPIC_API_KEY for real LLM responses; otherwise it falls back to a canned mock that exercises the same tool path.

# Streamlit demo (Architecture / Context Layer / Chat)
streamlit run source/ui/app.py

# Or invoke the agent directly
python -c "
from source.agent.client import CatalogAgent
a = CatalogAgent(caller_role='researcher_internal')
print(a.chat('What is the most potent compound against Wnt/β-catenin signaling?'))
"

Run the MCP server standalone

# stdio transport (default; for connecting an MCP client)
python -m source.mcp_server.server

# Streamable HTTP transport
python -m source.mcp_server.server --http
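If you want to talk to the stdio server by hand rather than via `scripts/test_mcp_stdio.py`, note that MCP's stdio transport speaks JSON-RPC 2.0. This is roughly the first message a client writes to the server's stdin; the field values are illustrative, and the exact handshake is defined by the MCP specification.

```python
import json

# Illustrative MCP "initialize" request (JSON-RPC 2.0 over stdio).
# Values like protocolVersion and clientInfo are example placeholders.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke-test", "version": "0.1"},
    },
}
print(json.dumps(initialize))
```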

Roles and sensitivity gating

The agent is constructed with a caller_role (researcher_internal or researcher_external). The MCP tools consult the catalog's classification field server-side and filter results before they reach the model. An external role asking about the restricted NHP dataset gets error: access_denied from the tool — the model never sees the data and has no way to reason its way around the gate.
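The gating logic can be sketched in a few lines. This is a hypothetical shape, not the repo's actual tool code — in particular, the role-to-classification mapping below is an assumption — but it shows the key property: the check runs in the tool layer, so a denied dataset never enters the model's context.

```python
# Hypothetical role → allowed-classification mapping (not the repo's actual
# policy); the real check consults the catalog's classification field.
ROLE_CLEARANCE = {
    "researcher_internal": {"internal", "restricted"},
    "researcher_external": {"internal"},
}

def read_dataset(dataset_id: str, classification: str, caller_role: str):
    """Server-side gate: filter before anything reaches the model."""
    if classification not in ROLE_CLEARANCE.get(caller_role, set()):
        return {"error": "access_denied"}  # the model sees only this marker
    return {"dataset_id": dataset_id, "rows": []}  # gated data would go here

print(read_dataset("nhp_gold", "restricted", "researcher_external"))
```

Because the denial happens before tool output is assembled, prompt-injection or clever phrasing on the model side has nothing to work with.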

Layout

source/
  catalog/          # Pydantic schema + DuckDB catalog with upsert/list/lineage helpers
  pipelines/        # ingest_bronze, bronze_to_silver, silver_to_gold (emit OpenLineage events)
  mcp_server/       # tools (sensitivity-gated) + FastMCP server wrapper
  agent/            # Anthropic-SDK agent client; system prompt; mock fallback
  observability/    # OpenLineage event emission + structured logging
  ui/               # Streamlit app: Architecture / Context Layer / Chat
  data/{bronze,silver,gold}/  # actual data files
scripts/
  generate_synthetic_bronze.py  # one-shot synthetic data generator
  test_mcp_stdio.py             # smoke test for the MCP stdio transport
tests/
  test_smoke.py     # end-to-end pipeline + agent
  test_mcp_agent.py # MCP-mediated tool-call paths

Built by Nicholas Justice. Questions or thoughts → open an issue or ping me on GitHub.
