A toy metadata-first platform for preclinical lab data: a DuckDB-backed catalog, lineage tracking, an MCP server, and AI-output provenance. Pipelines run bronze → silver → gold and emit OpenLineage events. A Streamlit UI has three tabs: Architecture, Context Layer, and Chat (with click-to-trace lineage).
Note on data: everything in here is synthetic or public-fixture data. There are no real proprietary assets, patient records, or company-internal compounds — the CMP-NNN IDs are placeholders, and the one real fixture (a SoftMax Pro export) comes from Benchling's open-source allotropy test suite.
The point of the prototype is to show that the catalog is load-bearing: classification, lineage, and the tool surface belong in the foundation, not in a later phase. Sensitivity gating happens server-side in the catalog, before any data reaches the agent's context. It has plenty of limitations that keep it from being a real product; the goal is just to lay out the concepts in a simple toy.
*(Screenshots: Context Layer — what the agent reads before it reads any data; Chat — every answer carries provenance.)*
Get the whole thing running locally — venv, deps, data, catalog, UI — in about a minute:

```sh
uv venv && source .venv/bin/activate
uv pip install --override constraints.txt -r requirements.txt
python -m source.pipelines.ingest_bronze
python -m source.pipelines.bronze_to_silver
python -m source.pipelines.silver_to_gold
streamlit run source/ui/app.py
```

That's it — open the URL Streamlit prints, poke around the Architecture, Context Layer, and Chat tabs, and try a question like "Is CMP-004 active in TOPFlash and clean in NHP tox?". Without an `ANTHROPIC_API_KEY` set, the chat falls back to a canned mock that exercises the same tool path. The rest of this README explains what's going on under the hood.
Two open-standard preclinical datasets, each landed through three layers:
| Dataset | Layers | Open standard | Classification |
|---|---|---|---|
| TOPFlash Wnt/β-catenin reporter assay | bronze + silver + gold | Allotrope ASM (`allotropy.parse`) | internal |
| NHP 28-day repeat-dose toxicology study | bronze + silver + gold | CDISC SEND | restricted, GLP |
The bronze TOPFlash data is a real SoftMax Pro fixture from Benchling's allotropy test data. The NHP SEND data and the TOPFlash plate map are synthetic (scripts/generate_synthetic_bronze.py). One investigational compound (CMP-004) appears in both datasets, so the agent has a reason to query both and synthesize an activity-vs-safety answer that no single source could produce alone.
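To make the cross-dataset synthesis concrete, here is a toy sketch of the join the agent effectively performs: an activity-vs-safety answer exists only where a compound appears in both gold tables. The field names (`pct_inhibition`, `adverse_findings`) and threshold are illustrative assumptions, not the real gold schemas.

```python
# Hypothetical slices of the two gold tables; the real schemas are produced
# by source/pipelines/silver_to_gold.
topflash_gold = {"CMP-004": {"pct_inhibition": 87.5}}
nhp_tox_gold = {"CMP-004": {"adverse_findings": 0}}

def active_and_clean(compound_id, potency_threshold=50.0):
    """True only if the compound clears both datasets' criteria;
    None when either dataset is missing the compound — a single
    source can't answer the question."""
    potency = topflash_gold.get(compound_id)
    safety = nhp_tox_gold.get(compound_id)
    if potency is None or safety is None:
        return None
    return (potency["pct_inhibition"] >= potency_threshold
            and safety["adverse_findings"] == 0)

print(active_and_clean("CMP-004"))  # True: potent in TOPFlash, clean in NHP tox
```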
Uses uv and a local .venv.

```sh
uv venv
source .venv/bin/activate
uv pip install --override constraints.txt -r requirements.txt
```

The bronze data is committed. Pipelines populate the catalog and silver/gold parquets:
```sh
# (only first time, or to regenerate synthetic bronze)
python scripts/generate_synthetic_bronze.py

# Land the 2 datasets in the catalog
python -m source.pipelines.ingest_bronze

# Parse silver: allotropy for TOPFlash, SEND-domain reader for NHP tox
python -m source.pipelines.bronze_to_silver

# Aggregate gold: per-compound % inhibition (TOPFlash) + per-cohort summary (NHP)
python -m source.pipelines.silver_to_gold
```

After this, source/catalog/catalog.duckdb holds 6 dataset entries (2 datasets × 3 layers) and several hundred lineage edges. You can poke it directly:
```sh
duckdb source/catalog/catalog.duckdb \
  "SELECT dataset_id, layer, classification FROM datasets ORDER BY layer, dataset_id;"
```

The agent uses claude-sonnet-4-6. Set `ANTHROPIC_API_KEY` for real LLM responses; otherwise it falls back to a canned mock that exercises the same tool path.
```sh
# Streamlit demo (Architecture / Context Layer / Chat)
streamlit run source/ui/app.py

# Or invoke the agent directly
python -c "
from source.agent.client import CatalogAgent
a = CatalogAgent(caller_role='researcher_internal')
print(a.chat('What is the most potent compound against Wnt/β-catenin signaling?'))
"
```

```sh
# stdio transport (default; for connecting an MCP client)
python -m source.mcp_server.server

# Streamable HTTP transport
python -m source.mcp_server.server --http
```

The agent is constructed with a `caller_role` (`researcher_internal` or `researcher_external`). The MCP tools consult the catalog's classification field server-side and filter results before they reach the model. An external role asking about the restricted NHP dataset gets `error: access_denied` from the tool — the model never sees the data and has no way to reason its way around the gate.
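The gate described above boils down to a check that runs before any rows are fetched. Here is a minimal sketch of that pattern — the catalog entries, role names, and return shapes are illustrative; the real logic lives in source/mcp_server/.

```python
# Toy stand-ins for the catalog and the role→classification policy.
CATALOG = {
    "nhp_tox_gold": {"classification": "restricted"},
    "topflash_gold": {"classification": "internal"},
}
ALLOWED = {
    "researcher_internal": {"internal", "restricted"},
    "researcher_external": {"internal"},
}

def query_dataset(dataset_id, caller_role):
    """Consult classification server-side BEFORE touching data, so a
    denied caller gets an error string and the model never sees rows."""
    entry = CATALOG.get(dataset_id)
    if entry is None:
        return {"error": "not_found"}
    if entry["classification"] not in ALLOWED.get(caller_role, set()):
        return {"error": "access_denied"}  # data never enters model context
    return {"rows": f"<rows from {dataset_id}>"}

print(query_dataset("nhp_tox_gold", "researcher_external"))  # {'error': 'access_denied'}
print(query_dataset("nhp_tox_gold", "researcher_internal"))
```

The important design point is that the policy lives with the catalog, not in the prompt: the model cannot be talked into bypassing a check it never participates in.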
```
source/
  catalog/        # Pydantic schema + DuckDB catalog with upsert/list/lineage helpers
  pipelines/      # ingest_bronze, bronze_to_silver, silver_to_gold (emit OpenLineage events)
  mcp_server/     # tools (sensitivity-gated) + FastMCP server wrapper
  agent/          # Anthropic-SDK agent client; system prompt; mock fallback
  observability/  # OpenLineage event emission + structured logging
  ui/             # Streamlit app: Architecture / Context Layer / Chat
data/{bronze,silver,gold}/  # actual data files
scripts/
  generate_synthetic_bronze.py  # one-shot synthetic data generator
  test_mcp_stdio.py             # smoke test for the MCP stdio transport
tests/
  test_smoke.py      # end-to-end pipeline + agent
  test_mcp_agent.py  # MCP-mediated tool-call paths
```
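For a sense of what the pipelines' OpenLineage emission looks like, here is a sketch of a run event built as a plain dict following the OpenLineage RunEvent shape (eventType, eventTime, run, job, inputs/outputs). The namespace, job, and dataset names are made up for illustration; the real emission lives in source/observability/.

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a minimal OpenLineage-style run event as a plain dict.
    event_type is START / COMPLETE / FAIL; inputs/outputs are dataset names."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "preclinical-demo", "name": job_name},
        "inputs": [{"namespace": "duckdb", "name": n} for n in inputs],
        "outputs": [{"namespace": "duckdb", "name": n} for n in outputs],
        "producer": "source/observability",  # illustrative producer URI
    }

event = run_event(
    "COMPLETE",
    "bronze_to_silver",
    inputs=["bronze.topflash_softmax_export"],
    outputs=["silver.topflash_asm"],
)
print(json.dumps(event, indent=2))
```

Every layer transition emits one of these, which is where the catalog's several hundred lineage edges come from.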
Built by Nicholas Justice. Questions or thoughts → open an issue or ping me on GitHub.

