
Metadata Catalog Prototype Toy

CI · License: MIT · Python 3.12

A toy metadata-first platform for preclinical lab data: a DuckDB-backed catalog, lineage tracking, an MCP server, and AI-output provenance. Pipelines run bronze → silver → gold and emit OpenLineage events. A Streamlit UI has three tabs: Architecture, Context Layer, and Chat (with click-to-trace lineage).

Note on data: everything in here is synthetic or public-fixture data. There are no real proprietary assets, patient records, or company-internal compounds — the CMP-NNN IDs are placeholders, and the one real fixture (a SoftMax Pro export) is from Benchling's open-source allotropy test suite.

The point of the prototype is to show that the catalog is load-bearing: classification, lineage, and the tool surface belong in the foundation, not in a later phase. Sensitivity gating happens server-side in the catalog, before any data reaches the agent's context. It has plenty of limitations that obviously keep it from being a real product; the goal is just to define the concepts in a simple toy.

Screenshots: the Context Layer tab (what the agent reads before it reads any data) and the Chat tab (every answer carries provenance).

Quick start

Get the whole thing running locally — venv, deps, data, catalog, UI — in about a minute:

uv venv && source .venv/bin/activate
uv pip install --override constraints.txt -r requirements.txt

python -m source.pipelines.ingest_bronze
python -m source.pipelines.bronze_to_silver
python -m source.pipelines.silver_to_gold

streamlit run source/ui/app.py

That's it — open the URL Streamlit prints, poke around the Architecture, Context Layer, and Chat tabs, and try a question like "Is CMP-004 active in TOPFlash and clean in NHP tox?". Without an ANTHROPIC_API_KEY set, the chat falls back to a canned mock that exercises the same tool path. The rest of this README explains what's going on under the hood.

What's in it

Two open-standard preclinical datasets, each landed through three layers:

| Dataset | Layers | Open standard | Classification |
| --- | --- | --- | --- |
| TOPFlash Wnt/β-catenin reporter assay | bronze + silver + gold | Allotrope ASM (`allotropy.parse`) | internal |
| NHP 28-day repeat-dose toxicology study | bronze + silver + gold | CDISC SEND | restricted, GLP |

The bronze TOPFlash data is a real SoftMax Pro fixture from Benchling's allotropy test data. The NHP SEND data and the TOPFlash plate map are synthetic (scripts/generate_synthetic_bronze.py). One investigational compound (CMP-004) appears in both datasets, so the agent has a reason to query both and synthesize an activity-vs-safety answer that no single source could produce alone.
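To make the cross-dataset idea concrete, here is a minimal pure-Python sketch of the kind of join the agent performs for a question like the CMP-004 one. This is not the repo's actual code; the field names (`pct_inhibition`, `adverse_findings`) and values are hypothetical stand-ins for the two gold tables.

```python
# Hypothetical slices of the two gold tables, keyed by the shared compound ID.
# Field names and values are illustrative, not the repo's actual schema.
topflash_gold = {"CMP-004": {"pct_inhibition": 87.5}}
nhp_gold = {"CMP-004": {"adverse_findings": 0, "cohort": "high-dose"}}

def activity_vs_safety(compound_id: str) -> dict:
    """Combine the activity view and the safety view that no single
    dataset holds alone."""
    return {
        "compound": compound_id,
        "activity": topflash_gold.get(compound_id),   # None if absent
        "safety": nhp_gold.get(compound_id),          # None if absent
    }

print(activity_vs_safety("CMP-004"))
```

The shared `CMP-004` key is what gives the agent a reason to call both tools in one turn.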

Setup

Uses uv and a local .venv.

uv venv
source .venv/bin/activate
uv pip install --override constraints.txt -r requirements.txt

Materialize the data and the catalog

The bronze data is committed. Pipelines populate the catalog and silver/gold parquets:

# (only first time, or to regenerate synthetic bronze)
python scripts/generate_synthetic_bronze.py

# Land the 2 datasets in the catalog
python -m source.pipelines.ingest_bronze

# Parse silver: allotropy for TOPFlash, SEND-domain reader for NHP tox
python -m source.pipelines.bronze_to_silver

# Aggregate gold: per-compound % inhibition (TOPFlash) + per-cohort summary (NHP)
python -m source.pipelines.silver_to_gold
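Each of these pipeline runs emits an OpenLineage event. For orientation, this is roughly the minimal RunEvent shape the OpenLineage spec defines; the namespace, job name, and dataset names below are illustrative guesses, not the events this repo actually emits.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage-style RunEvent (shape per the OpenLineage spec).
# Namespace, job, and dataset names here are illustrative, not the repo's.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/metadata-catalog-prototype",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "catalog-prototype", "name": "bronze_to_silver"},
    "inputs": [{"namespace": "catalog-prototype", "name": "topflash_bronze"}],
    "outputs": [{"namespace": "catalog-prototype", "name": "topflash_silver"}],
}
print(json.dumps(event, indent=2))
```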

After this, source/catalog/catalog.duckdb holds 6 dataset entries (2 datasets × 3 layers) and several hundred lineage edges. You can poke it directly:

duckdb source/catalog/catalog.duckdb \
  "SELECT dataset_id, layer, classification FROM datasets ORDER BY layer, dataset_id;"
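The lineage edges are what power click-to-trace in the UI. As a sketch of what "tracing" means, here is a toy upstream walk over a hypothetical edge list (the real edges live in the DuckDB catalog, whatever its internal table layout):

```python
# Toy (child, parent) lineage edges — hypothetical, mirroring the
# bronze → silver → gold layering; the real edges live in catalog.duckdb.
edges = [
    ("topflash_silver", "topflash_bronze"),
    ("topflash_gold", "topflash_silver"),
    ("nhp_silver", "nhp_bronze"),
    ("nhp_gold", "nhp_silver"),
]

def upstream(dataset_id: str) -> list[str]:
    """Return every ancestor dataset of dataset_id, nearest first."""
    parents = [src for (child, src) in edges if child == dataset_id]
    result = []
    for p in parents:
        result.append(p)
        result.extend(upstream(p))
    return result

print(upstream("topflash_gold"))  # → ['topflash_silver', 'topflash_bronze']
```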

Run the agent and UI

The agent uses claude-sonnet-4-6. Set ANTHROPIC_API_KEY for real LLM responses; otherwise it falls back to a canned mock that exercises the same tool path.

# Streamlit demo (Architecture / Context Layer / Chat)
streamlit run source/ui/app.py

# Or invoke the agent directly
python -c "
from source.agent.client import CatalogAgent
a = CatalogAgent(caller_role='researcher_internal')
print(a.chat('What is the most potent compound against Wnt/β-catenin signaling?'))
"

Run the MCP server standalone

# stdio transport (default; for connecting an MCP client)
python -m source.mcp_server.server

# Streamable HTTP transport
python -m source.mcp_server.server --http
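If you want to talk to the stdio server by hand rather than via `scripts/test_mcp_stdio.py`, note that MCP's stdio transport speaks JSON-RPC 2.0. This is roughly the first message a client writes to the server's stdin; the field values are illustrative, and the exact handshake is defined by the MCP specification.

```python
import json

# Illustrative MCP "initialize" request (JSON-RPC 2.0 over stdio).
# Values like protocolVersion and clientInfo are example placeholders.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke-test", "version": "0.1"},
    },
}
print(json.dumps(initialize))
```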

Roles and sensitivity gating

The agent is constructed with a caller_role (researcher_internal or researcher_external). The MCP tools consult the catalog's classification field server-side and filter results before they reach the model. An external role asking about the restricted NHP dataset gets error: access_denied from the tool — the model never sees the data and has no way to reason its way around the gate.
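The gating logic can be sketched in a few lines. This is a hypothetical shape, not the repo's actual tool code — in particular, the role-to-classification mapping below is an assumption — but it shows the key property: the check runs in the tool layer, so a denied dataset never enters the model's context.

```python
# Hypothetical role → allowed-classification mapping (not the repo's actual
# policy); the real check consults the catalog's classification field.
ROLE_CLEARANCE = {
    "researcher_internal": {"internal", "restricted"},
    "researcher_external": {"internal"},
}

def read_dataset(dataset_id: str, classification: str, caller_role: str):
    """Server-side gate: filter before anything reaches the model."""
    if classification not in ROLE_CLEARANCE.get(caller_role, set()):
        return {"error": "access_denied"}  # the model sees only this marker
    return {"dataset_id": dataset_id, "rows": []}  # gated data would go here

print(read_dataset("nhp_gold", "restricted", "researcher_external"))
```

Because the denial happens before tool output is assembled, prompt-injection or clever phrasing on the model side has nothing to work with.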

Layout

source/
  catalog/          # Pydantic schema + DuckDB catalog with upsert/list/lineage helpers
  pipelines/        # ingest_bronze, bronze_to_silver, silver_to_gold (emit OpenLineage events)
  mcp_server/       # tools (sensitivity-gated) + FastMCP server wrapper
  agent/            # Anthropic-SDK agent client; system prompt; mock fallback
  observability/    # OpenLineage event emission + structured logging
  ui/               # Streamlit app: Architecture / Context Layer / Chat
  data/{bronze,silver,gold}/  # actual data files
scripts/
  generate_synthetic_bronze.py  # one-shot synthetic data generator
  test_mcp_stdio.py             # smoke test for the MCP stdio transport
tests/
  test_smoke.py     # end-to-end pipeline + agent
  test_mcp_agent.py # MCP-mediated tool-call paths

Built by Nicholas Justice. Questions or thoughts → open an issue or ping me on GitHub.
