datagen is the repository's offline synthetic dataset bootstrapper for the paper-aligned BitNet b1.58 runtime. The merged implementation combines deterministic variation patterns with the repository's prompt-template system so each generated JSONL record carries both a reusable DataGen prompt and structured training metadata. Seeds remain optional at runtime, but when supplied they are used for grounding, prompt rendering, and output attribution.
dotnet run --project src/BitNetSharp.App/BitNetSharp.App.csproj -- datagen \
--domain "medical-diagnosis" \
--count 50000 \
--seeds examples/seed-examples.json \
--output data/synthetic-medical.jsonl \
--constraint "Use American English" \
--lora medical-lora.bin--domain=...required target domain--count=...number of accepted examples to write--output=...output JSONL file--seeds=...optional JSON array of seed examples using eitherinstruction,prompt, orinputplusresponse,output, oranswer--task-type=...optional task family label such asinstruction-response,qa, orclassification--constraint=...repeatable natural-language constraints--constraints=a,b,ccomma-separated constraint shorthand--output-schema=...optional schema description injected into the prompt template--template=...optional absolute or relative path to a JSON prompt template--lora=...optional LoRA artifact path recorded in output metadata--candidate-count=...number of self-consistency passes per example, default3--min-quality=...acceptance threshold between0.0and1.0, default0.45--max-tokens=...optional model output limit
Seed files are standard JSON arrays. Each object must include one instruction-like field and one response-like field. DataGen accepts the following aliases:
- instruction:
instruction,prompt, orinput - response:
response,output, oranswer
Example:
[
{
"prompt": "Summarize the patient's main complaint and likely differential diagnosis.",
"response": "Restate the complaint, list the most likely causes, and flag any immediate safety concerns."
},
{
"instruction": "Explain what evidence should be gathered before choosing a treatment plan.",
"answer": "Collect history, exam findings, recent medications, and any contraindications before recommending next steps."
}
]When --seeds is omitted, the command synthesizes a neutral seed from the requested domain and task type so small bootstrap runs still work.
Each JSONL line includes the instruction-response pair plus generation metadata:
{
"instruction": "Create a medical-diagnosis task that starts from this seed: Summarize the patient's main complaint and likely differential diagnosis. [sample 1]",
"response": "Use the seed response as the baseline: Restate the complaint, list the most likely causes, and flag any immediate safety concerns. Then adapt it for medical-diagnosis work with extra attention to complaint, diagnosis, safety.",
"prompt": "You are DataGen, a domain-agnostic synthetic data generator...",
"domain": "medical-diagnosis",
"taskType": "instruction-response",
"qualityScore": 0.8167,
"generationTimestamp": "2026-03-20T00:00:00+00:00",
"groundingContext": [
"Summarize the patient's main complaint and likely differential diagnosis."
],
"lora": "/absolute/path/to/medical-lora.bin",
"seedInstruction": "Summarize the patient's main complaint and likely differential diagnosis.",
"seedResponse": "Restate the complaint, list the most likely causes, and flag any immediate safety concerns.",
"variation": "pattern-1",
"generatorModel": "bitnet-b1.58-sharp",
"tags": [
"synthetic",
"offline",
"pattern-1",
"medical",
"diagnosis"
]
}The trailing [sample N] suffix in instruction is part of the actual generated output. It gives each accepted example a stable batch-local ordinal so preview runs and large exports can be inspected without losing their original generation order.
The repository ships with a default JSON template at /templates/datagen/default.json. Templates expose the placeholders {domain}, {task_type}, {seed_examples}, {constraints}, {output_schema}, {count}, {sample_number}, {variation}, {seed_instruction}, and {seed_response}. The built-in variation patterns from the core generator are injected into that template so the two prompt systems stay merged rather than diverging, while the emitted JSON uses camelCase metadata fields such as taskType, qualityScore, generationTimestamp, and generatorModel.
The current implementation applies lightweight quality scoring to every accepted example:
- prompt/response schema validation
- self-consistency scoring across repeated BitNet cue generations
- lexical diversity scoring against previously accepted responses
Use a smaller preview run first, inspect the JSONL output, and then scale up counts once the prompt template and constraints match your target domain.
DataGen is intentionally local-first:
- generation runs entirely offline
- output stays in your working directory
- the same built-in BitNet model ID and optional LoRA path are recorded with every example for traceability