This project trains a character-level Markov model to predict the next character given the last k characters (order k). It supports classic add-α (Laplace) smoothing and Witten–Bell interpolated backoff for better accuracy on unseen contexts.
- Training scans the text and counts transitions from each context (last k chars) to the next char.
- Smoothing avoids zero probabilities. Add-α adds a small amount of probability mass to every symbol; Witten–Bell backs off to lower orders based on how many unique next symbols were seen in a context.
- Prediction uses either the precomputed k-gram probabilities (add-α) or Witten–Bell interpolation across orders 1..k.
- Generation samples or picks the argmax given a seed string.
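The counting and add-α steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code; the names `train_counts` and `add_alpha_prob` are made up for the example:

```python
from collections import defaultdict

def train_counts(text, k):
    """Count transitions from each k-char context to the next char."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(text)):
        context, nxt = text[i - k:i], text[i]
        counts[context][nxt] += 1
    return counts

def add_alpha_prob(counts, context, char, alphabet, alpha=0.5):
    """Add-α smoothed P(char | context): every symbol gets α extra mass."""
    ctx = counts.get(context, {})
    total = sum(ctx.values()) + alpha * len(alphabet)
    return (ctx.get(char, 0) + alpha) / total

text = "abracadabra"
counts = train_counts(text, 2)        # order k = 2
alphabet = sorted(set(text))
p = add_alpha_prob(counts, "ab", "r", alphabet, alpha=0.5)
```

Because every symbol receives α extra mass, unseen characters get a small but nonzero probability, at the cost of slightly flattening the observed distribution.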
- Prepare a plain text file, e.g., markov_char\data.txt.
- Train and evaluate (order 2, add-α smoothing):

      python -m markov_char.cli train --input markov_char\data.txt --order-k 2 --add-alpha 0.5 --lowercase --model-out model.json

- Train with Witten–Bell interpolation for higher accuracy (recommended when you have enough data):

      python -m markov_char.cli train --input markov_char\data.txt --order-k 3 --witten-bell --lowercase --model-out model.json

- Evaluate a saved model:

      python -m markov_char.cli eval --model model.json --input markov_char\data.txt

- Generate text:

      # Deterministic (argmax)
      python -m markov_char.cli generate --model model.json --seed-text "interconnec" --length 1 --temperature 0

      # Stochastic with fixed RNG seed (reproducible)
      python -m markov_char.cli generate --model model.json --seed-text "interconnec" --length 20 --temperature 0.9 --seed-rng 123

- Inspect next-char probabilities for a context:

      python -m markov_char.cli probs --model model.json --context "interconnec" --topk 10

- Benchmark against a brute-force baseline (accuracy, perplexity, search-space reduction):

      python -m markov_char.cli benchmark --model model.json --input markov_char\data.txt
- train:
  - --input: path to text file
  - --order-k: Markov order (memory length); larger k needs more data
  - --add-alpha: add-α smoothing (0 disables); used when NOT using Witten–Bell
  - --witten-bell: enable Witten–Bell interpolated backoff across orders 1..k
  - --lowercase: lowercase normalization at training time
  - --strip-whitespace, --remove-punctuation, --allowed-extra: control preprocessing
  - --train-frac, --val-frac, --seed: dataset split and reproducibility
  - --model-out: save trained model (JSON)
- eval:
  - --model: path to model JSON
  - --input: path to text (same domain/preprocessing as training)
  - --train-frac, --val-frac, --seed: derive the test split from the file
- generate:
  - --model: path to model JSON
  - --seed-text: initial text; only the last k chars are used as context
  - --length: number of characters to generate
  - --temperature: 0 = argmax (deterministic); >0 applies stochastic sampling
  - --seed-rng: random seed for reproducible sampling
- probs:
  - --model: path to model JSON
  - --context: show the next-char distribution for this history
  - --topk: number of top characters to print
- benchmark:
  - --model: path to model JSON
  - --input: path to text (same domain/preprocessing as training)
  - --train-frac, --val-frac, --seed: derive the test split from the file
  - --n-tokens: limit the number of test tokens (optional)
  - --seed-bench: random seed for the brute-force baseline (default: 42)
- Add-α (Laplace): adds α count to each symbol before normalizing. Good baseline; too large α can over-flatten probabilities.
- Witten–Bell backoff (recommended): interpolates higher-order estimates with lower orders according to the ratio of total events (N) vs unique continuations (T) in the context:
- λ = N / (N + T)
- P_k = λ · MLE_k + (1 − λ) · P_{k−1}
- Falls back to uniform at order 0. Helps greatly with unseen/rare contexts.
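The recursion above can be sketched directly from the two formulas. This is an illustrative sketch of the scheme as described, not the project's implementation; `ngram_counts` and `witten_bell` are made-up names, and the context is assumed to be at least k characters long (shorter contexts simply back off):

```python
from collections import defaultdict

def ngram_counts(text, max_k):
    """Transition counts for every order 1..max_k: context -> next-char counts."""
    counts = {k: defaultdict(lambda: defaultdict(int)) for k in range(1, max_k + 1)}
    for k in range(1, max_k + 1):
        for i in range(k, len(text)):
            counts[k][text[i - k:i]][text[i]] += 1
    return counts

def witten_bell(counts, alphabet, context, char, k):
    """P_k = λ·MLE_k + (1−λ)·P_{k−1} with λ = N/(N+T); uniform at order 0."""
    if k == 0:
        return 1.0 / len(alphabet)           # uniform base distribution
    ctx = counts[k].get(context[-k:], {})    # counts for the last k chars
    lower = witten_bell(counts, alphabet, context, char, k - 1)
    n = sum(ctx.values())                    # N: total events in this context
    if n == 0:
        return lower                         # unseen context: back off entirely
    t = len(ctx)                             # T: unique continuations seen
    lam = n / (n + t)
    return lam * ctx.get(char, 0) / n + (1 - lam) * lower

text = "abracadabra"
counts = ngram_counts(text, 2)
alphabet = sorted(set(text))
p_r = witten_bell(counts, alphabet, "ab", "r", 2)
```

Because each level mixes λ of its own MLE with (1 − λ) of the lower order, the result is a proper distribution at every order, and contexts with many distinct continuations (large T) lean more heavily on the backoff estimate.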
- "ModuleNotFoundError: No module named 'markov_char'": run commands from the project root (e.g., D:\CNS ISE), not inside the markov_char folder, or set PYTHONPATH to the parent directory.
- Predictions look random for short contexts: ensure your prompt matches the training normalization (e.g., use lowercase if you trained with --lowercase). With --temperature 0, you'll see the deterministic argmax.
- Very flat probabilities: increase the data size, reduce --add-alpha, or enable --witten-bell. Consider increasing --order-k if you have enough data.
- Data size and domain match matter most. Use more text similar to your prompts.
- Choose order k appropriate to data: k=2–3 for small datasets; k=3–5 for larger.
- Prefer --witten-bell for better generalization to unseen contexts.
- Normalize consistently: if you train with --lowercase, prompt in lowercase.
- Evaluate via perplexity/accuracy on a held-out test split and pick the k that minimizes perplexity.
- For hard accuracy comparisons, use --temperature 0 when generating.
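Picking k by held-out perplexity, as the tips suggest, amounts to a small sweep. The helper below is a self-contained illustration using a simple add-α model (not the project's CLI; `perplexity` is a made-up name for the example):

```python
import math
from collections import defaultdict

def perplexity(train, val, k, alpha=0.5):
    """Validation perplexity of an order-k add-α model (illustrative helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(train)):
        counts[train[i - k:i]][train[i]] += 1
    alphabet = sorted(set(train) | set(val))
    v = len(alphabet)
    bits, n = 0.0, 0
    for i in range(k, len(val)):
        ctx = counts.get(val[i - k:i], {})
        p = (ctx.get(val[i], 0) + alpha) / (sum(ctx.values()) + alpha * v)
        bits += -math.log2(p)
        n += 1
    return 2 ** (bits / n)

train, val = "abracadabra " * 50, "abracadabra"
best_k = min(range(1, 5), key=lambda k: perplexity(train, val, k))
```

The same sweep applies to the real CLI: train once per k, run eval on the same held-out split, and keep the order with the lowest reported perplexity.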
The benchmark command compares your trained model against a brute-force (uniform random) baseline and reports:
- Accuracy: percentage of correct next-character predictions
- Cross-entropy: average bits per prediction (lower is better)
- Perplexity: effective uncertainty in next-char distribution (lower is better)
- Search space reduction: how much the model narrows the effective alphabet size
- Speed: prediction rate in tokens/second
- Improvement: percentage gains over brute-force
Example output shows the Markov model typically achieves 20–50% accuracy vs ~3% for uniform, with 90%+ search-space reduction.
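The metrics above follow from standard definitions and can be computed for any next-char predictor. This is an illustrative sketch, not the benchmark command's actual code; `evaluate` and `predict_dist` are made-up names, and the predictor is assumed to return a dict mapping each character to its probability:

```python
import math

def evaluate(predict_dist, text, k):
    """Cross-entropy (bits/char), perplexity, and top-1 accuracy
    of a next-char predictor over a test string."""
    bits, correct, n = 0.0, 0, 0
    for i in range(k, len(text)):
        dist = predict_dist(text[i - k:i])      # char -> probability
        p = max(dist.get(text[i], 0.0), 1e-12)  # floor to avoid log(0)
        bits += -math.log2(p)
        correct += (max(dist, key=dist.get) == text[i])
        n += 1
    cross_entropy = bits / n
    return cross_entropy, 2 ** cross_entropy, correct / n

# Uniform brute-force baseline over a 26-letter alphabet:
alphabet = "abcdefghijklmnopqrstuvwxyz"
uniform = {c: 1 / len(alphabet) for c in alphabet}
h, ppl, acc = evaluate(lambda ctx: uniform, "abracadabra", 2)
```

Perplexity can be read as the effective alphabet size the predictor must search (26 for the uniform baseline above), so one natural way to report search-space reduction is 1 − perplexity / |alphabet|; the exact definition used by the benchmark command may differ.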
- Python 3.9+
- Optional: matplotlib (future plotting extension)