Rules vs. LLM: who extracts calendar events from messy emails better?
A Python evaluation pipeline that benchmarks a rule-based baseline against a local LLM
on structured extraction from raw email-style messages.
Getting Started · Results · Dataset · Methods
Assistant-style products need to pull actionable structure out of messy text: calendar events, reminders, deadlines, action items. That gets hard fast when messages contain multiple dates, relative time expressions, cancellations, ambiguous phrasing, and inconsistent formatting.
This project builds a repeatable evaluation pipeline around that problem rather than just running a single model and eyeballing the output.
Raw Email ──▶ Rule-Based + LLM Extractor ──▶ Evaluate & Score ──▶ Compare & Visualise
| Field | Description |
|---|---|
| `calendar_event_required` | Should an event be created? |
| `event_category` | Type of event |
| `event_date` | When it happens |
| `start_time` / `end_time` | Time window |
| `action_required` | Does the reader need to do something? |
| `action_type` | What kind of action |
| `action_deadline` | By when |
| `summary` | Short description |
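For concreteness, a single extracted record might look like the sketch below. The field names follow the schema table above and the enum values come from the category/type lists later in this README; the email content itself is invented for illustration.

```python
# Hypothetical example of one extracted record using the schema above.
# The concrete values (dates, summary text) are invented; only the field
# names and enum vocabularies come from this README.
example_record = {
    "calendar_event_required": True,
    "event_category": "trip",            # one of the event categories
    "event_date": "2001-05-14",          # ISO date
    "start_time": "09:00",
    "end_time": "15:00",
    "action_required": True,
    "action_type": "submit_form",        # one of the action types
    "action_deadline": "2001-05-10",
    "summary": "Year 5 museum trip; permission slip due Friday.",
}
```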
Two extraction methods are compared side-by-side on a 40-row labelled benchmark (20 synthetic + 20 Enron-derived).
| Metric | Rule-Based Baseline | Qwen 3 8B | Edge |
|---|---|---|---|
| Avg latency | 95 ms | 24,516 ms | Baseline |
| Calendar event F1 | 0.917 | 0.902 | Baseline |
| Action required F1 | 0.760 | 0.964 | Qwen |
| Event category macro F1 | 0.776 | 0.733 | Baseline |
| Action type macro F1 | 0.629 | 0.819 | Qwen |
| Event date accuracy | 0.700 | 0.875 | Qwen |
| Action deadline accuracy | 0.800 | 0.800 | Tie |
> [!TIP]
> **Bottom line:** Neither system wins outright. The strongest practical outcome is a hybrid approach: let rules handle the easy cases fast and route the harder ones to an LLM.
| Source | Rows | Examples |
|---|---|---|
| Synthetic | 20 | Trip reminders, parent meetings, club updates, payment deadlines, cancellations |
| Enron-derived | 20 | Real corporate email language, messier formatting, less predictable structure |
1,000 raw emails → -32 dupes → 778 clean → 120 candidates → 20 labelled
Raw Enron maildir is not committed (too large). The repo includes cleaned artefacts, labelled data, and all outputs.
Keyword matching, regex, date parsing, and priority rules. Fast, interpretable, easy to debug. Falls over when wording is indirect or when multiple temporal cues compete.
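The rule-based idea can be sketched in a few lines: keyword lists decide the category and a regex pulls the first explicit date. This is a minimal illustration, not the project's actual rule set, which adds priority rules and many more patterns.

```python
import re
from datetime import datetime

# Illustrative keyword lists; the real baseline's vocabulary and
# priority ordering are more extensive.
CATEGORY_KEYWORDS = {
    "payment_deadline": ["payment", "invoice", "pay by"],
    "cancellation_change": ["cancelled", "canceled", "postponed"],
    "trip": ["trip", "excursion"],
}

# Matches US-style numeric dates like 05/10/2001.
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")

def rule_extract(text: str) -> dict:
    lower = text.lower()
    category = "none"
    for cat, words in CATEGORY_KEYWORDS.items():
        if any(w in lower for w in words):
            category = cat
            break
    m = DATE_RE.search(text)
    event_date = (
        datetime.strptime(m.group(1), "%m/%d/%Y").date().isoformat()
        if m else None
    )
    return {"event_category": category, "event_date": event_date}
```

This is exactly where the baseline falls over: "the invoice we discussed last Tuesday is due a week from Friday" has no regex-friendly date at all.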
Local LLM with structured JSON output, fixed schema, and deterministic prompting (temperature=0). Better at reading between the lines on action intent and date/time extraction. The trade-off? Roughly 258× slower (95 ms vs. 24.5 seconds per message).
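The fixed-schema requirement implies a validation step between the LLM's raw reply and scoring. A minimal sketch of that enforcement (field names and categories from this README; the function name is hypothetical and the project's actual `schemas.py` may differ):

```python
import json

# Field set and category vocabulary as listed in this README.
REQUIRED_FIELDS = {
    "calendar_event_required", "event_category", "event_date",
    "start_time", "end_time", "action_required", "action_type",
    "action_deadline", "summary",
}
EVENT_CATEGORIES = {
    "none", "meeting_admin", "club_activity", "trip",
    "payment_deadline", "cancellation_change", "reminder_other",
}

def parse_llm_reply(raw: str) -> dict:
    """Parse the model's JSON reply and reject schema violations."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["event_category"] not in EVENT_CATEGORIES:
        raise ValueError(f"bad category: {record['event_category']}")
    return record
```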
| Type | Fields | Metrics |
|---|---|---|
| Classification | `calendar_event_required`, `action_required`, `event_category`, `action_type` | Precision, recall, F1, macro F1 |
| Extraction | `event_date`, `start_time`, `end_time`, `action_deadline` | Exact match accuracy |
| Operational | – | Average latency (ms) |
| Error analysis | All fields | Field-level failure counts |
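The two scoring styles in the table reduce to a few lines each. A sketch, assuming predictions and gold labels arrive as parallel lists (illustrative; the project computes these over the prediction CSVs with pandas):

```python
from collections import defaultdict

def exact_match_accuracy(pred: list, gold: list) -> float:
    """Share of fields where prediction equals the gold label exactly."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def macro_f1(pred: list, gold: list) -> float:
    """Per-class F1 averaged over classes, so rare classes count equally."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for p, g in zip(pred, gold):
        if p == g:
            counts[g]["tp"] += 1
        else:
            counts[p]["fp"] += 1
            counts[g]["fn"] += 1
    f1s = []
    for c in counts.values():
        prec = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```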
Event categories
none · meeting_admin · club_activity · trip · payment_deadline · cancellation_change · reminder_other
Action types
none · attend · pay · reply_confirm · bring_item · submit_form
- Python 3.14
- Ollama with Qwen 3 8B pulled locally (LLM evaluation only)
```
pip install -r requirements.txt

# 1. Validate and split the dataset
python src/build_dataset.py

# 2. Run the rule-based baseline
python src/baseline_extractor.py

# 3. Evaluate baseline
python src/evaluate_predictions.py
python src/analyse_failures.py

# 4. Run the LLM extractor (requires Ollama + Qwen 3 8B)
python src/llm_extractor.py

# 5. Evaluate LLM output
# Update file paths in evaluate_predictions.py and analyse_failures.py
# to point to the Qwen output, then:
python src/evaluate_predictions.py
python src/analyse_failures.py

# 6. Generate comparison charts
python src/generate_visualisations.py
```

> [!NOTE]
> Steps 3 and 5 require you to update the input file paths in the evaluation scripts depending on which extractor output you're evaluating. This is documented in the script comments.
Rebuild the Enron data stages
```
python src/extract_enron_messages.py
python src/clean_real_world_data.py
python src/select_enron_eval_candidates.py
python src/build_enron_label_template.py
python src/append_enron_labels.py
```

```
email-calendar-evaluation-pipeline/
│
├── data/
│   ├── raw/                    # Enron maildir (local only, not committed)
│   ├── intermediate/
│   │   ├── enron_messages_raw.csv
│   │   ├── enron_messages_clean.csv
│   │   └── enron_eval_candidates.csv
│   └── processed/
│       ├── eval_dataset.csv
│       ├── dev_dataset.csv
│       ├── test_dataset.csv
│       ├── enron_label_template.csv
│       └── enron_label_template_labeled.csv
│
├── docs/
│   └── label_guide.md
│
├── outputs/
│   ├── baseline_predictions.csv
│   ├── qwen_predictions.csv
│   ├── summary_metrics.csv
│   ├── field_metrics.csv
│   ├── failure_summary.csv
│   ├── qwen_summary_metrics.csv
│   ├── qwen_field_metrics.csv
│   ├── qwen_failure_summary.csv
│   └── charts/
│       ├── metric_comparison.png
│       ├── failure_comparison.png
│       └── latency_comparison.png
│
├── src/
│   ├── build_dataset.py
│   ├── baseline_extractor.py
│   ├── llm_extractor.py
│   ├── evaluate_predictions.py
│   ├── analyse_failures.py
│   ├── extract_enron_messages.py
│   ├── clean_real_world_data.py
│   ├── select_enron_eval_candidates.py
│   ├── build_enron_label_template.py
│   ├── append_enron_labels.py
│   ├── generate_visualisations.py
│   └── schemas.py
│
├── requirements.txt
└── README.md
```
A few things that mattered more than expected in practice:
- Handling both clean benchmark timestamps and messy Enron-style timestamps required separate parsing paths
- Enron filenames with trailing dots caused issues on Windows
- Schema consistency across baseline and LLM outputs needed explicit enforcement
- Separating raw → intermediate → processed data stages kept things debuggable
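The dual-parsing-path note above boils down to trying formats in order: the clean benchmark format first, then messier email-style fallbacks. A minimal sketch (the format list here is illustrative, not the project's actual list):

```python
from datetime import datetime

# Candidate formats, tried in order. Illustrative examples only:
# the real pipeline's format list differs.
FORMATS = [
    "%Y-%m-%d %H:%M",             # clean benchmark timestamps
    "%a, %d %b %Y %H:%M:%S %z",   # RFC-2822-style email Date headers
    "%m/%d/%Y %I:%M %p",          # US-style with am/pm
]

def parse_timestamp(value: str):
    """Return a datetime on the first format that matches, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None  # unparseable; surfaces as a failure in error analysis
```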
| Limitation | Detail |
|---|---|
| Single-message only | No email threading support |
| One event + one action | Per message |
| No rich media | No attachment, image, or PDF processing |
| No location extraction | Not in schema |
| No recurring events | Single occurrence only |
| Small benchmark | 40 rows: directional, not production-grade |
| One LLM tested | Only Qwen 3 8B in the final comparison |
- Expand the labelled benchmark with more Enron rows
- Benchmark a second local model (Mistral, Llama, etc.)
- Build a hybrid router that sends easy cases to rules and hard cases to the LLM
- Improve action deadline handling (weakest field for both methods)
- Add confusion matrices and per-category breakdowns
- Introduce softer scoring for the summary field
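The hybrid-router idea from the list above could start as simply as running the cheap rules first and escalating only when the rule pass looks uncertain. A sketch with purely illustrative escalation heuristics (the trigger phrases and function names are assumptions, not project code):

```python
# Phrases that often signal competing or shifted temporal cues,
# where the baseline's metrics above are weakest. Illustrative only.
AMBIGUITY_TRIGGERS = ("reschedul", "postpon", "instead of", "unless", "tbc")

def needs_llm(text: str, rule_result: dict) -> bool:
    """Decide whether the rule output is trustworthy enough to keep."""
    lower = text.lower()
    if rule_result.get("event_category") == "none" and "?" in text:
        return True  # rules found nothing, but the message asks something
    if any(t in lower for t in AMBIGUITY_TRIGGERS):
        return True  # likely cancellation/change or relative-date wording
    return False

def hybrid_extract(text: str, rule_fn, llm_fn) -> dict:
    """Fast path through rules; slow path through the LLM when needed."""
    result = rule_fn(text)
    return llm_fn(text) if needs_llm(text, result) else result
```

With latencies of ~95 ms vs ~24.5 s, every message the router keeps on the rule path is a large win, so the escalation rate matters more than the router's own cost.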
| Tool | Role |
|---|---|
| Python 3.14 | Core pipeline |
| Qwen 3 8B | Local LLM via Ollama |
| pandas | Data wrangling and evaluation |
| matplotlib | Visualisations |
| Enron Email Corpus | Real-world test data |
Built by Shawn D'Souza
Licensed under MIT


