
# 📧 Email Calendar Evaluation Pipeline


*Rules vs. LLM: who extracts calendar events from messy emails better?*

A Python evaluation pipeline that benchmarks a rule-based baseline against a local LLM
on structured extraction from raw email-style messages.

Getting Started · Results · Dataset · Methods


## 🧠 The Problem

Assistant-style products need to pull actionable structure out of messy text: calendar events, reminders, deadlines, action items. That gets hard fast when messages contain multiple dates, relative time expressions, cancellations, ambiguous phrasing, and inconsistent formatting.

This project builds a repeatable evaluation pipeline around that problem rather than just running a single model and eyeballing the output.

📨 Raw Email ──▶ 🔧 Rule-Based + LLM Extractors ──▶ 📊 Evaluate & Score ──▶ 📈 Compare & Visualise

๐Ÿ” What It Extracts

| Field | Description |
|---|---|
| `calendar_event_required` | Should an event be created? |
| `event_category` | Type of event |
| `event_date` | When it happens |
| `start_time` / `end_time` | Time window |
| `action_required` | Does the reader need to do something? |
| `action_type` | What kind of action |
| `action_deadline` | By when |
| `summary` | Short description |
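For concreteness, a hypothetical extraction record for a single message might look like this (the values are invented; the field names and category labels come from the schema in this README):

```python
import json

# Illustrative output for one message; values are made up.
record = {
    "calendar_event_required": True,
    "event_category": "trip",
    "event_date": "2025-06-05",
    "start_time": "09:00",
    "end_time": "15:00",
    "action_required": True,
    "action_type": "submit_form",
    "action_deadline": "2025-05-30",
    "summary": "Year 5 museum trip; permission slip due by 30 May.",
}
print(json.dumps(record, indent=2))
```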

Two extraction methods are benchmarked side-by-side against a 40-row labelled benchmark (20 synthetic + 20 Enron-derived).


## 📊 Results at a Glance

| Metric | Rule-Based Baseline | Qwen 3 8B | Edge |
|---|---|---|---|
| Avg latency | 95 ms | 24,516 ms | 🟦 Baseline |
| Calendar event F1 | 0.917 | 0.902 | 🟦 Baseline |
| Action required F1 | 0.760 | 0.964 | 🟧 Qwen |
| Event category macro F1 | 0.776 | 0.733 | 🟦 Baseline |
| Action type macro F1 | 0.629 | 0.819 | 🟧 Qwen |
| Event date accuracy | 0.700 | 0.875 | 🟧 Qwen |
| Action deadline accuracy | 0.800 | 0.800 | ⬜ Tie |

> [!TIP]
> **Bottom line:** Neither system wins outright. The strongest practical outcome is a hybrid approach: let rules handle the easy cases fast, and route the harder cases to an LLM.
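A hybrid router also appears under possible extensions below; a toy sketch of what the routing decision could look like (the signal patterns and threshold are illustrative, not from this repo):

```python
import re

# Hypothetical "hard case" cues: schedule changes, relative dates,
# multiple competing date-like tokens.
HARD_SIGNALS = [
    r"\bpostponed?\b|\bcancell?ed\b|\bmoved\b",
    r"\b(next|this)\s+(week|month)\b",
    r"\d{1,2}[/-]\d{1,2}.*\d{1,2}[/-]\d{1,2}",
]

def route(message: str) -> str:
    """Send messages with hard-to-parse cues to the LLM, the rest to rules."""
    hits = sum(bool(re.search(p, message, re.IGNORECASE)) for p in HARD_SIGNALS)
    return "llm" if hits >= 1 else "rules"
```

With the latency gap above, even routing a minority of messages to the LLM preserves most of the baseline's speed advantage.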

**📈 Charts** (generated to `outputs/charts/`): metric comparison, failure count by field, and latency comparison.


๐Ÿ“ Dataset

**Benchmark composition**

| Source | Rows | Examples |
|---|---|---|
| Synthetic | 20 | Trip reminders, parent meetings, club updates, payment deadlines, cancellations |
| Enron-derived | 20 | Real corporate email language, messier formatting, less predictable structure |

**Enron data pipeline**

```
1,000 raw emails → −32 dupes → 778 clean → 120 candidates → 20 labelled
```

Raw Enron maildir is not committed (too large). The repo includes cleaned artefacts, labelled data, and all outputs.
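The dedupe stage could be as simple as dropping exact duplicates on normalised body text; a minimal sketch (not the repo's `clean_real_world_data.py`, whose actual rules may differ):

```python
import re

def drop_duplicate_bodies(rows):
    """Keep the first occurrence of each normalised message body.

    rows: list of dicts with a "body" key (hypothetical column name).
    """
    seen, kept = set(), []
    for row in rows:
        # Normalise case and whitespace so trivially reformatted copies collide.
        key = re.sub(r"\s+", " ", row["body"].lower()).strip()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```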


โš™๏ธ Methods

### 🟦 Rule-based baseline

Keyword matching, regex, date parsing, and priority rules. Fast, interpretable, easy to debug. Falls over when wording is indirect or when multiple temporal cues compete.
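A toy version of the keyword-plus-regex-plus-priority idea (illustrative only; the real rules live in `src/baseline_extractor.py`):

```python
import re

# Dict order encodes rule priority: first matching category wins.
CATEGORY_KEYWORDS = {
    "cancellation_change": ["cancelled", "canceled", "postponed", "rescheduled"],
    "payment_deadline": ["payment", "invoice", "pay by"],
    "trip": ["trip", "excursion"],
    "meeting_admin": ["meeting", "agenda"],
}
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def classify_category(text: str) -> str:
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "none"

def extract_date(text: str):
    """Return the first DD/MM/YYYY date as ISO, or None if absent."""
    m = DATE_PATTERN.search(text)
    if not m:
        return None
    day, month, year = m.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"
```

The failure mode is exactly as described: nothing here copes with "the week after half term" or with two plausible dates in one message.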

### 🟧 Qwen 3 8B (Ollama)

Local LLM with structured JSON output, a fixed schema, and deterministic prompting (temperature = 0). Better at reading between the lines on action intent and date/time extraction. The trade-off? ~258× slower (95 ms vs ~24.5 s per message).
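A minimal sketch of how such a call to Ollama's HTTP API could look (the model tag, prompt wording, and endpoint are assumptions; the repo's actual `llm_extractor.py` is not shown here):

```python
import json
import urllib.request

SCHEMA_FIELDS = [
    "calendar_event_required", "event_category", "event_date",
    "start_time", "end_time", "action_required", "action_type",
    "action_deadline", "summary",
]

def build_request(email_text: str) -> dict:
    """Payload for Ollama's /api/chat: JSON-constrained output, temperature 0."""
    prompt = (
        "Extract the following fields from the email as a JSON object: "
        + ", ".join(SCHEMA_FIELDS)
        + "\n\nEmail:\n" + email_text
    )
    return {
        "model": "qwen3:8b",  # assumed Ollama tag for Qwen 3 8B
        "messages": [{"role": "user", "content": prompt}],
        "format": "json",               # ask Ollama to emit valid JSON
        "stream": False,
        "options": {"temperature": 0},  # deterministic decoding
    }

def extract(email_text: str, url: str = "http://localhost:11434/api/chat") -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(email_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return json.loads(reply["message"]["content"])
```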


๐Ÿ“ Evaluation Approach

| Type | Fields | Metrics |
|---|---|---|
| Classification | `calendar_event_required`, `action_required`, `event_category`, `action_type` | Precision, recall, F1, macro F1 |
| Extraction | `event_date`, `start_time`, `end_time`, `action_deadline` | Exact-match accuracy |
| Operational | – | Average latency (ms) |
| Error analysis | All fields | Field-level failure counts |
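The classification and extraction metrics reduce to a few lines each; a minimal sketch (not the repo's `evaluate_predictions.py`):

```python
def f1(preds, golds, positive=True):
    """Binary F1 over aligned prediction/gold lists."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def exact_match_accuracy(preds, golds):
    """Share of rows where the extracted value equals the label exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

Macro F1 is then just the unweighted mean of per-class F1 scores, which is why rare categories weigh as heavily as common ones in the tables above.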

๐Ÿ—‚๏ธ Extraction Schema

**Event categories**

`none` · `meeting_admin` · `club_activity` · `trip` · `payment_deadline` · `cancellation_change` · `reminder_other`

**Action types**

`none` · `attend` · `pay` · `reply_confirm` · `bring_item` · `submit_form`
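Enforcing these closed label sets on both extractors' outputs can be sketched as follows (a minimal validator; the repo's `schemas.py` is not shown here and may work differently):

```python
EVENT_CATEGORIES = {
    "none", "meeting_admin", "club_activity", "trip",
    "payment_deadline", "cancellation_change", "reminder_other",
}
ACTION_TYPES = {"none", "attend", "pay", "reply_confirm", "bring_item", "submit_form"}

def validate(record: dict) -> dict:
    """Coerce out-of-vocabulary labels to 'none' so both extractors
    emit the same closed schema."""
    if record.get("event_category") not in EVENT_CATEGORIES:
        record["event_category"] = "none"
    if record.get("action_type") not in ACTION_TYPES:
        record["action_type"] = "none"
    return record
```

This matters mostly for the LLM side, where free-text generations can drift outside the enum even with JSON-constrained output.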


## 🚀 Quickstart

**Requirements**

- Python 3.14
- Ollama with Qwen 3 8B pulled locally (for the LLM evaluation only)

**Install**

```shell
pip install -r requirements.txt
```

**Run the pipeline**

```shell
# 1. Validate and split the dataset
python src/build_dataset.py

# 2. Run the rule-based baseline
python src/baseline_extractor.py

# 3. Evaluate baseline
python src/evaluate_predictions.py
python src/analyse_failures.py

# 4. Run the LLM extractor (requires Ollama + Qwen 3 8B)
python src/llm_extractor.py

# 5. Evaluate LLM output
#    Update file paths in evaluate_predictions.py and analyse_failures.py
#    to point to the Qwen output, then:
python src/evaluate_predictions.py
python src/analyse_failures.py

# 6. Generate comparison charts
python src/generate_visualisations.py
```

> [!NOTE]
> Steps 3 and 5 require updating the input file paths in the evaluation scripts depending on which extractor output you're evaluating. This is documented in the script comments.

**🔄 Rebuild the Enron data stages**

```shell
python src/extract_enron_messages.py
python src/clean_real_world_data.py
python src/select_enron_eval_candidates.py
python src/build_enron_label_template.py
python src/append_enron_labels.py
```

๐Ÿ—ƒ๏ธ Repo Structure

```
email-calendar-evaluation-pipeline/
│
├── 📂 data/
│   ├── raw/                          # Enron maildir (local only, not committed)
│   ├── intermediate/
│   │   ├── enron_messages_raw.csv
│   │   ├── enron_messages_clean.csv
│   │   └── enron_eval_candidates.csv
│   └── processed/
│       ├── eval_dataset.csv
│       ├── dev_dataset.csv
│       ├── test_dataset.csv
│       ├── enron_label_template.csv
│       └── enron_label_template_labeled.csv
│
├── 📂 docs/
│   └── label_guide.md
│
├── 📂 outputs/
│   ├── baseline_predictions.csv
│   ├── qwen_predictions.csv
│   ├── summary_metrics.csv
│   ├── field_metrics.csv
│   ├── failure_summary.csv
│   ├── qwen_summary_metrics.csv
│   ├── qwen_field_metrics.csv
│   ├── qwen_failure_summary.csv
│   └── charts/
│       ├── metric_comparison.png
│       ├── failure_comparison.png
│       └── latency_comparison.png
│
├── 📂 src/
│   ├── build_dataset.py
│   ├── baseline_extractor.py
│   ├── llm_extractor.py
│   ├── evaluate_predictions.py
│   ├── analyse_failures.py
│   ├── extract_enron_messages.py
│   ├── clean_real_world_data.py
│   ├── select_enron_eval_candidates.py
│   ├── build_enron_label_template.py
│   ├── append_enron_labels.py
│   ├── generate_visualisations.py
│   └── schemas.py
│
├── requirements.txt
└── README.md
```

๐Ÿ› ๏ธ Implementation Notes

A few things that mattered more than expected in practice:

- Handling both clean benchmark timestamps and messy Enron-style timestamps required separate parsing paths
- Enron filenames with trailing dots caused issues on Windows
- Schema consistency across baseline and LLM outputs needed explicit enforcement
- Separating raw → intermediate → processed data stages kept things debuggable

โš ๏ธ Known Limitations

| Limitation | Detail |
|---|---|
| Single-message only | No email threading support |
| One event + one action | At most one of each per message |
| No rich media | No attachment, image, or PDF processing |
| No location extraction | Not in schema |
| No recurring events | Single occurrence only |
| Small benchmark | 40 rows; directional, not production-grade |
| One LLM tested | Only Qwen 3 8B in the final comparison |

## 🔮 Possible Extensions

- Expand the labelled benchmark with more Enron rows
- Benchmark a second local model (Mistral, Llama, etc.)
- Build a hybrid router that sends easy cases to rules and hard cases to the LLM
- Improve action deadline handling (the weakest field for both methods)
- Add confusion matrices and per-category breakdowns
- Introduce softer scoring for the summary field

## 🧰 Built With

| Tool | Role |
|---|---|
| Python 3.14 | Core pipeline |
| Qwen 3 8B | Local LLM via Ollama |
| pandas | Data wrangling and evaluation |
| matplotlib | Visualisations |
| Enron Email Corpus | Real-world test data |

Built by Shawn D'Souza

Licensed under MIT
