Feature/scripts and api optimizations by SimoneAriens · Pull Request #215 · NetherlandsForensicInstitute/scratch

SimoneAriens · 2026-04-03T13:38:14Z

Changes to support running 600k+ comparisons efficiently:

Skip plots: added skip_plots flag to score endpoints and client. Matplotlib plot generation was 96% of per-call API time (~3s). Skipping it drops calls to ~0.1s.
Upfront existence check: replaced 600k individual exists() calls with a single os.walk scan to find already-completed results.
Two-level output folders: {i // 1000:04d}/{i:06d} structure to avoid 600k entries in a single directory.
Vault cleanup: _cleanup_vault now runs in a try/finally, so /tmp/scratch_api is cleaned up even on partial download failures. Previously it grew without bound (was hitting 83GB+).
404 filtering: skip downloading plot URLs that don't exist when skip_plots=True.
Producer/consumer pattern in convert_marks.py: parallel API fetching with sequential disk writes.
Profiling instrumentation: timing logs in score endpoints and calculate_score.
Resilient error handling: all three scripts catch and log per-item failures instead of aborting the batch. calculate_score returns a ScoreStatus enum for structured summary reporting.
Connection reuse: requests.Session for connection pooling across downloads.
Lazy cross-product sampling: _different_source_pairs samples via cumulative indexing instead of materializing the full pool.
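
For reviewers, here is a minimal sketch of how the two-level output layout and the upfront existence check could fit together. It is not the code in this PR: the helper names result_dir and completed_indices and the marker file name result.json are assumptions made for illustration. Only the {i // 1000:04d}/{i:06d} pattern and the use of a single os.walk scan come from the description above.

```python
import os
from pathlib import Path


def result_dir(root: Path, i: int) -> Path:
    """Return the two-level output folder for comparison index i.

    Follows the {i // 1000:04d}/{i:06d} pattern, so each first-level
    folder holds at most 1000 result folders.
    """
    return root / f"{i // 1000:04d}" / f"{i:06d}"


def completed_indices(root: Path, marker: str = "result.json") -> set[int]:
    """Walk the output tree once and collect indices that already have results.

    `marker` is a hypothetical file name that signals a finished comparison;
    the real scripts may use a different completion criterion.
    """
    done: set[int] = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        if marker in filenames:
            name = os.path.basename(dirpath)
            # Only folders named by a numeric index count as results.
            if name.isdigit():
                done.add(int(name))
    return done


def pending(root: Path, total: int) -> list[int]:
    """Indices in range(total) that still need to be computed."""
    done = completed_indices(root)  # one directory scan instead of `total` exists() calls
    return [i for i in range(total) if i not in done]
```

With this layout, index 1234 maps to 0001/001234 under the output root, and a resumed run only needs the set returned by completed_indices to decide what to skip.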

github-actions · 2026-04-03T13:44:09Z

Diff Coverage

Diff: origin/main..HEAD, staged and unstaged changes

src/processors/router.py (100%)
src/processors/schemas.py (100%)

Summary

Total: 21 lines
Missing: 0 lines
Coverage: 100%

github-actions · 2026-04-03T13:44:10Z

Package	Line Rate	Branch Rate	Health
.	96%	92%	✔
computations	94%	67%	✔
container_models	99%	100%	✔
conversion	96%	89%	✔
conversion.export	99%	93%	✔
conversion.filter	97%	89%	✔
conversion.leveling	100%	100%	✔
conversion.leveling.solver	100%	75%	✔
conversion.plots	99%	88%	✔
conversion.preprocess_impression	99%	91%	✔
conversion.preprocess_striation	90%	62%	✔
conversion.profile_correlator	96%	82%	✔
conversion.surface_comparison	99%	89%	✔
conversion.surface_comparison.cell_registration	100%	90%	✔
extractors	97%	75%	✔
mutations	100%	100%	✔
parsers	97%	50%	✔
parsers.patches	89%	60%	✔
preprocessors	100%	100%	✔
processors	100%	75%	✔
renders	99%	50%	✔
utils	71%	100%	➖
Summary	98% (3261 / 3331)	87% (341 / 394)	✔

Minimum allowed line rate is 50%

SimoneAriens added 2 commits April 3, 2026 15:24

fix conversion scripts

04c5179

cleanup

90226b2

SimoneAriens requested review from cfs-data and vergep April 3, 2026 13:38

SimoneAriens marked this pull request as draft April 3, 2026 13:38

SimoneAriens wants to merge 2 commits into main from feature/scripts-and-api-optimizations
