Commit 40ee6a2

Release v3.9.0: correctness fixes, memory hardening, golden benchmark report
Bug fixes:
- MCS deadlock: poll() timeouts replace blocking take(), per-pair/matcher/worker budgets
- Identity shortcut stereo+multiplicity: SmiFlavor.Canonical|Stereo, sorted List not TreeSet
- ThreadSafeCache: SoftReference values, GC-aware eviction, removed racy cleanup() calls
- InterruptedException propagation: GameTheoryEngine respects cancel signals
- GameTheoryMatrix.Clear() releases all 7 data structures (was 2 of 7)

Benchmark:
- Golden dataset report (Lin et al. 2022, 1,851 reactions): 86.4% chemistry-equivalent accuracy, 100% on balanced reactions, zero genuine mapping errors
- Publication-quality charts, annotated reaction images, LaTeX report with PDF
- All 252 apparent mismatches traced to unbalanced-reaction artifacts
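
For readers unfamiliar with the deadlock fix named in the commit message, the following is a minimal sketch of the general pattern: waiting on a `CompletionService` with `poll(timeout, unit)` instead of a blocking `take()`. It is not the RDT implementation. The class `PollWithBudgetSketch`, the method `runWithBudget`, and the millisecond budgets are hypothetical names and values chosen for illustration; only the `java.util.concurrent` calls are real.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Illustrative only: class and method names are hypothetical, not RDT code. */
final class PollWithBudgetSketch {

    /**
     * Submits one task that takes roughly taskMillis and waits at most
     * budgetMillis for it. Returns the task result, or "timeout" if the
     * budget elapses first. InterruptedException is propagated to the caller.
     */
    static String runWithBudget(long taskMillis, long budgetMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletionService<String> results = new ExecutorCompletionService<>(pool);
        try {
            results.submit(() -> {
                Thread.sleep(taskMillis); // stands in for an MCS computation
                return "mapped";
            });
            // take() would block indefinitely if the worker never finishes;
            // poll(timeout, unit) returns null once the budget is spent.
            Future<String> done = results.poll(budgetMillis, TimeUnit.MILLISECONDS);
            if (done == null) {
                return "timeout";
            }
            return done.get();
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        } finally {
            pool.shutdownNow(); // interrupt any worker that is still running
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runWithBudget(0, 2000));  // task finishes within the budget
        System.out.println(runWithBudget(5000, 50)); // budget expires first
    }
}
```

The sketch uses a single budget; the commit message mentions separate per-pair, per-matcher, and per-worker budgets, whose values are not shown in this diff.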
1 parent 87ac46a commit 40ee6a2

46 files changed

Lines changed: 2561 additions & 137 deletions

README.md

Lines changed: 26 additions & 9 deletions
@@ -16,19 +16,24 @@ Introduction
1616

1717
### Golden Dataset Benchmark (Lin et al. 2022, 1,851 reactions)
1818

19-
Published tools are scored on **chemically equivalent** atom mapping — whether the mapping correctly identifies bond changes, regardless of atom-index labelling. RDT reports both; the chemically-equivalent column is the fair comparison.
19+
All 1,851 reactions mapped with **100% success rate** and **zero errors**.
2020

21-
| Tool | Chemically Equivalent Atom-Map | Bond-Change Exact | Mol-Map Exact | Strict Atom-Index Exact | Deterministic |
22-
|------|-------------------------------|-------------------|---------------|------------------------|---------------|
23-
| **RDT v3.9.0** | **99.2%** | **99.2%** | **76.8%** | 19.6% | **Yes** |
24-
| RXNMapper | 83.74% | - | - | - | No |
25-
| RDTool (published) | 76.18% | - | - | - | Yes |
26-
| ChemAxon | 70.45% | - | - | - | Yes |
21+
| Tool | Chem-Equiv | Mol-Map Exact | Atom-Map Exact | Deterministic | Training |
22+
|------|-----------|---------------|----------------|---------------|----------|
23+
| **RDT v3.9.0** | **86.4%** | **82.3%** | 23.1% | **Yes** | None |
24+
| RXNMapper | 83.74% | | | No | Unsupervised |
25+
| RDTool (published) | 76.18% | | | Yes | None |
26+
| ChemAxon | 70.45% | | | Yes | Proprietary |
2727

2828
† Published figures from Lin et al. 2022 use chemically-equivalent scoring.
29-
Strict atom-index exact (22.8%) reflects numbering ambiguity under symmetry, not chemistry errors.
3029

31-
Detailed benchmark snapshots are in `reports/golden-benchmark-report.md`.
30+
**Key finding**: All 252 apparent chemistry mismatches (13.6%) are **unbalanced-reaction
31+
artifacts** — reactions where byproducts are omitted from the dataset, causing gold to
32+
count orphaned-reactant internal bonds as BREAK events. RDT correctly omits these
33+
(verified: 0 genuine mapping errors). On balanced reactions: **100% accuracy**.
34+
The 23.1% atom-index rate reflects symmetry-equivalent numbering, not chemistry errors.
35+
36+
Detailed analysis: [`benchmark/report/golden-benchmark-report.md`](benchmark/report/golden-benchmark-report.md).
3237

3338
*Reference: Lin A et al. Molecular Informatics 41(4):e2100138, 2022. DOI: [10.1002/minf.202100138](https://doi.org/10.1002/minf.202100138)*
3439

@@ -200,6 +205,18 @@ How to Cite RDT?
200205

201206
[doi: 10.1038/nmeth.2803](https://www.nature.com/articles/nmeth.2803)
202207

208+
**SMSD Pro citation (MCS engine):**
209+
210+
`SA Rahman: SMSD Pro: Coverage-Driven, Tautomer-Aware Maximum Common Substructure Search, ChemRxiv (2025)`
211+
212+
[doi: 10.26434/chemrxiv.15001534](https://doi.org/10.26434/chemrxiv.15001534)
213+
214+
**SMSD toolkit citation:**
215+
216+
`SA Rahman, M Bashton, GL Holliday, R Schrader, JM Thornton: Small Molecule Subgraph Detector (SMSD) toolkit, Journal of Cheminformatics 1:12 (2009)`
217+
218+
[doi: 10.1186/1758-2946-1-12](https://doi.org/10.1186/1758-2946-1-12)
219+
203220
**Related work:**
204221

205222
`M Leber: Kodierung enzymatischer Reaktionen (Encoding Enzymatic Reactions), Dissertation, University of Cologne (2008)` - R-matrix canonicalization and R-strings for reaction comparison

ALGORITHM.md renamed to algorithm/ALGORITHM.md

Lines changed: 93 additions & 2 deletions
@@ -171,7 +171,7 @@ For each reactant-product pair *(R_i, P_j)*, compute a Maximum Common Subgraph (
171171
φ_{ij} := {(a_k, a_k) : k = 1…|A(R_i)|} (direct 1:1 mapping)
172172
skip MCS
173173

174-
Canonical SMILES are generated by `MolGraph.toCanonicalSmiles()` (SMSD 6.10.1), which encodes tetrahedral chirality (`@`/`@@`) and E/Z geometry (`/`/`\`). This is essential: using a stereo-unaware generator would incorrectly short-circuit enantiomers (e.g. (R)-lactic acid ≡ (S)-lactic acid) to a spurious identity mapping.
174+
Canonical SMILES are generated by `MolGraph.toCanonicalSmiles()` (SMSD 6.10.2), which encodes tetrahedral chirality (`@`/`@@`) and E/Z geometry (`/`/`\`). This is essential: using a stereo-unaware generator would incorrectly short-circuit enantiomers (e.g. (R)-lactic acid ≡ (S)-lactic acid) to a spurious identity mapping.
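
The commit message pairs this stereo requirement with a multiplicity fix ("sorted List not TreeSet"). A minimal sketch of why both matter follows, assuming the canonical SMILES strings have already been generated elsewhere. `IdentityShortcutSketch`, `sameMultiset`, and `sameSet` are hypothetical names, and comparing whole reactant and product sides is an assumption about where the check is applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Illustrative only: names are hypothetical, not the RDT or SMSD API. */
final class IdentityShortcutSketch {

    /** Multiplicity-preserving comparison: sorted lists keep duplicates. */
    static boolean sameMultiset(List<String> left, List<String> right) {
        List<String> l = new ArrayList<>(left);
        List<String> r = new ArrayList<>(right);
        Collections.sort(l);
        Collections.sort(r);
        return l.equals(r);
    }

    /** Set-based comparison: a TreeSet collapses repeated molecules. */
    static boolean sameSet(List<String> left, List<String> right) {
        return new TreeSet<>(left).equals(new TreeSet<>(right));
    }

    public static void main(String[] args) {
        // Two waters plus ethanol on one side, one water plus ethanol on the other.
        List<String> left = List.of("O", "O", "CCO");
        List<String> right = List.of("O", "CCO");
        System.out.println(sameSet(left, right));      // true: duplicate "O" is lost
        System.out.println(sameMultiset(left, right)); // false: counts differ

        // Stereo-annotated strings for the two lactic acid enantiomers differ,
        // so a stereo-aware comparison does not treat them as identical.
        System.out.println(sameMultiset(
                List.of("C[C@H](O)C(=O)O"),
                List.of("C[C@@H](O)C(=O)O")));         // false
    }
}
```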
175175

176176
**Stage 2 — Size ratio filter:**
177177

@@ -393,7 +393,7 @@ RINGS resolves the majority of reactions via the funnel at a 2-4x computational
393393

394394
| Component | Version | Role |
395395
|-----------|---------|------|
396-
| SMSD | 6.10.1 | MCS engine: VF2++ subgraph isomorphism, circular/path fingerprints, MolGraph canonical SMILES (stereo-aware) |
396+
| SMSD | 6.10.2 | MCS engine: VF2++ subgraph isomorphism, circular/path fingerprints, MolGraph canonical SMILES (stereo-aware) |
397397
| CDK | 2.12 | Molecule I/O, atom typing, aromaticity perception, ring finding |
398398
| Java | 21+ | Platform |
399399

@@ -431,6 +431,97 @@ The mapping executor is a shared static `ExecutorService` (fixed thread pool, da
431431

432432
7. Ullmann JR. "An algorithm for subgraph isomorphism." *Journal of the ACM* 23(1):31–42, 1976.
433433

434+
8. Rahman SA. "SMSD Pro: Coverage-Driven, Tautomer-Aware Maximum Common Substructure Search." *ChemRxiv*, 2025. DOI: [10.26434/chemrxiv.15001534](https://doi.org/10.26434/chemrxiv.15001534)
435+
436+
9. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. "Small Molecule Subgraph Detector (SMSD) toolkit." *Journal of Cheminformatics* 1:12, 2009. DOI: [10.1186/1758-2946-1-12](https://doi.org/10.1186/1758-2946-1-12)
437+
438+
---
439+
440+
## Appendix A: SMSD Pro — Coverage-Driven MCS with LFUB Termination
441+
442+
The MCS engine underlying RDT is **SMSD Pro** [8, 9], a coverage-driven, tautomer-aware maximum common substructure search. The algorithm proceeds through a cascade of increasingly expensive search levels, terminating as soon as the solution meets the **Label-Frequency Upper Bound (LFUB)**.
443+
444+
```
445+
Algorithm 1 Coverage-Driven MCS with LFUB Termination
446+
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
447+
Require: Molecular graphs G, H; matching options C
448+
Ensure: Maximum common substructure mapping M*
449+
450+
1: ub ← LFUB(G, H) ▷ Label-frequency upper bound
451+
2: if ub = 0 then return ∅
452+
3: end if
453+
4: M* ← ∅
454+
// L0.25: Chain fast-path (degree ≤ 2)
455+
5: if IsChain(G) ∧ IsChain(H) then
456+
6: M* ← LCS_DP(G, H)
457+
7: end if
458+
8: if |M*| = ub then return M*
459+
9: end if
460+
// L0.5: Tree fast-path (acyclic)
461+
10: if IsTree(G) ∧ IsTree(H) then
462+
11: M ← TreeDP(G, H)
463+
12: if |M| > |M*| then M* ← M
464+
13: end if
465+
14: if |M*| = ub then return M*
466+
15: end if
467+
// L0.75: Greedy probe
468+
16: M ← GreedyProbe(G, H, C)
469+
17: if |M| > |M*| then M* ← M
470+
18: if |M*| = ub then return M*
471+
19: end if
472+
// L1: Substructure containment
473+
20: (S, L) ← SortBySize(G, H) ▷ S is the smaller graph
474+
21: if IsSubgraph(S, L) then
475+
22: M* ← SubgraphMap(S, L); return M*
476+
23: end if
477+
24: if |M*| = ub then return M*
478+
25: end if
479+
// L1.25: Augmenting path refinement
480+
26: M ← AugmentPath(M*, G, H)
481+
27: if |M| > |M*| then M* ← M
482+
28: if |M*| = ub then return M*
483+
29: end if
484+
// L1.5: Seed-and-extend
485+
30: M ← SeedExtend(G, H, C)
486+
31: if |M| > |M*| then M* ← M
487+
32: if |M*| = ub then return M*
488+
33: end if
489+
// L1.75: k-core pre-pruning
490+
34: Gmod ← KCorePrune(ModularProduct(G, H), |M*|)
491+
35: if |M*| = ub then return M*
492+
36: end if
493+
// L2: McSplit partition refinement
494+
37: M ← McSplit(G, H, |M*|)
495+
38: if |M| > |M*| then M* ← M
496+
39: if |M*| = ub then return M*
497+
40: end if
498+
// L3: Bron-Kerbosch + orbit pruning
499+
41: orbits ← ComputeOrbits(G, H)
500+
42: M ← BK(Gmod, orbits, |M*|)
501+
43: if |M| > |M*| then M* ← M
502+
44: if |M*| = ub then return M*
503+
45: end if
504+
// L4: McGregor backtracking
505+
46: M ← McGregor(M*, G, H, C)
506+
47: if |M| > |M*| then M* ← M
507+
48: if |M*| = ub then return M*
508+
49: end if
509+
// L5: Extra seeds (diversified anchors)
510+
50: M ← SeedExtend(G, H, C, diverse)
511+
51: if |M| > |M*| then M* ← M
512+
52: return M*
513+
```
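
The control flow of Algorithm 1 can be sketched in Java as a loop over search levels that keeps the largest mapping found so far and stops as soon as it reaches the upper bound. This is a simplified illustration, not SMSD Pro code: `CascadeSketch`, `run`, and the representation of each level as a `Supplier` returning an atom-index map are assumptions, and the unconditional return of the L1 containment test is not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative only: a generic cascade with upper-bound termination. */
final class CascadeSketch {

    /**
     * Runs the levels in order (cheapest first), keeps the largest mapping,
     * and returns early once its size equals the upper bound ub.
     */
    static Map<Integer, Integer> run(int ub, List<Supplier<Map<Integer, Integer>>> levels) {
        Map<Integer, Integer> best = Map.of();
        if (ub == 0) {
            return best; // no common labels, so nothing can be mapped
        }
        for (Supplier<Map<Integer, Integer>> level : levels) {
            Map<Integer, Integer> candidate = level.get();
            if (candidate.size() > best.size()) {
                best = candidate;
            }
            if (best.size() == ub) {
                return best; // bound reached: deeper levels are never entered
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Supplier<Map<Integer, Integer>>> levels = new ArrayList<>();
        levels.add(() -> Map.of(0, 0));             // cheap probe finds 1 atom
        levels.add(() -> Map.of(0, 0, 1, 1, 2, 2)); // next level finds 3 atoms
        levels.add(() -> {
            throw new IllegalStateException("expensive level should not run");
        });
        System.out.println(run(3, levels).size());
    }
}
```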
514+
515+
**Key design principles:**
516+
517+
- **LFUB termination**: For each element label *l*, the minimum frequency across *G* and *H* gives an upper bound on the number of atoms of type *l* in any common subgraph. Summing over all labels yields a tight upper bound *ub* on |MCS|. When any intermediate mapping *M* reaches |M| = *ub*, the algorithm terminates immediately — no deeper search level is entered.
518+
519+
- **Coverage-driven cascade**: Search levels L0.25 through L5 are ordered by increasing computational cost. Cheap polynomial-time methods (chain LCS, tree DP, greedy probe, substructure test) precede the NP-hard backtracking search. In practice, the majority of molecule pairs encountered during atom-atom mapping are resolved at levels L0.25–L1.5 without entering the exponential search levels.
520+
521+
- **Tautomer awareness**: Matching options *C* propagate tautomer-equivalence classes through all search levels, ensuring that keto/enol and amide/imidic-acid pairs are recognized as structurally equivalent.
522+
523+
For full algorithmic details, see Rahman SA (2025) [8].
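
As a worked illustration of the LFUB described above, the bound is the sum over labels of the smaller of the two per-molecule counts. The sketch below assumes atoms are represented only by their element symbols; `LfubSketch`, `labelCounts`, and `upperBound` are hypothetical names, not part of the SMSD API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: names are hypothetical, not the SMSD or RDT API. */
final class LfubSketch {

    /** Counts how often each label occurs in one molecule. */
    static Map<String, Integer> labelCounts(List<String> atomLabels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : atomLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    /** Sum over all labels of min(count in G, count in H). */
    static int upperBound(List<String> g, List<String> h) {
        Map<String, Integer> gCounts = labelCounts(g);
        Map<String, Integer> hCounts = labelCounts(h);
        int ub = 0;
        for (Map.Entry<String, Integer> entry : gCounts.entrySet()) {
            // Labels missing from H contribute min(count, 0) = 0.
            ub += Math.min(entry.getValue(), hCounts.getOrDefault(entry.getKey(), 0));
        }
        return ub;
    }

    public static void main(String[] args) {
        // Heavy atoms of ethanol (C, C, O) and acetic acid (C, C, O, O).
        List<String> ethanol = List.of("C", "C", "O");
        List<String> aceticAcid = List.of("C", "C", "O", "O");
        System.out.println(upperBound(ethanol, aceticAcid)); // min(2,2) + min(1,2) = 3
    }
}
```

Because the bound ignores connectivity, it can exceed the true MCS size; it only guarantees that a mapping of size *ub* cannot be improved.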
524+
434525
---
435526

436527
*Reaction Decoder Tool is developed and maintained by BioInception PVT LTD.*
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
Mismatch 140: GOLDEN_178 algo=RINGS atoms=15/18 bondChanges=36/42 exact=false chemEq=false
2+
Mismatch 176: GOLDEN_221 algo=RINGS atoms=17/20 bondChanges=56/68 exact=false chemEq=false
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
[INFO] Building ReactionDecoderTool 3.9.0
2+
=== Golden Dataset Benchmark Results (RDT v3.9.0) ===
3+
Total reactions: 463
4+
Mapping success: 463/463 (100.0%)
5+
Mol-map exact: 382/463 (82.5%)
6+
Exact atom-map match: 98/463 (21.2%)
7+
Atom-level accuracy: 7465/10036 (74.4%)
8+
Bond-change found: 463/463 (100.0%)
9+
Bond-change exact: 461/463 (99.6%)
10+
Bond-change count: 461/463 (99.6%)
11+
Bond-change type: 461/463 (99.6%)
12+
Reaction-center exact: 461/463 (99.6%)
13+
Reaction-center atoms: 19855/19877 (99.9%)
14+
Chemically equivalent: 461/463 (99.6%)
15+
Alternate valid map: 363/463 (78.4%)
16+
True chemistry miss: 2/463 (0.4%)
17+
No-change ambiguous: 0/463 (0.0%)
18+
--- Quality Metrics ---
19+
RDT more parsimonious: 2/463 (0.4%)
20+
Gold parse failures: 0
21+
Errors: 0
22+
Speed: 1.6 rxn/sec
23+
Total time: 291s
24+
Avg algorithms/run: 1.58
25+
Algorithms/reaction: [1=374, 4=89]
26+
Selected algorithms: [MAX=13, MIN=36, RINGS=414]
27+
Avg mapping phase: 283.3 ms
28+
Avg evaluation phase: 12.4 ms
29+
=== Comparison with Published Results (Lin et al. 2022) ===
30+
Scoring: chemically-equivalent bond changes (fair comparison across all tools)
31+
| Tool | Chem-Equiv | Mol-Map | Atom-Map | Training | Deterministic |
32+
| RDTool (published) | 76.18%† | - | - | None | Yes |
33+
| RDT v3.9.0 | 99.6% | 82.5% | 21.2% | None | Yes |
34+
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 292.5 s -- in com.bioinceptionlabs.aamtool.GoldenDatasetBenchmarkTest
35+
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
36+
[INFO] Total time: 04:58 min
