@@ -16,19 +16,24 @@ Introduction
### Golden Dataset Benchmark (Lin et al. 2022, 1,851 reactions)
Published tools are scored on **chemically equivalent** atom mapping: whether the mapping correctly identifies bond changes, regardless of atom-index labelling. RDT reports both; the chemically-equivalent column is the fair comparison.
All 1,851 reactions were mapped with a **100% success rate** and **zero errors**.
Canonical SMILES are generated by `MolGraph.toCanonicalSmiles()` (SMSD 6.10.1), which encodes tetrahedral chirality (`@`/`@@`) and E/Z geometry (`/`/`\`). This is essential: using a stereo-unaware generator would incorrectly short-circuit enantiomers (e.g. (R)-lactic acid ≡ (S)-lactic acid) to a spurious identity mapping.
Canonical SMILES are generated by `MolGraph.toCanonicalSmiles()` (SMSD 6.10.2), which encodes tetrahedral chirality (`@`/`@@`) and E/Z geometry (`/`/`\`). This is essential: using a stereo-unaware generator would incorrectly short-circuit enantiomers (e.g. (R)-lactic acid ≡ (S)-lactic acid) to a spurious identity mapping.
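The short-circuit risk described above can be illustrated with a toy example. This is only a sketch: `stripStereo` is a hypothetical stand-in for a stereo-unaware generator (it simply deletes the `@`, `/` and `\` markers from a SMILES string) and is not part of the SMSD or CDK API. The two input strings are the enantiomers of lactic acid written with opposite chirality tags.

```java
final class StereoShortCircuitDemo {

    // Hypothetical stand-in for a stereo-unaware generator: removes the
    // chirality tags (@, @@) and double-bond geometry markers (/, \).
    static String stripStereo(String smiles) {
        return smiles.replaceAll("[@/\\\\]", "");
    }

    public static void main(String[] args) {
        String a = "C[C@H](O)C(=O)O";   // one enantiomer of lactic acid
        String b = "C[C@@H](O)C(=O)O";  // its mirror image

        // With stereo markers the strings differ, so no identity shortcut is taken.
        System.out.println(a.equals(b));                           // false
        // Without them the strings collide and look like an identity mapping.
        System.out.println(stripStereo(a).equals(stripStereo(b))); // true
    }
}
```

A string comparison used as an identity pre-check is only safe when the canonical form keeps these markers.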
**Stage 2 — Size ratio filter:**
@@ -393,7 +393,7 @@ RINGS resolves the majority of reactions via the funnel at a 2-4x computational
| CDK | 2.12 | Molecule I/O, atom typing, aromaticity perception, ring finding |
| Java | 21+ | Platform |
@@ -431,6 +431,97 @@ The mapping executor is a shared static `ExecutorService` (fixed thread pool, da
7. Ullmann JR. "An algorithm for subgraph isomorphism." *Journal of the ACM* 23(1):31–42, 1976.
8. Rahman SA. "SMSD Pro: Coverage-Driven, Tautomer-Aware Maximum Common Substructure Search." *ChemRxiv*, 2025. DOI: [10.26434/chemrxiv.15001534](https://doi.org/10.26434/chemrxiv.15001534)
9. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. "Small Molecule Subgraph Detector (SMSD) toolkit." *Journal of Cheminformatics* 1:12, 2009. DOI: [10.1186/1758-2946-1-12](https://doi.org/10.1186/1758-2946-1-12)

---
## Appendix A: SMSD Pro — Coverage-Driven MCS with LFUB Termination
The MCS engine underlying RDT is **SMSD Pro** [8, 9], a coverage-driven, tautomer-aware maximum common substructure search. The algorithm proceeds through a cascade of increasingly expensive search levels, terminating as soon as the solution meets the **Label-Frequency Upper Bound (LFUB)**.
```
Algorithm 1  Coverage-Driven MCS with LFUB Termination
Require: Molecular graphs G, H; matching options C
Ensure:  Maximum common substructure mapping M*

 1: ub ← LFUB(G, H)                          ▷ Label-frequency upper bound
 2: if ub = 0 then return ∅
 3: end if
 4: M* ← ∅
    // L0.25: Chain fast-path (degree ≤ 2)
 5: if IsChain(G) ∧ IsChain(H) then
 6:     M* ← LCS_DP(G, H)
 7: end if
 8: if |M*| = ub then return M*
 9: end if
    // L0.5: Tree fast-path (acyclic)
10: if IsTree(G) ∧ IsTree(H) then
11:     M ← TreeDP(G, H)
12:     if |M| > |M*| then M* ← M
13: end if
14: if |M*| = ub then return M*
15: end if
    // L0.75: Greedy probe
16: M ← GreedyProbe(G, H, C)
17: if |M| > |M*| then M* ← M
18: if |M*| = ub then return M*
19: end if
    // L1: Substructure containment
20: (S, L) ← SortBySize(G, H)                ▷ S is the smaller graph
21: if IsSubgraph(S, L) then
22:     M* ← SubgraphMap(S, L); return M*
23: end if
24: if |M*| = ub then return M*
25: end if
    // L1.25: Augmenting path refinement
26: M ← AugmentPath(M*, G, H)
27: if |M| > |M*| then M* ← M
28: if |M*| = ub then return M*
29: end if
    // L1.5: Seed-and-extend
30: M ← SeedExtend(G, H, C)
31: if |M| > |M*| then M* ← M
32: if |M*| = ub then return M*
33: end if
    // L1.75: k-core pre-pruning
34: Gmod ← KCorePrune(ModularProduct(G, H), |M*|)
35: if |M*| = ub then return M*
36: end if
    // L2: McSplit partition refinement
37: M ← McSplit(G, H, |M*|)
38: if |M| > |M*| then M* ← M
39: if |M*| = ub then return M*
40: end if
    // L3: Bron-Kerbosch + orbit pruning
41: orbits ← ComputeOrbits(G, H)
42: M ← BK(Gmod, orbits, |M*|)
43: if |M| > |M*| then M* ← M
44: if |M*| = ub then return M*
45: end if
    // L4: McGregor backtracking
46: M ← McGregor(M*, G, H, C)
47: if |M| > |M*| then M* ← M
48: if |M*| = ub then return M*
49: end if
    // L5: Extra seeds (diversified anchors)
50: M ← SeedExtend(G, H, C, diverse)
51: if |M| > |M*| then M* ← M
52: return M*
```
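The control flow of Algorithm 1 reduces to a loop over search levels with an early exit. The Java sketch below shows only that skeleton and is not the SMSD Pro implementation. The names `CascadeSketch` and `run` are hypothetical, a mapping is modelled as a plain `Map<Integer, Integer>` from atom indices in G to atom indices in H, and each level is a function that receives the current best mapping as an optional seed. The immediate return of the containment level (L1) is not modelled separately.

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class CascadeSketch {

    /** Runs levels in order of cost and stops once the best mapping reaches the upper bound. */
    static Map<Integer, Integer> run(int upperBound,
                                     List<UnaryOperator<Map<Integer, Integer>>> levels) {
        Map<Integer, Integer> best = Map.of();               // M* starts empty
        if (upperBound == 0) return best;                    // nothing can be matched
        for (UnaryOperator<Map<Integer, Integer>> level : levels) {
            Map<Integer, Integer> candidate = level.apply(best);
            if (candidate.size() > best.size()) best = candidate;
            if (best.size() == upperBound) return best;      // bound reached: skip deeper levels
        }
        return best;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        List<UnaryOperator<Map<Integer, Integer>>> levels = List.of(
            seed -> { calls[0]++; return Map.of(0, 0); },          // cheap level, partial result
            seed -> { calls[0]++; return Map.of(0, 0, 1, 1); },    // reaches the bound
            seed -> { calls[0]++; throw new IllegalStateException("should not run"); });

        Map<Integer, Integer> result = run(2, levels);
        System.out.println(result.size() + " atoms mapped after " + calls[0] + " levels");
        // prints "2 atoms mapped after 2 levels"
    }
}
```

Because the bound check sits inside the loop, the expensive third level is never invoked in this example.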
**Key design principles:**
- **LFUB termination**: For each element label *l*, the smaller of its frequencies in *G* and *H* bounds the number of atoms of type *l* in any common subgraph. Summing over all labels yields a tight upper bound *ub* on |MCS|. When any intermediate mapping *M* reaches |M| = *ub*, the algorithm terminates immediately and no deeper search level is entered.
- **Coverage-driven cascade**: Search levels L0.25 through L5 are ordered by increasing computational cost. Cheap polynomial-time methods (chain LCS, tree DP, greedy probe, substructure test) precede the NP-hard backtracking search. In practice, the majority of molecule pairs encountered during atom-atom mapping are resolved at levels L0.25–L1.5 without entering the exponential search levels.
- **Tautomer awareness**: Matching options *C* propagate tautomer-equivalence classes through all search levels, ensuring that keto/enol and amide/imidic-acid pairs are recognized as structurally equivalent.
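The label-frequency bound from the first principle can be written as ub = Σ_l min(n_G(l), n_H(l)), where n_G(l) is the number of atoms with label *l* in *G*. The sketch below computes it for two small molecules. It is a minimal illustration that assumes element symbols as the only labels; the class and method names (`LfubSketch`, `labelCounts`, `lfub`) are hypothetical and do not come from SMSD Pro.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LfubSketch {

    /** Counts how often each label occurs in one molecule. */
    static Map<String, Integer> labelCounts(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) counts.merge(label, 1, Integer::sum);
        return counts;
    }

    /** Sums, over all labels, the smaller of the two per-molecule counts. */
    static int lfub(List<String> g, List<String> h) {
        Map<String, Integer> countsG = labelCounts(g);
        Map<String, Integer> countsH = labelCounts(h);
        int ub = 0;
        for (Map.Entry<String, Integer> e : countsG.entrySet()) {
            // Labels missing from H contribute 0.
            ub += Math.min(e.getValue(), countsH.getOrDefault(e.getKey(), 0));
        }
        return ub;
    }

    public static void main(String[] args) {
        List<String> ethanol = List.of("C", "C", "O");          // heavy atoms of CCO
        List<String> aceticAcid = List.of("C", "C", "O", "O");  // heavy atoms of CC(=O)O
        System.out.println(lfub(ethanol, aceticAcid));          // 3 = min(2,2) + min(1,2)
    }
}
```

The bound ignores bonds entirely, which is why it costs only a single pass over each atom list.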

For full algorithmic details, see Rahman SA (2025) [8].

---
*Reaction Decoder Tool is developed and maintained by BioInception PVT LTD.*