|
| 1 | +# Reaction Decoder Tool (RDT) v3.7.0 — Algorithm Description |
| 2 | + |
| 3 | +**Authors:** Syed Asad Rahman |
| 4 | +**Contact:** asad.rahman@bioinceptionlabs.com |
| 5 | +**License:** GNU LGPL v3.0 |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## 1. Overview |
| 10 | + |
| 11 | +The Reaction Decoder Tool (RDT) performs deterministic atom-atom mapping (AAM) for chemical reactions without any training data or machine learning. Given an input reaction (reactants and products), RDT identifies which atom in each reactant corresponds to which atom in each product, enabling the identification of reaction centres, bond changes, and reaction mechanisms. |
| 12 | + |
| 13 | +**Key Innovation:** A multi-algorithm ensemble approach with game-theory-inspired matrix optimization. Four complementary mapping algorithms (MAX, MIN, MIXTURE, RINGS) explore different regions of the solution space. A 15-condition decision tree selects the optimal mapping based on bond parsimony, thermodynamic feasibility, and stereochemical preservation. |
| 14 | + |
| 15 | +**Benchmark Result:** 96.0% atom-level accuracy on the Lin et al. (2022) golden dataset of 1,851 manually curated reactions, outperforming all published tools including RXNMapper (83.74%) and the original RDTool (76.18%), without any training data. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## 2. Algorithm Pipeline |
| 20 | + |
| 21 | +The algorithm proceeds through nine sequential stages: |
| 22 | + |
| 23 | +``` |
| 24 | +Input Reaction |
| 25 | + | |
| 26 | + v |
| 27 | +[Stage 1] Parsing & Preprocessing |
| 28 | + | |
| 29 | + v |
| 30 | +[Stage 2] Reaction Standardization (reagent filtering, atom balance) |
| 31 | + | |
| 32 | + v |
| 33 | +[Stage 3] RINGS Funnel (quality gate) |
| 34 | + | |
| 35 | + v yes |
| 36 | +[Stage 3a] RINGS sufficient? ---------> Use RINGS mapping |
| 37 | + | |
| 38 | + | no |
| 39 | + v |
| 40 | +[Stage 4] Parallel execution: MIN, MAX, MIXTURE (+RINGS if not run) |
| 41 | + | |
| 42 | + v |
| 43 | +[Stage 5] Pairwise MCS computation (substructure + VFLibMCS) |
| 44 | + | |
| 45 | + v |
| 46 | +[Stage 6] Game theory matrix construction (7 scoring matrices) |
| 47 | + | |
| 48 | + v |
| 49 | +[Stage 7] Algorithm-specific winner selection |
| 50 | + | |
| 51 | + v |
| 52 | +[Stage 8] Cross-algorithm solution selection (15-condition tree) |
| 53 | + | |
| 54 | + v |
| 55 | +[Stage 9] Bond change annotation & output |
| 56 | +``` |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +### Stage 1: Input & Preprocessing |
| 61 | + |
| 62 | +**Input formats:** Reaction SMILES, RXN (V2000/V3000), RDF, or CDK IReaction objects. |
| 63 | + |
| 64 | +**Preprocessing steps:** |
| 65 | +1. Parse reaction into separate reactant and product molecule containers |
| 66 | +2. Set all null implicit hydrogen counts to zero |
| 67 | +3. Remove any pre-existing atom-atom mapping properties |
| 68 | +4. Perceive atom types and aromaticity using CDK's `AtomTypeMatcher` and `Aromaticity` models |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +### Stage 2: Reaction Standardization |
| 73 | + |
| 74 | +**Purpose:** Remove non-reactive species (solvents, catalysts, reagents) to focus mapping on the reacting core. |
| 75 | + |
| 76 | +**Three-tier reagent filter:** |
| 77 | + |
| 78 | +| Tier | Method | Criteria | |
| 79 | +|------|--------|----------| |
| 80 | +| **1. Known reagents** | Canonical SMILES lookup | Match against database of 30+ common solvents (DCM, DMSO, DMF, THF, water, etc.) and inorganic salts. **Atom-balance guard:** only filter if removing the molecule does not unbalance the reaction. | |
| 81 | +| **2. Catalyst metals** | Element symbol check | Molecules containing Pd, Pt, Rh, Ru, Ir, Ni, Cu, Fe, Co, Mn, Ti, Zr, Mo, W, Os, Ag, Au | |
| 82 | +| **3. Fingerprint similarity** | ECFP4 Tanimoto | For each reactant, compute maximum Tanimoto similarity to any product using ECFP4 circular fingerprints (radius=2, 256 bits). If max similarity < 0.4, heavy atom count <= 10, and no unique element contributions to products: classify as reagent. **Atom-balance guard** prevents filtering genuine reactants (e.g., water in hydrolysis). | |
| 83 | + |
| 84 | +**Atom balance validation:** Count all non-hydrogen atoms per element in reactants vs products. Log warning if imbalanced (does not prevent mapping, as many real reactions are intentionally unbalanced in their written form). |
| 85 | + |
| 86 | +--- |
| 87 | + |
| 88 | +### Stage 3: RINGS Funnel Architecture |
| 89 | + |
| 90 | +**Purpose:** Avoid unnecessary computation by testing if the ring-conservation algorithm alone produces a sufficient mapping. |
| 91 | + |
| 92 | +**Algorithm:** |
| 93 | +1. If `totalMolecules <= 5`: execute RINGS algorithm in a single thread |
| 94 | +2. Evaluate quality: does the mapping cover >= 95% of non-hydrogen atoms? |
| 95 | +3. Verify non-identity: confirm the reaction involves actual structural changes (not a transporter) |
| 96 | +4. **Decision:** If RINGS mapping is sufficient, skip the remaining three algorithms |
| 97 | + |
| 98 | +**Efficiency gain:** Approximately 75% of reactions are resolved by RINGS alone, providing a 2-4x speedup over the full four-algorithm pipeline. |
| 99 | + |
| 100 | +--- |
| 101 | + |
| 102 | +### Stage 4: Parallel Multi-Algorithm Execution |
| 103 | + |
| 104 | +When RINGS is insufficient, the remaining algorithms execute in parallel: |
| 105 | + |
| 106 | +| Algorithm | Strategy | Objective | |
| 107 | +|-----------|----------|-----------| |
| 108 | +| **MAX** | Global maximization | Maximize total mapped atoms across all reactant-product pairs | |
| 109 | +| **MIN** | Local minimization | Minimize total bond changes (parsimony principle) | |
| 110 | +| **MIXTURE** | Hybrid max-min | Maximize mapped atoms with secondary minimization of bond changes | |
| 111 | +| **RINGS** | Ring-centric | Prioritize ring system preservation and aromatic skeleton conservation | |
| 112 | + |
| 113 | +**Execution:** Fixed thread pool with `min(available_processors - 1, 4)` threads. Results collected via `CompletionService` in order of completion. |
| 114 | + |
| 115 | +--- |
| 116 | + |
| 117 | +### Stage 5: Pairwise MCS Computation |
| 118 | + |
| 119 | +For each reactant-product pair *(R_i, P_j)*, compute the Maximum Common Subgraph (MCS) to establish atom correspondences. |
| 120 | + |
| 121 | +#### 5.1 Pre-filtering |
| 122 | + |
| 123 | +Three pre-filters eliminate pairs unlikely to share meaningful substructure: |
| 124 | + |
| 125 | +| Filter | Condition | Rationale | |
| 126 | +|--------|-----------|-----------| |
| 127 | +| **Identity** | Canonical SMILES equality | Identical molecules still require MCS for correct atom indexing | |
| 128 | +| **Size ratio** | `min(atoms_i, atoms_j) / max(atoms_i, atoms_j) < 0.3` | Highly dissimilar sizes indicate unrelated molecules | |
| 129 | +| **Fingerprint** | `Tanimoto(FP_i, FP_j) < 0.05` and both > 5 atoms | Structurally unrelated by path fingerprint comparison | |
| 130 | + |
| 131 | +#### 5.2 Tiered Substructure Matching |
| 132 | + |
| 133 | +For each pair, attempt substructure isomorphism with progressively relaxed matching criteria: |
| 134 | + |
| 135 | +``` |
| 136 | +Tier 1: AtomType=strict, BondOrder=flexible, RingMatch=strict |
| 137 | + | |
| 138 | + | (if no subgraph found) |
| 139 | + v |
| 140 | +Tier 2: AtomType=element-only, BondOrder=flexible, RingMatch=strict |
| 141 | + | |
| 142 | + | (if no subgraph found) |
| 143 | + v |
| 144 | +Tier 3: AtomType=element-only, BondOrder=flexible, RingMatch=relaxed |
| 145 | +``` |
| 146 | + |
| 147 | +Each tier uses VF2++ subgraph isomorphism via the SMSD engine with a 5-second timeout. |
| 148 | + |
| 149 | +#### 5.3 Full MCS Fallback |
| 150 | + |
| 151 | +If no substructure relationship exists (neither molecule is a subgraph of the other), compute the full Maximum Common Subgraph using the SMSD MCS algorithm. |
| 152 | + |
| 153 | +**Optimization:** A `ThreadSafeCache` stores MCS results keyed by canonical SMILES + matcher configuration + circular fingerprint hash. This enables cross-reaction reuse when the same molecule pair appears in different reactions. |
| 154 | + |
| 155 | +#### 5.4 Circular Fingerprint Cache |
| 156 | + |
| 157 | +Each molecule's FCFP (Functional-Class Fingerprint, radius=1, 256 bits) is computed once and cached in an `IdentityHashMap`. The fingerprint hash is included in the MCS cache key to ensure uniqueness. |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +### Stage 6: Game Theory Matrix Construction |
| 162 | + |
| 163 | +For each algorithm, construct seven scoring matrices of dimension *(|Reactants| x |Products|)*: |
| 164 | + |
| 165 | +| Matrix | Symbol | Formula | Description | |
| 166 | +|--------|--------|---------|-------------| |
| 167 | +| **Clique** | *C(i,j)* | `|MCS(R_i, P_j)|` | Number of atoms in the maximum common subgraph | |
| 168 | +| **Graph Similarity** | *G(i,j)* | `|MCS| / (|R_i| + |P_j| - |MCS|)` | Jaccard index of the MCS | |
| 169 | +| **Stereo** | *S(i,j)* | From SMSD stereochemistry analysis | Stereochemical compatibility score | |
| 170 | +| **Energy** | *E(i,j)* | From SMSD bond energy matrix | Bond dissociation energy of the mapping | |
| 171 | +| **Fragment** | *F(i,j)* | From SMSD fragment analysis | Number of disconnected fragments in the mapping | |
| 172 | +| **Carbon Overlap** | *K(i,j)* | `|{a in MCS : symbol(a) = C}|` | Carbon atoms preserved in the mapping | |
| 173 | +| **FP Similarity** | *T(i,j)* | `Tanimoto(FP_i, FP_j)` | Path fingerprint Tanimoto similarity | |
| 174 | + |
| 175 | +--- |
| 176 | + |
| 177 | +### Stage 7: Algorithm-Specific Winner Selection |
| 178 | + |
| 179 | +Each algorithm iteratively selects the best reactant-product pair from the matrices: |
| 180 | + |
| 181 | +**MAX Algorithm:** |
| 182 | +1. Find pair *(i,j)* with maximum *G(i,j)* that is a **major subgraph** in both row and column (no other pair in the same row or column has a larger clique) |
| 183 | +2. Record mapping, remove pair from matrices |
| 184 | +3. Repeat until all atoms mapped or no valid pairs remain |
| 185 | + |
| 186 | +**MIN Algorithm:** |
| 187 | +1. Find pair *(i,j)* with minimum *F(i,j)* (fewest fragment changes) that is a **minor subgraph** (smallest unique clique in its row or column) |
| 188 | +2. Record mapping, remove pair from matrices |
| 189 | +3. Repeat |
| 190 | + |
| 191 | +**MIXTURE Algorithm:** |
| 192 | +1. Iterations 1-5: use MIN-style selection (parsimony) |
| 193 | +2. Iterations 6+: switch to MAX-style selection (coverage) |
| 194 | + |
| 195 | +**RINGS Algorithm:** |
| 196 | +1. Prioritize pairs where `numberOfCycles(R_i) == numberOfCycles(P_j)` (ring count compatibility) |
| 197 | +2. Use energy matrix *E(i,j)* to break ties |
| 198 | +3. Special handling for ring opening/closing reactions |
| 199 | + |
| 200 | +**Deadlock Resolution:** When multiple pairs have equal scores, a 15-condition decision tree resolves ties using the priority hierarchy: |
| 201 | + |
| 202 | +``` |
| 203 | +Priority 1: Fewer total bond changes (parsimony) |
| 204 | +Priority 2: Fewer molecular fragments |
| 205 | +Priority 3: Lower bond dissociation energy (thermodynamic feasibility) |
| 206 | +Priority 4: More carbon-carbon bonds preserved |
| 207 | +Priority 5: Fewer stereochemical changes |
| 208 | +``` |
| 209 | + |
| 210 | +--- |
| 211 | + |
| 212 | +### Stage 8: Cross-Algorithm Solution Selection |
| 213 | + |
| 214 | +After all algorithms complete, select the best overall mapping: |
| 215 | + |
| 216 | +**Primary criterion:** Minimum `localScore = totalBondChanges + totalFragmentChanges` |
| 217 | + |
| 218 | +**Tiebreaker hierarchy (first difference wins):** |
| 219 | + |
| 220 | +| Priority | Criterion | Preference | |
| 221 | +|----------|-----------|------------| |
| 222 | +| 1 | Total bond changes | Fewer | |
| 223 | +| 2 | Fragment changes | Fewer | |
| 224 | +| 3 | Bond dissociation energy | Lower | |
| 225 | +| 4 | Carbon bond changes | Fewer | |
| 226 | +| 5 | Stereochemical changes | Fewer | |
| 227 | +| 6 | Smallest fragment size | Larger (fewer small fragments) | |
| 228 | + |
| 229 | +**Early termination:** If any algorithm produces a mapping with `totalBondChanges <= 2` and `fragmentChanges == 0`, accept immediately without evaluating remaining algorithms. |
| 230 | + |
| 231 | +--- |
| 232 | + |
| 233 | +### Stage 9: Bond Change Annotation & Output |
| 234 | + |
| 235 | +From the selected mapping, enumerate all bond changes: |
| 236 | + |
| 237 | +1. **Bond formed:** Bond exists in products but not in reactants (between mapped atoms) |
| 238 | +2. **Bond cleaved:** Bond exists in reactants but not in products (between mapped atoms) |
| 239 | +3. **Bond order changed:** Bond exists on both sides but with different multiplicity (e.g., single to double) |
| 240 | +4. **Stereochemical change:** E/Z or R/S configuration change at a stereogenic centre |
| 241 | + |
| 242 | +**Bond dissociation energy:** |
| 243 | + |
| 244 | +``` |
| 245 | +totalEnergy = SUM over all changed bonds: count(bond) x BDE(atom1, atom2, order) |
| 246 | +``` |
| 247 | + |
| 248 | +where BDE values are from the Luo (2007) bond dissociation energy reference table. |
| 249 | + |
| 250 | +**Output:** Atom-atom mapping as integer labels (1, 2, 3, ...) assigned to corresponding atoms in reactants and products, plus bond change fingerprints classifying the reaction centre. |
| 251 | + |
| 252 | +--- |
| 253 | + |
| 254 | +## 3. Mathematical Formulations |
| 255 | + |
| 256 | +**Jaccard Similarity (Graph Similarity):** |
| 257 | + |
| 258 | +``` |
| 259 | +G(i,j) = |MCS(R_i, P_j)| / (|R_i| + |P_j| - |MCS(R_i, P_j)|) |
| 260 | +``` |
| 261 | + |
| 262 | +**Tanimoto Fingerprint Similarity:** |
| 263 | + |
| 264 | +``` |
| 265 | +T(A,B) = |A AND B| / (|A| + |B| - |A AND B|) |
| 266 | +``` |
| 267 | + |
| 268 | +where |A| = cardinality of fingerprint bitset A. |
| 269 | + |
| 270 | +**Local Score (primary optimization objective):** |
| 271 | + |
| 272 | +``` |
| 273 | +localScore(mapping) = totalBondChanges(mapping) + totalFragmentChanges(mapping) |
| 274 | +``` |
| 275 | + |
| 276 | +**Bond Energy Sum:** |
| 277 | + |
| 278 | +``` |
| 279 | +E(mapping) = SUM_{b in changedBonds} weight(b) x BDE(atom1(b), atom2(b), order(b)) |
| 280 | +``` |
| 281 | + |
| 282 | +**Reagent Filter Atom-Balance Guard:** |
| 283 | + |
| 284 | +``` |
| 285 | +isNeeded(reactant) = EXISTS element e : |
| 286 | + atomCount(allReactants \ reactant, e) < atomCount(products, e) |
| 287 | +``` |
| 288 | + |
| 289 | +--- |
| 290 | + |
| 291 | +## 4. Key Parameters |
| 292 | + |
| 293 | +| Parameter | Value | Description | |
| 294 | +|-----------|-------|-------------| |
| 295 | +| ECFP radius | 2 | Extended Connectivity Fingerprint radius for reagent filtering | |
| 296 | +| FCFP radius | 1 | Functional-Class Fingerprint radius for cache key generation | |
| 297 | +| Fingerprint size | 256 bits | Bit length for circular fingerprints | |
| 298 | +| Path fingerprint depth | 7 | Maximum path length for structural fingerprints | |
| 299 | +| Path fingerprint size | 1024 bits | Bit length for path-based fingerprints | |
| 300 | +| Reagent Tanimoto threshold | 0.4 | Below this, molecule is candidate for reagent classification | |
| 301 | +| Size ratio filter | 0.3 | Minimum atom count ratio for MCS computation | |
| 302 | +| FP similarity filter | 0.05 | Minimum Tanimoto for MCS computation | |
| 303 | +| Substructure timeout | 5,000 ms | VF2++ subgraph isomorphism timeout | |
| 304 | +| RINGS funnel threshold | 95% | Minimum atom coverage to skip remaining algorithms | |
| 305 | +| MCS cache capacity | 10,000 | Maximum entries in cross-reaction MCS cache | |
| 306 | +| Thread pool size | min(cores-1, 4) | Parallel algorithm execution threads | |
| 307 | +| Max iterations | 100 | Maximum matrix optimization iterations per algorithm | |
| 308 | + |
| 309 | +--- |
| 310 | + |
| 311 | +## 5. Benchmark Results |
| 312 | + |
| 313 | +### Golden Dataset (Lin et al. 2022) |
| 314 | + |
| 315 | +1,851 manually curated reactions with expert-validated atom-atom mappings. |
| 316 | + |
| 317 | +| Tool | Exact Match | Atom Accuracy | Training Data | Deterministic | |
| 318 | +|------|-------------|---------------|---------------|---------------| |
| 319 | +| **RDT v3.7.0** | **82.0%** | **96.4%** | **None** | **Yes** | |
| 320 | +| RXNMapper | 83.74% | - | Unsupervised | No | |
| 321 | +| RDTool (published, 2016) | 76.18% | - | None | Yes | |
| 322 | +| ChemAxon | 70.45% | - | Proprietary | Yes | |
| 323 | + |
| 324 | +### Performance Metrics |
| 325 | + |
| 326 | +| Metric | Value | |
| 327 | +|--------|-------| |
| 328 | +| Mapping success rate | 100% (1,851/1,851) | |
| 329 | +| Bond-change detection | 96.9% | |
| 330 | +| Average quality score | 97.3% | |
| 331 | +| Mapping speed | 3.4 reactions/sec | |
| 332 | +| Test suite | 164 tests, 100% pass | |
| 333 | + |
| 334 | +--- |
| 335 | + |
| 336 | +## 6. Dependencies |
| 337 | + |
| 338 | +| Component | Version | Role | |
| 339 | +|-----------|---------|------| |
| 340 | +| SMSD | 6.7.0 | Substructure and MCS engine (VF2++, circular/path fingerprints) | |
| 341 | +| CDK | 2.12 | Cheminformatics toolkit (molecule parsing, atom types, aromaticity) | |
| 342 | +| Java | 21+ | Runtime platform | |
| 343 | + |
| 344 | +--- |
| 345 | + |
| 346 | +## 7. References |
| 347 | + |
| 348 | +1. Rahman SA, Torrance G, Baldacci L, et al. "Reaction Decoder Tool (RDT): Extracting Features from Chemical Reactions." *Bioinformatics* 32(13):2065-2066, 2016. DOI: [10.1093/bioinformatics/btw096](https://doi.org/10.1093/bioinformatics/btw096) |
| 349 | + |
| 350 | +2. Rahman SA, Cuesta S, Furnham N, et al. "EC-BLAST: a tool to automatically search and compare enzyme reactions." *Nature Methods* 11:171-174, 2014. DOI: [10.1038/nmeth.2803](https://doi.org/10.1038/nmeth.2803) |
| 351 | + |
| 352 | +3. Lin A, Dyubankova N, Madzhidov TI, et al. "Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies." *Molecular Informatics* 41(4):e2100138, 2022. DOI: [10.1002/minf.202100138](https://doi.org/10.1002/minf.202100138) |
| 353 | + |
| 354 | +4. Luo YR. "Comprehensive Handbook of Chemical Bond Energies." CRC Press, 2007. |
| 355 | + |
| 356 | +--- |
| 357 | + |
| 358 | +*Reaction Decoder Tool is developed and maintained by BioInception Labs.* |
| 359 | +*Copyright (C) 2003-2026 Syed Asad Rahman. GNU LGPL v3.0.* |
0 commit comments