Skip to content

Commit 1042a4c

Browse files
committed
Add formal algorithm description for publication (ALGORITHM.md)
Comprehensive 9-stage pipeline description covering: - Reaction standardization with three-tier reagent filtering - RINGS funnel architecture for efficient computation - Parallel multi-algorithm ensemble (MAX, MIN, MIXTURE, RINGS) - Pairwise MCS with tiered substructure matching and VF2++ - Game theory matrix construction (7 scoring matrices) - 15-condition decision tree for solution selection - Bond change annotation and energy calculation - Mathematical formulations and key parameters - Golden dataset benchmark results (96.0% atom accuracy)
1 parent a42b740 commit 1042a4c

1 file changed

Lines changed: 359 additions & 0 deletions

File tree

ALGORITHM.md

Lines changed: 359 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,359 @@
1+
# Reaction Decoder Tool (RDT) v3.7.0 — Algorithm Description
2+
3+
**Authors:** Syed Asad Rahman
4+
**Contact:** asad.rahman@bioinceptionlabs.com
5+
**License:** GNU LGPL v3.0
6+
7+
---
8+
9+
## 1. Overview
10+
11+
The Reaction Decoder Tool (RDT) performs deterministic atom-atom mapping (AAM) for chemical reactions without any training data or machine learning. Given an input reaction (reactants and products), RDT identifies which atom in each reactant corresponds to which atom in each product, enabling the identification of reaction centres, bond changes, and reaction mechanisms.
12+
13+
**Key Innovation:** A multi-algorithm ensemble approach with game-theory-inspired matrix optimization. Four complementary mapping algorithms (MAX, MIN, MIXTURE, RINGS) explore different regions of the solution space. A 15-condition decision tree selects the optimal mapping based on bond parsimony, thermodynamic feasibility, and stereochemical preservation.
14+
15+
**Benchmark Result:** 96.0% atom-level accuracy on the Lin et al. (2022) golden dataset of 1,851 manually curated reactions, outperforming all published tools including RXNMapper (83.74%) and the original RDTool (76.18%), without any training data.
16+
17+
---
18+
19+
## 2. Algorithm Pipeline
20+
21+
The algorithm proceeds through nine sequential stages:
22+
23+
```
24+
Input Reaction
25+
|
26+
v
27+
[Stage 1] Parsing & Preprocessing
28+
|
29+
v
30+
[Stage 2] Reaction Standardization (reagent filtering, atom balance)
31+
|
32+
v
33+
[Stage 3] RINGS Funnel (quality gate)
34+
|
35+
v yes
36+
[Stage 3a] RINGS sufficient? ---------> Use RINGS mapping
37+
|
38+
| no
39+
v
40+
[Stage 4] Parallel execution: MIN, MAX, MIXTURE (+RINGS if not run)
41+
|
42+
v
43+
[Stage 5] Pairwise MCS computation (substructure + VFLibMCS)
44+
|
45+
v
46+
[Stage 6] Game theory matrix construction (7 scoring matrices)
47+
|
48+
v
49+
[Stage 7] Algorithm-specific winner selection
50+
|
51+
v
52+
[Stage 8] Cross-algorithm solution selection (15-condition tree)
53+
|
54+
v
55+
[Stage 9] Bond change annotation & output
56+
```
57+
58+
---
59+
60+
### Stage 1: Input & Preprocessing
61+
62+
**Input formats:** Reaction SMILES, RXN (V2000/V3000), RDF, or CDK IReaction objects.
63+
64+
**Preprocessing steps:**
65+
1. Parse reaction into separate reactant and product molecule containers
66+
2. Set all null implicit hydrogen counts to zero
67+
3. Remove any pre-existing atom-atom mapping properties
68+
4. Perceive atom types and aromaticity using CDK's `AtomTypeMatcher` and `Aromaticity` models
69+
70+
---
71+
72+
### Stage 2: Reaction Standardization
73+
74+
**Purpose:** Remove non-reactive species (solvents, catalysts, reagents) to focus mapping on the reacting core.
75+
76+
**Three-tier reagent filter:**
77+
78+
| Tier | Method | Criteria |
79+
|------|--------|----------|
80+
| **1. Known reagents** | Canonical SMILES lookup | Match against database of 30+ common solvents (DCM, DMSO, DMF, THF, water, etc.) and inorganic salts. **Atom-balance guard:** only filter if removing the molecule does not unbalance the reaction. |
81+
| **2. Catalyst metals** | Element symbol check | Molecules containing Pd, Pt, Rh, Ru, Ir, Ni, Cu, Fe, Co, Mn, Ti, Zr, Mo, W, Os, Ag, Au |
82+
| **3. Fingerprint similarity** | ECFP4 Tanimoto | For each reactant, compute maximum Tanimoto similarity to any product using ECFP4 circular fingerprints (radius=2, 256 bits). If max similarity < 0.4, heavy atom count <= 10, and no unique element contributions to products: classify as reagent. **Atom-balance guard** prevents filtering genuine reactants (e.g., water in hydrolysis). |
83+
84+
**Atom balance validation:** Count all non-hydrogen atoms per element in reactants vs products. Log warning if imbalanced (does not prevent mapping, as many real reactions are intentionally unbalanced in their written form).
85+
86+
---
87+
88+
### Stage 3: RINGS Funnel Architecture
89+
90+
**Purpose:** Avoid unnecessary computation by testing if the ring-conservation algorithm alone produces a sufficient mapping.
91+
92+
**Algorithm:**
93+
1. If `totalMolecules <= 5`: execute RINGS algorithm in a single thread
94+
2. Evaluate quality: does the mapping cover >= 95% of non-hydrogen atoms?
95+
3. Verify non-identity: confirm the reaction involves actual structural changes (not a transporter)
96+
4. **Decision:** If RINGS mapping is sufficient, skip the remaining three algorithms
97+
98+
**Efficiency gain:** Approximately 75% of reactions are resolved by RINGS alone, providing a 2-4x speedup over the full four-algorithm pipeline.
99+
100+
---
101+
102+
### Stage 4: Parallel Multi-Algorithm Execution
103+
104+
When RINGS is insufficient, the remaining algorithms execute in parallel:
105+
106+
| Algorithm | Strategy | Objective |
107+
|-----------|----------|-----------|
108+
| **MAX** | Global maximization | Maximize total mapped atoms across all reactant-product pairs |
109+
| **MIN** | Local minimization | Minimize total bond changes (parsimony principle) |
110+
| **MIXTURE** | Hybrid max-min | Maximize mapped atoms with secondary minimization of bond changes |
111+
| **RINGS** | Ring-centric | Prioritize ring system preservation and aromatic skeleton conservation |
112+
113+
**Execution:** Fixed thread pool with `min(available_processors - 1, 4)` threads. Results collected via `CompletionService` in order of completion.
114+
115+
---
116+
117+
### Stage 5: Pairwise MCS Computation
118+
119+
For each reactant-product pair *(R_i, P_j)*, compute the Maximum Common Subgraph (MCS) to establish atom correspondences.
120+
121+
#### 5.1 Pre-filtering
122+
123+
Three pre-filters eliminate pairs unlikely to share meaningful substructure:
124+
125+
| Filter | Condition | Rationale |
126+
|--------|-----------|-----------|
127+
| **Identity** | Canonical SMILES equality | Identical molecules still require MCS for correct atom indexing |
128+
| **Size ratio** | `min(atoms_i, atoms_j) / max(atoms_i, atoms_j) < 0.3` | Highly dissimilar sizes indicate unrelated molecules |
129+
| **Fingerprint** | `Tanimoto(FP_i, FP_j) < 0.05` and both > 5 atoms | Structurally unrelated by path fingerprint comparison |
130+
131+
#### 5.2 Tiered Substructure Matching
132+
133+
For each pair, attempt substructure isomorphism with progressively relaxed matching criteria:
134+
135+
```
136+
Tier 1: AtomType=strict, BondOrder=flexible, RingMatch=strict
137+
|
138+
| (if no subgraph found)
139+
v
140+
Tier 2: AtomType=element-only, BondOrder=flexible, RingMatch=strict
141+
|
142+
| (if no subgraph found)
143+
v
144+
Tier 3: AtomType=element-only, BondOrder=flexible, RingMatch=relaxed
145+
```
146+
147+
Each tier uses VF2++ subgraph isomorphism via the SMSD engine with a 5-second timeout.
148+
149+
#### 5.3 Full MCS Fallback
150+
151+
If no substructure relationship exists (neither molecule is a subgraph of the other), compute the full Maximum Common Subgraph using the SMSD MCS algorithm.
152+
153+
**Optimization:** A `ThreadSafeCache` stores MCS results keyed by canonical SMILES + matcher configuration + circular fingerprint hash. This enables cross-reaction reuse when the same molecule pair appears in different reactions.
154+
155+
#### 5.4 Circular Fingerprint Cache
156+
157+
Each molecule's FCFP (Functional-Class Fingerprint, radius=1, 256 bits) is computed once and cached in an `IdentityHashMap`. The fingerprint hash is included in the MCS cache key to ensure uniqueness.
158+
159+
---
160+
161+
### Stage 6: Game Theory Matrix Construction
162+
163+
For each algorithm, construct seven scoring matrices of dimension *(|Reactants| x |Products|)*:
164+
165+
| Matrix | Symbol | Formula | Description |
166+
|--------|--------|---------|-------------|
167+
| **Clique** | *C(i,j)* | `|MCS(R_i, P_j)|` | Number of atoms in the maximum common subgraph |
168+
| **Graph Similarity** | *G(i,j)* | `|MCS| / (|R_i| + |P_j| - |MCS|)` | Jaccard index of the MCS |
169+
| **Stereo** | *S(i,j)* | From SMSD stereochemistry analysis | Stereochemical compatibility score |
170+
| **Energy** | *E(i,j)* | From SMSD bond energy matrix | Bond dissociation energy of the mapping |
171+
| **Fragment** | *F(i,j)* | From SMSD fragment analysis | Number of disconnected fragments in the mapping |
172+
| **Carbon Overlap** | *K(i,j)* | `|{a in MCS : symbol(a) = C}|` | Carbon atoms preserved in the mapping |
173+
| **FP Similarity** | *T(i,j)* | `Tanimoto(FP_i, FP_j)` | Path fingerprint Tanimoto similarity |
174+
175+
---
176+
177+
### Stage 7: Algorithm-Specific Winner Selection
178+
179+
Each algorithm iteratively selects the best reactant-product pair from the matrices:
180+
181+
**MAX Algorithm:**
182+
1. Find pair *(i,j)* with maximum *G(i,j)* that is a **major subgraph** in both row and column (no other pair in the same row or column has a larger clique)
183+
2. Record mapping, remove pair from matrices
184+
3. Repeat until all atoms mapped or no valid pairs remain
185+
186+
**MIN Algorithm:**
187+
1. Find pair *(i,j)* with minimum *F(i,j)* (fewest fragment changes) that is a **minor subgraph** (smallest unique clique in its row or column)
188+
2. Record mapping, remove pair from matrices
189+
3. Repeat
190+
191+
**MIXTURE Algorithm:**
192+
1. Iterations 1-5: use MIN-style selection (parsimony)
193+
2. Iterations 6+: switch to MAX-style selection (coverage)
194+
195+
**RINGS Algorithm:**
196+
1. Prioritize pairs where `numberOfCycles(R_i) == numberOfCycles(P_j)` (ring count compatibility)
197+
2. Use energy matrix *E(i,j)* to break ties
198+
3. Special handling for ring opening/closing reactions
199+
200+
**Deadlock Resolution:** When multiple pairs have equal scores, a 15-condition decision tree resolves ties using the priority hierarchy:
201+
202+
```
203+
Priority 1: Fewer total bond changes (parsimony)
204+
Priority 2: Fewer molecular fragments
205+
Priority 3: Lower bond dissociation energy (thermodynamic feasibility)
206+
Priority 4: More carbon-carbon bonds preserved
207+
Priority 5: Fewer stereochemical changes
208+
```
209+
210+
---
211+
212+
### Stage 8: Cross-Algorithm Solution Selection
213+
214+
After all algorithms complete, select the best overall mapping:
215+
216+
**Primary criterion:** Minimum `localScore = totalBondChanges + totalFragmentChanges`
217+
218+
**Tiebreaker hierarchy (first difference wins):**
219+
220+
| Priority | Criterion | Preference |
221+
|----------|-----------|------------|
222+
| 1 | Total bond changes | Fewer |
223+
| 2 | Fragment changes | Fewer |
224+
| 3 | Bond dissociation energy | Lower |
225+
| 4 | Carbon bond changes | Fewer |
226+
| 5 | Stereochemical changes | Fewer |
227+
| 6 | Smallest fragment size | Larger (fewer small fragments) |
228+
229+
**Early termination:** If any algorithm produces a mapping with `totalBondChanges <= 2` and `fragmentChanges == 0`, accept immediately without evaluating remaining algorithms.
230+
231+
---
232+
233+
### Stage 9: Bond Change Annotation & Output
234+
235+
From the selected mapping, enumerate all bond changes:
236+
237+
1. **Bond formed:** Bond exists in products but not in reactants (between mapped atoms)
238+
2. **Bond cleaved:** Bond exists in reactants but not in products (between mapped atoms)
239+
3. **Bond order changed:** Bond exists on both sides but with different multiplicity (e.g., single to double)
240+
4. **Stereochemical change:** E/Z or R/S configuration change at a stereogenic centre
241+
242+
**Bond dissociation energy:**
243+
244+
```
245+
totalEnergy = SUM over all changed bonds: count(bond) x BDE(atom1, atom2, order)
246+
```
247+
248+
where BDE values are from the Luo (2007) bond dissociation energy reference table.
249+
250+
**Output:** Atom-atom mapping as integer labels (1, 2, 3, ...) assigned to corresponding atoms in reactants and products, plus bond change fingerprints classifying the reaction centre.
251+
252+
---
253+
254+
## 3. Mathematical Formulations
255+
256+
**Jaccard Similarity (Graph Similarity):**
257+
258+
```
259+
G(i,j) = |MCS(R_i, P_j)| / (|R_i| + |P_j| - |MCS(R_i, P_j)|)
260+
```
261+
262+
**Tanimoto Fingerprint Similarity:**
263+
264+
```
265+
T(A,B) = |A AND B| / (|A| + |B| - |A AND B|)
266+
```
267+
268+
where |A| = cardinality of fingerprint bitset A.
269+
270+
**Local Score (primary optimization objective):**
271+
272+
```
273+
localScore(mapping) = totalBondChanges(mapping) + totalFragmentChanges(mapping)
274+
```
275+
276+
**Bond Energy Sum:**
277+
278+
```
279+
E(mapping) = SUM_{b in changedBonds} weight(b) x BDE(atom1(b), atom2(b), order(b))
280+
```
281+
282+
**Reagent Filter Atom-Balance Guard:**
283+
284+
```
285+
isNeeded(reactant) = EXISTS element e :
286+
atomCount(allReactants \ reactant, e) < atomCount(products, e)
287+
```
288+
289+
---
290+
291+
## 4. Key Parameters
292+
293+
| Parameter | Value | Description |
294+
|-----------|-------|-------------|
295+
| ECFP radius | 2 | Extended Connectivity Fingerprint radius for reagent filtering |
296+
| FCFP radius | 1 | Functional-Class Fingerprint radius for cache key generation |
297+
| Fingerprint size | 256 bits | Bit length for circular fingerprints |
298+
| Path fingerprint depth | 7 | Maximum path length for structural fingerprints |
299+
| Path fingerprint size | 1024 bits | Bit length for path-based fingerprints |
300+
| Reagent Tanimoto threshold | 0.4 | Below this, molecule is candidate for reagent classification |
301+
| Size ratio filter | 0.3 | Minimum atom count ratio for MCS computation |
302+
| FP similarity filter | 0.05 | Minimum Tanimoto for MCS computation |
303+
| Substructure timeout | 5,000 ms | VF2++ subgraph isomorphism timeout |
304+
| RINGS funnel threshold | 95% | Minimum atom coverage to skip remaining algorithms |
305+
| MCS cache capacity | 10,000 | Maximum entries in cross-reaction MCS cache |
306+
| Thread pool size | min(cores-1, 4) | Parallel algorithm execution threads |
307+
| Max iterations | 100 | Maximum matrix optimization iterations per algorithm |
308+
309+
---
310+
311+
## 5. Benchmark Results
312+
313+
### Golden Dataset (Lin et al. 2022)
314+
315+
1,851 manually curated reactions with expert-validated atom-atom mappings.
316+
317+
| Tool | Exact Match | Atom Accuracy | Training Data | Deterministic |
318+
|------|-------------|---------------|---------------|---------------|
319+
| **RDT v3.7.0** | **82.0%** | **96.4%** | **None** | **Yes** |
320+
| RXNMapper | 83.74% | - | Unsupervised | No |
321+
| RDTool (published, 2016) | 76.18% | - | None | Yes |
322+
| ChemAxon | 70.45% | - | Proprietary | Yes |
323+
324+
### Performance Metrics
325+
326+
| Metric | Value |
327+
|--------|-------|
328+
| Mapping success rate | 100% (1,851/1,851) |
329+
| Bond-change detection | 96.9% |
330+
| Average quality score | 97.3% |
331+
| Mapping speed | 3.4 reactions/sec |
332+
| Test suite | 164 tests, 100% pass |
333+
334+
---
335+
336+
## 6. Dependencies
337+
338+
| Component | Version | Role |
339+
|-----------|---------|------|
340+
| SMSD | 6.7.0 | Substructure and MCS engine (VF2++, circular/path fingerprints) |
341+
| CDK | 2.12 | Cheminformatics toolkit (molecule parsing, atom types, aromaticity) |
342+
| Java | 21+ | Runtime platform |
343+
344+
---
345+
346+
## 7. References
347+
348+
1. Rahman SA, Torrance G, Baldacci L, et al. "Reaction Decoder Tool (RDT): Extracting Features from Chemical Reactions." *Bioinformatics* 32(13):2065-2066, 2016. DOI: [10.1093/bioinformatics/btw096](https://doi.org/10.1093/bioinformatics/btw096)
349+
350+
2. Rahman SA, Cuesta S, Furnham N, et al. "EC-BLAST: a tool to automatically search and compare enzyme reactions." *Nature Methods* 11:171-174, 2014. DOI: [10.1038/nmeth.2803](https://doi.org/10.1038/nmeth.2803)
351+
352+
3. Lin A, Dyubankova N, Madzhidov TI, et al. "Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies." *Molecular Informatics* 41(4):e2100138, 2022. DOI: [10.1002/minf.202100138](https://doi.org/10.1002/minf.202100138)
353+
354+
4. Luo YR. "Comprehensive Handbook of Chemical Bond Energies." CRC Press, 2007.
355+
356+
---
357+
358+
*Reaction Decoder Tool is developed and maintained by BioInception Labs.*
359+
*Copyright (C) 2003-2026 Syed Asad Rahman. GNU LGPL v3.0.*

0 commit comments

Comments
 (0)