Concepts and Architecture
RetroCast provides a standardized framework for evaluating retrosynthesis models. It addresses the fragmentation of output formats in the field by decoupling the model's internal representation from the evaluation logic.
Core idea
Adapters translate diverse model outputs into a canonical schema, enabling apples-to-apples comparison across any retrosynthesis algorithm.
The Core Philosophy: Adapters as an Air Gap
Retrosynthesis models produce diverse output formats. Some output bipartite graphs (AiZynthFinder), others output precursor maps (Retro*), and some output recursive dictionaries (DirectMultiStep). Comparing these directly requires writing bespoke evaluation code for every model, which leads to bugs and inconsistent metrics.
RetroCast introduces an adapter layer between the model and the evaluation pipeline:
graph LR
A[Model Output<br/>Native Format] --> B[Adapter<br/>Translation Layer]
B --> C[Canonical Schema<br/>Route Objects]
C --> D[Evaluation Pipeline<br/>Metrics & Analysis]
The flow:
- The Model runs independently and saves output in its native format
- The Adapter reads this format and transforms it into the canonical RetroCast schema
- The Pipeline performs scoring and analysis on canonical objects, unaware of the original format
Why this matters
This architecture ensures that metrics like stock-termination rate and route length are calculated identically for every model.
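The adapter idea can be sketched in a few lines. The two input formats below (a nested dictionary with `smiles` and `children` keys, and a product-to-reactants precursor map) are hypothetical simplifications, and the function names and the plain-dict canonical form are illustrative assumptions rather than RetroCast's actual adapter API.

```python
from typing import Any


def adapt_recursive_dict(node: dict[str, Any]) -> dict[str, Any]:
    """Adapter for a hypothetical nested-dict output ({'smiles', 'children'})."""
    molecule: dict[str, Any] = {"smiles": node["smiles"], "synthesis_step": None}
    children = node.get("children", [])
    if children:
        molecule["synthesis_step"] = {
            "reactants": [adapt_recursive_dict(child) for child in children]
        }
    return molecule


def adapt_precursor_map(target: str, precursors: dict[str, list[str]]) -> dict[str, Any]:
    """Adapter for a hypothetical precursor map (product SMILES -> reactant SMILES)."""
    molecule: dict[str, Any] = {"smiles": target, "synthesis_step": None}
    reactants = precursors.get(target, [])
    if reactants:
        molecule["synthesis_step"] = {
            "reactants": [adapt_precursor_map(r, precursors) for r in reactants]
        }
    return molecule


def count_reactions(molecule: dict[str, Any]) -> int:
    """A format-agnostic metric: it only ever sees the canonical tree."""
    step = molecule["synthesis_step"]
    if step is None:
        return 0
    return 1 + sum(count_reactions(r) for r in step["reactants"])


# The same one-step route expressed in two different native formats.
nested = {
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [{"smiles": "Nc1ccccc1"}, {"smiles": "CC(=O)Cl"}],
}
pmap = {"CC(=O)Nc1ccccc1": ["Nc1ccccc1", "CC(=O)Cl"]}

route_a = adapt_recursive_dict(nested)
route_b = adapt_precursor_map("CC(=O)Nc1ccccc1", pmap)
```

Both adapters yield the same canonical tree, so `count_reactions` needs no knowledge of where a route came from.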
The Canonical Data Model
RetroCast defines a strict, recursive object model in retrocast.models.chem. Structurally, this model is a directed acyclic bipartite graph consisting of alternating molecule nodes and reaction nodes.
A Resolved AND/OR Tree
Many retrosynthesis frameworks (e.g., Syntheseus, AiZynthFinder) utilize an AND/OR graph to represent the entire search space. In these graphs, a Molecule node (OR) may have multiple child Reaction nodes (AND), representing competing choices.
The RetroCast Route object represents a resolved instance of this graph: a single, specific pathway, with a Molecule node having at most one child Reaction node.
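To make the distinction concrete, the sketch below enumerates every resolved tree contained in a small AND/OR graph. The graph encoding (a dictionary from molecule to its alternative reactions) and the nested-tuple output are assumptions chosen for brevity and are not RetroCast data structures. The sketch also assumes the graph is acyclic and that any molecule with at least one reaction is always expanded.

```python
from itertools import product
from typing import Iterator

# Hypothetical AND/OR graph: molecule -> list of alternative reactions,
# where each reaction is the tuple of its reactant molecules.
AndOrGraph = dict[str, list[tuple[str, ...]]]


def resolve(mol: str, graph: AndOrGraph) -> Iterator[tuple]:
    """Yield every resolved tree rooted at mol as (mol, subtrees).

    Leaves (molecules with no reactions in the graph) have empty subtrees.
    """
    reactions = graph.get(mol, [])
    if not reactions:
        yield (mol, ())
        return
    for reactants in reactions:  # OR: choose exactly one reaction
        # AND: every reactant of the chosen reaction must itself be resolved
        options = [list(resolve(r, graph)) for r in reactants]
        for subtrees in product(*options):
            yield (mol, subtrees)


graph: AndOrGraph = {
    "T": [("A", "B"), ("C",)],  # two competing reactions for the target
    "A": [("D",), ("E",)],      # two competing reactions for intermediate A
}
routes = list(resolve("T", graph))
```

Each element of `routes` corresponds to what a single Route would hold: one pathway in which every molecule has at most one child reaction.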
Schema Definition
The schema enforces the minimal information required for rigorous evaluation while allowing for extensibility via metadata dictionaries.
class Route(BaseModel):
    """The root container for a single prediction."""
    target: Molecule
    rank: int  # Model's preference order (populated by adapter)
    # Provenance
    metadata: dict[str, Any]
    retrocast_version: str
    # Computed Properties
    @property
    def length(self) -> int: ...  # Longest path from target to any leaf
    @property
    def leaves(self) -> set[Molecule]: ...  # All starting materials in the route
    @property
    def signature(self) -> str: ...  # Cryptographic hash for deduplication
class Molecule(BaseModel):
    """Represents a chemical node (OR node)."""
    # Core Identity
    smiles: SmilesStr
    inchikey: InchiKeyStr  # Primary ID for hashing/equality
    # Tree Structure: 0 or 1 reaction step
    synthesis_step: ReactionStep | None  # None for leaf molecules (starting materials)
    # Extensibility
    metadata: dict[str, Any]
    # Computed Properties
    @property
    def is_leaf(self) -> bool: ...
class ReactionStep(BaseModel):
    """Represents a reaction node (AND node)."""
    # Tree Structure: N reactant children
    reactants: list[Molecule]  # Must have ≥1 reactant molecules
    # Chemical Details (Optional)
    mapped_smiles: str | None
    template: str | None
    reagents: list[str] | None
    solvents: list[str] | None
    # Extensibility
    metadata: dict[str, Any]
    # Computed Properties
    @property
    def is_convergent(self) -> bool: ...  # Returns True if ≥2 reactants are non-leaves
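The computed properties above are declared but not shown. One way they could be implemented is sketched below using plain dataclasses instead of Pydantic models. This is an illustration under stated assumptions, not the library's code: molecules are identified by SMILES rather than InChIKey, length is taken to be the number of reaction steps on the longest target-to-leaf path, and the signature is a SHA256 digest of an order-independent string encoding of the tree.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Molecule:
    smiles: str
    synthesis_step: Optional["ReactionStep"] = None

    @property
    def is_leaf(self) -> bool:
        return self.synthesis_step is None


@dataclass
class ReactionStep:
    reactants: list[Molecule]

    @property
    def is_convergent(self) -> bool:
        # Convergent when at least two reactants are themselves synthesized.
        return sum(not r.is_leaf for r in self.reactants) >= 2


def route_length(mol: Molecule) -> int:
    """Number of reaction steps on the longest path from mol to any leaf."""
    if mol.synthesis_step is None:
        return 0
    return 1 + max(route_length(r) for r in mol.synthesis_step.reactants)


def route_leaves(mol: Molecule) -> set[str]:
    """SMILES of all starting materials below mol."""
    if mol.synthesis_step is None:
        return {mol.smiles}
    return set().union(*(route_leaves(r) for r in mol.synthesis_step.reactants))


def route_signature(mol: Molecule) -> str:
    """SHA256 of a canonical encoding that ignores reactant order (assumed scheme)."""
    def encode(m: Molecule) -> str:
        if m.synthesis_step is None:
            return m.smiles
        parts = sorted(encode(r) for r in m.synthesis_step.reactants)
        return m.smiles + "<-(" + ",".join(parts) + ")"
    return hashlib.sha256(encode(mol).encode("utf-8")).hexdigest()


# Placeholder identifiers stand in for real SMILES strings.
a = Molecule("A", ReactionStep([Molecule("D")]))
b = Molecule("B", ReactionStep([Molecule("E")]))
target = Molecule("T", ReactionStep([a, b]))
```

Sorting the child encodings before hashing means two routes that differ only in reactant order receive the same signature, which is the property deduplication relies on.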
Rationale: Why Bipartite?
The bipartite structure, which explicitly separates Molecule and ReactionStep, is a natural representation for multistep routes and allows each piece of data to be attributed to the right node:
- Molecule properties (e.g., "is purchasable", molecular weight) belong to the Molecule node
- Reaction properties (e.g., template scores, probability, patent IDs) belong to the ReactionStep node
Interchange Format
You don't need to change your model
Model developers are not required to use this schema internally. RetroCast treats it as an interchange format: the adapter casts your native output into this structure. Extra data (attention weights, search trees, etc.) can be preserved in metadata dictionaries if you need them for downstream analysis.
Data Organization and Lifecycle
RetroCast enforces a structured directory layout to manage the transformation of data from raw predictions to final statistics. This structure ensures reproducibility and traceability.
graph TD
A[1-benchmarks<br/>Definitions & Stocks] --> B[2-raw<br/>Model Output]
B --> C[3-processed<br/>Route Objects]
C --> D[4-scored<br/>Evaluated Routes]
D --> E[5-results<br/>Statistics & Reports]
B -.->|retrocast ingest| C
C -.->|retrocast score| D
D -.->|retrocast analyze| E
1. Benchmarks (data/1-benchmarks)
Immutable evaluation task definitions.
- definitions/: Gzipped JSON files defining targets (IDs and SMILES)
- stocks/: Text files with available building blocks (one SMILES per line)
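Both file types can be read with the standard library alone. The helpers below are a sketch: the function names are hypothetical, and because the internal layout of a definition file is not specified here, `load_definition` simply returns whatever JSON value the file contains.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Union


def load_stock(path: Union[str, Path]) -> set[str]:
    """Read a stock file with one SMILES per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def load_definition(path: Union[str, Path]) -> Any:
    """Read a gzipped JSON file; the structure of the payload is not assumed."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Loading the stock into a set keeps membership checks cheap when many route leaves are tested against it.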
2. Raw Data (data/2-raw)
Read-only artifacts generated by models.
- Structure: data/2-raw/<model>/<benchmark>/<filename>
3. Processed Data (data/3-processed)
Generated by: retrocast ingest
- Format: Dictionary mapping target_id → list of Route objects
- Operations: Deduplication, optional sampling (keep first n routes)
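The two operations can be illustrated with a small helper. This is a sketch only: the function name is hypothetical, routes are represented as plain dictionaries, and applying deduplication before sampling is an assumption about ordering that the text above does not state.

```python
from typing import Callable, Hashable, Optional, TypeVar

T = TypeVar("T")


def dedupe_and_sample(
    routes: list[T],
    key: Callable[[T], Hashable],
    n: Optional[int] = None,
) -> list[T]:
    """Drop routes whose key was already seen, then keep the first n survivors.

    Input order (the model's ranking) is preserved.
    """
    seen: set[Hashable] = set()
    unique: list[T] = []
    for route in routes:
        k = key(route)
        if k in seen:
            continue
        seen.add(k)
        unique.append(route)
    return unique if n is None else unique[:n]


# Hypothetical routes carrying a precomputed signature.
predictions = [
    {"rank": 1, "signature": "aaa"},
    {"rank": 2, "signature": "bbb"},
    {"rank": 3, "signature": "aaa"},  # duplicate of rank 1
    {"rank": 4, "signature": "ccc"},
]
kept = dedupe_and_sample(predictions, key=lambda r: r["signature"], n=2)
```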
4. Scored Data (data/4-scored)
Generated by: retrocast score
- Metrics: Routes annotated with boolean flags (e.g., is_solved) and ground truth comparisons
- Independence: Same routes can be scored against multiple stocks without re-processing
5. Results (data/5-results)
Generated by: retrocast analyze
- Statistics: Bootstrap confidence intervals for metrics like Top-K accuracy
- Artifacts: JSON statistics, Markdown reports, HTML visualizations
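As an illustration of the statistics step, the sketch below computes Top-K accuracy and a percentile bootstrap confidence interval by resampling targets with replacement. The percentile method, the resample count, and the function names are assumptions made for this example; the interval method RetroCast actually uses is not specified here. The input is assumed to be non-empty.

```python
import random


def top_k_hit(solved_flags: list[bool], k: int) -> bool:
    """True if any of the first k ranked routes for a target is solved."""
    return any(solved_flags[:k])


def bootstrap_ci(
    outcomes: list[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for the mean of outcomes."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Hypothetical per-target is_solved flags, ordered by route rank.
per_target = {
    "t1": [False, True],
    "t2": [False, False],
    "t3": [True],
    "t4": [False, False, True],
}
hits = [top_k_hit(flags, k=2) for flags in per_target.values()]
point, lower, upper = bootstrap_ci(hits)
```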
Provenance and Verification
Reproducibility is a primary design goal. RetroCast tracks the lineage of every data artifact using Manifests.
Cryptographic audit trail
Every generated file (e.g., routes.json.gz) has a companion manifest (routes.manifest.json) containing SHA256 hashes of inputs and outputs.
A manifest records:
- Action: The command or script that generated the file
- Inputs: Paths and SHA256 hashes of all source files
- Parameters: Configuration arguments (stock name, random seed, etc.)
- Outputs: Paths and hashes of generated files
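A manifest with these four parts could be assembled as sketched below. The JSON field names (`action`, `inputs`, `parameters`, `outputs`, `path`, `sha256`) and both helper functions are hypothetical choices for illustration, not the actual manifest layout.

```python
import hashlib
from pathlib import Path
from typing import Any


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    action: str,
    inputs: list[Path],
    outputs: list[Path],
    parameters: dict[str, Any],
) -> dict[str, Any]:
    """Collect the action, parameters, and path/hash pairs into one record."""
    def describe(paths: list[Path]) -> list[dict[str, str]]:
        return [{"path": str(p), "sha256": sha256_of(p)} for p in paths]

    return {
        "action": action,
        "inputs": describe(inputs),
        "parameters": parameters,
        "outputs": describe(outputs),
    }
```

The resulting dictionary can be written next to the artifact with `json.dumps(manifest, indent=2)`.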
Verification
The retrocast verify command audits the data pipeline with a two-phase check:
Verification phases
- Phase 1: Logical Consistency. Ensures the input hash in a child manifest matches the output hash of the parent manifest
- Phase 2: Physical Integrity. Ensures the file on disk matches the hash recorded in its manifest
This system detects:
- Data corruption
- Manual tampering
- Out-of-order execution steps
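The two phases can be sketched as two small checks. The manifest layout used here (lists of `{"path", "sha256"}` entries under `inputs` and `outputs`) and the function names are assumptions for illustration; this is not the implementation behind `retrocast verify`.

```python
import hashlib
from pathlib import Path
from typing import Any


def sha256_of(path: Path) -> str:
    """Hash a file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_lineage(parent: dict[str, Any], child: dict[str, Any]) -> list[str]:
    """Phase 1: child input hashes must match the parent's recorded output hashes."""
    produced = {o["path"]: o["sha256"] for o in parent["outputs"]}
    return [
        f"lineage mismatch: {i['path']}"
        for i in child["inputs"]
        if i["path"] in produced and produced[i["path"]] != i["sha256"]
    ]


def check_files(manifest: dict[str, Any]) -> list[str]:
    """Phase 2: files on disk must match the hashes recorded in the manifest."""
    problems = []
    for entry in manifest["outputs"]:
        path = Path(entry["path"])
        if not path.exists():
            problems.append(f"missing file: {path}")
        elif sha256_of(path) != entry["sha256"]:
            problems.append(f"content changed: {path}")
    return problems
```

A lineage mismatch points to a stale or out-of-order step, while a content mismatch points to corruption or manual edits after the manifest was written.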