Training Set Releases¶

This page explains how RetroCast creates the public PaRoutes training releases. It aims to give a compact mental model of the pipeline:

what artifacts we produce
what problem each artifact solves
which functions own each stage
which tradeoffs the current design makes

Historical Context¶

PaRoutes (GitHub) is a landmark, first-in-class effort to curate a large-scale, high-quality dataset of multistep synthesis plans extracted from patent literature. The original paper provided three artifacts: the full set of multistep routes (all-routes) and two carefully constructed test subsets, n1-routes and n5-routes. Unfortunately, the authors did not provide a canonical training split. As a result, later work either declines to adopt PaRoutes as a training/test set (e.g. DESP) or splits it inconsistently:

DirectMultiStep performs a route holdout: it removes the n1 and n5 routes from all-routes
TempRe performs a reaction holdout: it removes every single-step reaction that appears in n1 or n5 from all-routes

Given the recent inclusion of synthesis planning in the list of Grand Challenges in Drug Discovery and the subsequent proposal of synthesis planning as a pretraining objective, one might reasonably expect an influx of new researchers to the field, and so standardization of the training set preparation is a top priority.

RetroCast is an open-source effort, so we do not present this release as the authoritative final word. Instead, we invite scrutiny and feedback from the community.

Why does standardizing the implementation of the split matter?

Separating test routes from all-routes is not as straightforward as it might seem, because the correct procedure depends on how the data is represented. For example, if you represent routes as nested/bigraph dictionaries, you need a serialization strategy before you can exclude routes with a containment check (you can't put a dictionary in a set). Even if you're willing to pay the price of an O(n) comparison, you still need to make sure your equality check is permutation-invariant (a route A + B -> C (+ D) -> E is the same regardless of whether C or D is the left child). DirectMultiStep implemented a generator of all permutations of a route and used a flattening serialization. In RetroCast, we use route signatures (see more below). The point is not that this is an unusually hard algorithmic problem, but that it requires careful consideration, and there's no reason for every model developer to implement it themselves.
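To make the permutation-invariance point concrete, here is a minimal sketch of an order-independent route hash. It is not RetroCast's Route.get_structural_signature(); the route_signature function, the simplified molecule-only tree with "smiles" and "children" keys, and the single-letter molecule labels are all assumptions made for illustration.

```python
import hashlib
import json


def route_signature(node: dict) -> str:
    """Hash a route tree so that the order of sibling children does not matter."""
    # Hash the children first, then sort their hashes: swapping siblings
    # therefore produces the same payload and the same digest.
    child_sigs = sorted(route_signature(child) for child in node.get("children", []))
    payload = json.dumps([node["smiles"], child_sigs])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def leaf(label: str) -> dict:
    """Hypothetical helper: a starting material with no children."""
    return {"smiles": label, "children": []}


# The route A + B -> C, C + D -> E written with two different child orderings.
route_1 = {
    "smiles": "E",
    "children": [{"smiles": "C", "children": [leaf("A"), leaf("B")]}, leaf("D")],
}
route_2 = {
    "smiles": "E",
    "children": [leaf("D"), {"smiles": "C", "children": [leaf("B"), leaf("A")]}],
}

# Signatures are strings, so they can live in a set and support O(1) exclusion.
holdout_signatures = {route_signature(route_1)}
is_held_out = route_signature(route_2) in holdout_signatures
```

Because the signature is a plain string, the holdout check becomes a set lookup instead of a pairwise comparison over every permutation of every route.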

Overview¶

RetroCast produces three PaRoutes-derived artifacts from all-routes.json.gz, n1-routes.json.gz, and n5-routes.json.gz:

route-holdout-n1-n5
reaction-holdout-n1-n5
single-step-reaction-holdout-n1-n5

all-routes is the candidate training universe; n1-routes and n5-routes are the holdout reference sets.

Route Releases¶

Entrypoint:

scripts/paroutes/training-set-prep/01-create-training-release.py

Implementation:

adapt_training_routes() converts raw PaRoutes dictionaries into RetroCast Route objects and records provenance.
TrainingRouteReleaseBuilder applies holdout, deduplication, and split assignment.
write_training_release() writes the final artifact.

The route release has two modes.

route-holdout-n1-n5 removes every candidate route whose Route.get_structural_signature() appears in n1 ∪ n5.

reaction-holdout-n1-n5 first removes exact holdout routes, then removes holdout reactions from surviving routes. If a route still has valid fragments after excision, those fragments remain candidates.
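The excision step can be sketched as follows. This is an illustration under stated assumptions, not the TrainingRouteReleaseBuilder implementation: routes are simplified to molecule-only trees, a reaction is identified by its sorted reactants plus its product, a "valid fragment" is assumed to be any subtree that still contains at least one reaction, and the names reaction_key and excise_holdout_reactions are hypothetical.

```python
def reaction_key(node: dict) -> tuple:
    """Identity of the reaction that makes this node: (sorted reactants, product)."""
    reactants = tuple(sorted(child["smiles"] for child in node["children"]))
    return (reactants, node["smiles"])


def excise_holdout_reactions(route: dict, held_out: set) -> list:
    """Return the fragments (each with >= 1 reaction) left after removing held-out reactions."""
    fragments = []

    def prune(node: dict) -> dict:
        if not node["children"]:
            return {"smiles": node["smiles"], "children": []}
        if reaction_key(node) in held_out:
            # Cut here: each reactant subtree becomes its own candidate fragment,
            # and the product is kept only as a leaf of whatever sits above it.
            for child in node["children"]:
                subtree = prune(child)
                if subtree["children"]:
                    fragments.append(subtree)
            return {"smiles": node["smiles"], "children": []}
        return {"smiles": node["smiles"], "children": [prune(c) for c in node["children"]]}

    top = prune(route)
    if top["children"]:
        fragments.append(top)
    return fragments


def leaf(label: str) -> dict:
    return {"smiles": label, "children": []}


# Route: A + B -> C, then C + D -> E.
route = {
    "smiles": "E",
    "children": [{"smiles": "C", "children": [leaf("A"), leaf("B")]}, leaf("D")],
}

# Holding out the final step leaves the A + B -> C fragment as a candidate.
surviving = excise_holdout_reactions(route, {(("C", "D"), "E")})
```

A route whose every reaction is held out yields an empty list and drops out of the candidate pool entirely.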

During adaptation, RetroCast sanity-checks PaRoutes reaction_hash values against ReactionSignature. A PaRoutes reaction_hash is a reaction SMILES written with InChIKeys, so it should describe the same identity as RetroCast's reactant/product InChIKey signature. If the check passes, reaction_hash is not carried through the release pipeline.

PaRoutes condition slots stay in step metadata. We do not populate structured solvents or reagents fields because the slot is not reliably one or the other; it can also contain material that should have been modeled as a reactant. RetroCast tries to canonicalize the slot into condition_slot_smiles, but keeps the raw condition_slot when parsing fails or when the raw text may still help an end user.

Route Deduplication¶

Routes are deduplicated twice:

exact annotated chemistry: route.get_annotated_signature(include_mapped_smiles=True)
same route structure and conditions, different atom mapping

The second pass groups by route structural signature plus per-step condition identity. Condition identity is condition_slot_smiles when available and condition_slot otherwise.

When mapped variants collapse, the kept mapped profile is chosen by source support, then lexicographic order, then raw route hash. Non-kept mapped reactions are preserved in step metadata as alternative_mapped_smiles.
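The second deduplication pass can be sketched roughly as below. The record layout (structure_sig, route_hash, sources, and steps with mapped_smiles) and the function names are assumptions for illustration, as is the reading of the tie-break: more sources win, then the lexicographically smallest mapped profile, then the smallest raw route hash. The sketch also assumes every variant in a group lists its steps in the same order.

```python
from collections import defaultdict


def condition_identity(step: dict) -> str:
    # Prefer the canonicalized slot; fall back to the raw slot text.
    return step.get("condition_slot_smiles") or step.get("condition_slot") or ""


def mapped_profile(route: dict) -> tuple:
    return tuple(step["mapped_smiles"] for step in route["steps"])


def collapse_mapped_variants(routes: list) -> list:
    """Merge routes that differ only in atom mapping, keeping one mapped profile."""
    groups = defaultdict(list)
    for route in routes:
        conditions = tuple(condition_identity(step) for step in route["steps"])
        groups[(route["structure_sig"], conditions)].append(route)

    released = []
    for variants in groups.values():
        # Rank: source support (descending), mapped profile, raw route hash.
        ranked = sorted(
            variants,
            key=lambda r: (-len(r["sources"]), mapped_profile(r), r["route_hash"]),
        )
        kept, others = ranked[0], ranked[1:]
        steps = []
        for index, step in enumerate(kept["steps"]):
            # Non-kept mappings survive as per-step alternatives.
            alternatives = {o["steps"][index]["mapped_smiles"] for o in others}
            alternatives.discard(step["mapped_smiles"])
            steps.append({**step, "alternative_mapped_smiles": sorted(alternatives)})
        # Provenance of every merged variant stays in sources.
        sources = [source for variant in ranked for source in variant["sources"]]
        released.append({**kept, "steps": steps, "sources": sources})
    return released
```

Grouping on a tuple key keeps the pass linear in the number of routes; only the variants inside one group are ever compared with each other.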

Merged route provenance stays in sources. Released route metadata drops the single-source patent_id and writes source_patent_ids at materialization.

Splits And Files¶

Route releases assign routes to training or validation after holdout and deduplication. The split is stratified by route length and by whether the route is convergent, using val_fraction and seed from config.
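A stratified, seeded assignment of this kind might look like the sketch below. Only val_fraction and seed come from the description above; the assign_splits name, the record keys (id, length, is_convergent), and the rounding rule for the per-stratum validation count are assumptions.

```python
import random
from collections import defaultdict


def assign_splits(routes: list, val_fraction: float, seed: int) -> dict:
    """Map route id -> 'training' or 'validation', stratified by (length, convergence)."""
    strata = defaultdict(list)
    for route in routes:
        strata[(route["length"], route["is_convergent"])].append(route)

    rng = random.Random(seed)  # local generator, so global random state is untouched
    assignment = {}
    for key in sorted(strata):
        # Sort before shuffling so the result does not depend on input order.
        members = sorted(strata[key], key=lambda r: r["id"])
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        for index, route in enumerate(members):
            assignment[route["id"]] = "validation" if index < n_val else "training"
    return assignment
```

Sorting both the strata and their members before shuffling means the same val_fraction and seed always reproduce the same split, regardless of the order in which routes were loaded.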

Each route release writes:

all.jsonl.gz
training.jsonl.gz
validation.jsonl.gz
manifest.json
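For readers who want to consume these files without RetroCast installed, the following sketch reads them with the Python standard library only. The helper names read_jsonl_gz and load_release are hypothetical, and the sketch makes no assumption about which fields each record or the manifest contains.

```python
import gzip
import json
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def load_release(release_dir):
    """Load manifest.json plus the three split files from one release directory."""
    root = Path(release_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    splits = {
        name: list(read_jsonl_gz(root / f"{name}.jsonl.gz"))
        for name in ("all", "training", "validation")
    }
    return manifest, splits
```

For large releases it may be preferable to iterate over read_jsonl_gz directly rather than materializing every split as a list.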

Single-Step Release¶

Entrypoint:

scripts/paroutes/training-set-prep/02-create-single-step-release.py

Implementation:

load_training_route_records() reads the released reaction-holdout-n1-n5 route artifact.
TrainingReactionReleaseBuilder flattens routes into reactions, deduplicates each split, and removes validation reactions that overlap training.
write_training_reaction_release() writes the final artifact.

The single-step release does not re-adapt raw PaRoutes. It derives from the released reaction-holdout-n1-n5 route artifact so the single-step predictor and multistep planner train from compatible data.

Each flattened reaction keeps its reactants, product, mapped SMILES, alternative mapped SMILES, condition metadata, and route-step provenance. Reaction sources store route_id, step_index, and an optional PaRoutes source_id; raw route hashes and patent ids stay in the parent route release.
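The flatten, deduplicate, and de-overlap sequence can be sketched as below. This is not the TrainingReactionReleaseBuilder code: the record keys and function names are hypothetical, and reaction identity is assumed to be sorted reactants plus product only, whereas the real pipeline may also take conditions or mapping into account.

```python
def reaction_key(step: dict) -> tuple:
    return (tuple(sorted(step["reactants"])), step["product"])


def flatten_routes(routes: list) -> dict:
    """Flatten one split into unique reactions, merging provenance of duplicates."""
    unique = {}
    for route in routes:
        for step_index, step in enumerate(route["steps"]):
            entry = unique.setdefault(
                reaction_key(step),
                {
                    "reactants": sorted(step["reactants"]),
                    "product": step["product"],
                    "sources": [],
                },
            )
            # Every occurrence is recorded, so a reaction can be traced to its routes.
            entry["sources"].append({"route_id": route["id"], "step_index": step_index})
    return unique


def build_reaction_splits(training_routes: list, validation_routes: list) -> tuple:
    """Deduplicate each split, then drop validation reactions already seen in training."""
    training = flatten_routes(training_routes)
    validation = {
        key: reaction
        for key, reaction in flatten_routes(validation_routes).items()
        if key not in training
    }
    return list(training.values()), list(validation.values())
```

Removing the overlap from validation rather than from training keeps the training set as large as possible while still making validation a test of unseen reactions.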

The structured jsonl.gz files are canonical. The *.rsmi.txt.gz files are convenience exports.

Audit¶

Run:

scripts/paroutes/training-set-prep/03-audit-release.py

The audit checks release counts, split balance, route/reaction overlap, and metadata expectations.
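As a rough picture of what such checks involve, the sketch below computes overlap and split-balance figures from sets of identity keys (route signatures or reaction keys). It is not the contents of 03-audit-release.py; the audit_release name, its tolerance parameter, and the returned field names are assumptions.

```python
def audit_release(training, validation, holdout, val_fraction, tolerance=0.02):
    """Summarize overlap and split balance for one release, given identity keys."""
    train_keys, val_keys, holdout_keys = set(training), set(validation), set(holdout)
    total = len(train_keys) + len(val_keys)
    observed = len(val_keys) / total if total else 0.0
    return {
        "n_training": len(train_keys),
        "n_validation": len(val_keys),
        # Both counts should be zero for a clean release.
        "train_validation_overlap": len(train_keys & val_keys),
        "holdout_leakage": len((train_keys | val_keys) & holdout_keys),
        "val_fraction_ok": abs(observed - val_fraction) <= tolerance,
    }
```

Returning counts instead of raising on the first problem lets a single run report every failed expectation at once.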

Change Points¶

Route release behavior lives in:

src/retrocast/curation/training/route_release.py
src/retrocast/curation/filtering.py
src/retrocast/curation/training/records.py

Single-step release behavior lives in:

src/retrocast/curation/training/reaction_release.py
src/retrocast/curation/training/records.py

PaRoutes adaptation behavior lives in:

src/retrocast/adapters/paroutes_adapter.py