Training Sets¶
RetroCast publishes PaRoutes training artifacts as hosted, versioned datasets. The official Python workflow is centered on retrocast.datasets. Use retrocast get-training-data when you need a download path outside Python.
Requires the hosted training-set API
This guide assumes a RetroCast build that includes
retrocast.datasets.download_training_set. That surface landed after v0.5.3
in commit 2c6387a.
If your install does not expose retrocast.datasets, upgrade before
running the examples on this page.
Quick Start¶
Pick one of two workflows:
Managed cache: use this when you do not care where the files live beyond “some verified local cache”.
python:
from retrocast.datasets import download_training_set
from retrocast.io import iter_training_routes

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
)

for route in iter_training_routes(path):
    ...
cli:
resulting local layout:
Explicit output directory: use this when you want to choose the dataset root yourself. Pass a dataset-specific directory such as data/datasets/paroutes, and RetroCast will place releases directly under it.
python:
from pathlib import Path

from retrocast.datasets import download_training_set
from retrocast.io import iter_training_routes

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
    output_dir=Path("data/datasets/paroutes"),
)

for route in iter_training_routes(path):
    ...
cli:
resulting local layout:
One-step reaction training uses the same flow with artifact="single-step-reaction-holdout-n1-n5" or artifact="single-step-route-holdout-n1-n5". By default, that downloads the canonical jsonl artifact. Pass format="rsmi" if you specifically want the plain reaction-smiles text file.
The original PaRoutes n1/n5 test sets are also published as all-only artifacts: n1-routes, n5-routes, n1-single-step-reactions, and n5-single-step-reactions. Use split="all" for those.
When a real download happens in an interactive terminal, RetroCast shows a progress bar automatically. Pass show_progress=False to suppress it or show_progress=True to force it.
Public Imports¶
The stable public import path for training-set download helpers is retrocast.datasets:
Artifact Matrix¶
Use reaction-holdout-n1-n5 unless you specifically need a route-holdout baseline.
| artifact | intended training target | holdout rule | valid format values | files published per split |
|---|---|---|---|---|
| n1-routes | test set/evaluation routes | original PaRoutes n1 routes adapted to RetroCast Route records | jsonl | all.jsonl.gz |
| n5-routes | test set/evaluation routes | original PaRoutes n5 routes adapted to RetroCast Route records | jsonl | all.jsonl.gz |
| route-holdout-n1-n5 | multistep route models | remove exact n1 ∪ n5 routes | jsonl | all.jsonl.gz, training.jsonl.gz, validation.jsonl.gz |
| reaction-holdout-n1-n5 | multistep route models | remove exact holdout routes, then excise holdout reactions | jsonl | all.jsonl.gz, training.jsonl.gz, validation.jsonl.gz |
| n1-single-step-reactions | test set/evaluation reactions | flatten original n1 routes, preserving route-step occurrences | jsonl, rsmi | all.jsonl.gz, all.rsmi.txt.gz |
| n5-single-step-reactions | test set/evaluation reactions | flatten original n5 routes, preserving route-step occurrences | jsonl, rsmi | all.jsonl.gz, all.rsmi.txt.gz |
| single-step-route-holdout-n1-n5 | one-step reaction models | flatten route-holdout-n1-n5 routes into deduplicated reactions; cross-split reaction overlap is reported, not removed | jsonl, rsmi | all.jsonl.gz, training.jsonl.gz, validation.jsonl.gz, all.rsmi.txt.gz, training.rsmi.txt.gz, validation.rsmi.txt.gz |
| single-step-reaction-holdout-n1-n5 | one-step reaction models | flatten reaction-holdout-n1-n5 routes into deduplicated reactions | jsonl, rsmi | all.jsonl.gz, training.jsonl.gz, validation.jsonl.gz, all.rsmi.txt.gz, training.rsmi.txt.gz, validation.rsmi.txt.gz |
Each artifact directory includes:
- manifest.json

Each release directory includes:
- SHA256SUMS
Python API¶
download_training_set() returns the verified local artifact path. It also materializes the sibling manifest.json and the release-level SHA256SUMS. The format argument selects which wire file you want:
- jsonl -> structured JSONL artifact
- rsmi -> plain reaction-smiles text artifact

Use jsonl unless you explicitly want the single-step reaction-smiles text projection. Route artifacts only support jsonl.
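The artifact, split, and format rules can be checked before any download is attempted. The following is a minimal sketch that transcribes the artifact matrix on this page into a lookup table; `ARTIFACTS`, `SUFFIXES`, and `check_request` are hypothetical names for illustration and are not part of the retrocast API.

```python
# Hypothetical pre-flight check transcribed from the artifact matrix on this page.
# None of these names exist in retrocast; they only illustrate the documented rules.
ALL_ONLY = ("all",)
ALL_SPLITS = ("all", "training", "validation")

# artifact name -> (valid split values, valid format values)
ARTIFACTS = {
    "n1-routes": (ALL_ONLY, ("jsonl",)),
    "n5-routes": (ALL_ONLY, ("jsonl",)),
    "route-holdout-n1-n5": (ALL_SPLITS, ("jsonl",)),
    "reaction-holdout-n1-n5": (ALL_SPLITS, ("jsonl",)),
    "n1-single-step-reactions": (ALL_ONLY, ("jsonl", "rsmi")),
    "n5-single-step-reactions": (ALL_ONLY, ("jsonl", "rsmi")),
    "single-step-route-holdout-n1-n5": (ALL_SPLITS, ("jsonl", "rsmi")),
    "single-step-reaction-holdout-n1-n5": (ALL_SPLITS, ("jsonl", "rsmi")),
}

# format value -> file suffix used by the published files
SUFFIXES = {"jsonl": "jsonl.gz", "rsmi": "rsmi.txt.gz"}


def check_request(artifact: str, split: str, fmt: str = "jsonl") -> str:
    """Return the expected file name, or raise ValueError for an invalid combination."""
    if artifact not in ARTIFACTS:
        raise ValueError(f"unknown artifact: {artifact!r}")
    splits, formats = ARTIFACTS[artifact]
    if split not in splits:
        raise ValueError(f"{artifact!r} does not publish split {split!r}")
    if fmt not in formats:
        raise ValueError(f"{artifact!r} does not support format {fmt!r}")
    return f"{split}.{SUFFIXES[fmt]}"
```

For example, `check_request("reaction-holdout-n1-n5", "training")` yields `training.jsonl.gz`, while asking a route artifact for `rsmi` raises `ValueError`.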
Streaming¶
Use retrocast.io when you want to stream a verified local artifact without loading the full file into memory:
from retrocast.datasets import download_training_set
from retrocast.io import iter_training_reaction_records

path = download_training_set(
    "paroutes",
    artifact="single-step-reaction-holdout-n1-n5",
    split="training",
)

for record in iter_training_reaction_records(path):
    ...
Available local streaming helpers:
- retrocast.io.iter_training_routes(path)
- retrocast.io.iter_training_route_records(path)
- retrocast.io.iter_training_reaction_records(path)
- retrocast.io.iter_training_reaction_smiles(path)
Available local eager helpers:
- retrocast.io.load_training_routes(path)
- retrocast.io.load_training_route_records(path)
- retrocast.io.load_training_reaction_records(path)
- retrocast.io.load_training_reaction_smiles(path)
The intended division of responsibilities is:
- retrocast.datasets resolves releases, downloads artifacts, and verifies checksums
- retrocast.io parses eager or streaming views from a local path
That keeps the local artifact path explicit, which is usually useful in real training pipelines.
Local Metadata¶
download_training_set() returns the verified local Path:
from retrocast.datasets import download_training_set

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
)
The sibling artifact manifest and release checksum file are always there:
from retrocast.datasets import download_training_set

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
)

manifest_path = path.parent / "manifest.json"
checksums_path = path.parent.parent / "SHA256SUMS"

print(path)
print(manifest_path)
print(checksums_path)
That is usually enough for downstream training pipelines:
- keep path as the canonical downloaded artifact
- inspect manifest.json later for release provenance and build metadata
- compare a local sha256 against SHA256SUMS if you need an explicit audit step
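The explicit audit step can be done with the standard library alone. The sketch below assumes that SHA256SUMS follows the common `sha256sum` layout, one `<hex digest>  <relative path>` entry per line with paths relative to the release directory; this page does not specify the file format, so verify that assumption against a real release. The helper names `sha256_of`, `read_checksums`, and `audit_artifact` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_checksums(checksums_path: Path) -> dict[str, str]:
    """Parse '<hex digest>  <relative path>' lines (ASSUMED sha256sum-style format)."""
    entries: dict[str, str] = {}
    for line in checksums_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        entries[name.lstrip("*")] = digest.lower()
    return entries


def audit_artifact(path: Path, checksums_path: Path) -> bool:
    """Return True when the local file hash matches its SHA256SUMS entry."""
    # ASSUMPTION: entries are keyed by the path relative to the release directory.
    key = path.relative_to(checksums_path.parent).as_posix()
    expected = read_checksums(checksums_path).get(key)
    return expected is not None and expected == sha256_of(path)
```

With the variables from the example above, `audit_artifact(path, checksums_path)` would perform the comparison. A `False` result means either a mismatch or a missing entry, so inspect the file if the assumed layout does not hold.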
Release Resolution¶
For the common case, use resolve_latest_training_set_release():
from retrocast.datasets import resolve_latest_training_set_release
release = resolve_latest_training_set_release("paroutes")
resolve_training_set_release() is still available when you want to resolve a specific label yourself:
from retrocast.datasets import resolve_training_set_release
release = resolve_training_set_release(dataset="paroutes", release="latest")
Explicit Output Directories¶
Use output_dir=... when you want a project-owned dataset root instead of the managed cache. This path is treated as the root for one dataset, so releases land directly under it with no extra retrocast/training-sets/<dataset> scaffolding:
from pathlib import Path

from retrocast.datasets import download_training_set

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
    output_dir=Path("data/datasets/paroutes"),
)
Canonical Wire Examples¶
These are the public on-disk wire formats published by the hosted training-set artifacts.
route_records¶
One row from training.jsonl.gz for a route artifact:
{
  "id": "paroutes-reaction-holdout-n1-n5-000001",
  "split": "training",
  "source": {
    "dataset": "all-routes",
    "raw_indices": [0],
    "raw_route_hashes": ["route-hash-1"],
    "patent_ids": ["patent-1"]
  },
  "route": {
    "target": {
      "smiles": "cc",
      "inchikey": "CCCCCCCCCCCCCC-DDDDDDDDDD-N",
      "synthesis_step": {
        "reactants": [
          {
            "smiles": "c",
            "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
            "synthesis_step": null,
            "metadata": {},
            "is_leaf": true
          }
        ],
        "mapped_smiles": null,
        "template": null,
        "reagents": null,
        "solvents": null,
        "metadata": {},
        "is_convergent": false
      },
      "metadata": {},
      "is_leaf": false
    },
    "rank": 1,
    "metadata": {},
    "retrocast_version": "0.5.4.dev11",
    "length": 1,
    "leaves": [
      {
        "smiles": "c",
        "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
        "synthesis_step": null,
        "metadata": {},
        "is_leaf": true
      }
    ],
    "has_convergent_reaction": false,
    "content_hash": "2de9f081bc2f5f85de358ecae045a35744de7ad89effe42a5c177c7d4dda5478",
    "signature": "d4fd40da3b1046814438c1c24cc56331747af3ed578ec8244d57b8793a45c1b6"
  }
}
routes¶
routes is the nested route object from a route_records row, validated as a Route. The hosted file is still the same *.jsonl.gz artifact shown above.
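To make the relationship between the two views concrete, here is a standard-library sketch that streams a `*.jsonl.gz` route artifact and yields only the nested `route` object of each row. It is not how retrocast.io is implemented: it yields plain dictionaries with no `Route` validation, and the names `iter_jsonl_gz` and `iter_route_dicts` are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl_gz(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def iter_route_dicts(path: Path) -> Iterator[dict[str, Any]]:
    """Yield the nested "route" object of each route_records row, unvalidated."""
    for row in iter_jsonl_gz(path):
        yield row["route"]
```

This is mainly useful for quick inspection, for example counting rows or reading `route["length"]`. For training code, the retrocast.io helpers listed earlier remain the supported way to read these files.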
reaction_records¶
One row from training.jsonl.gz for the single-step artifact:
{
  "id": "paroutes-rxn-000001",
  "split": "training",
  "reactants": ["c"],
  "product": "cc",
  "mapped_smiles": "c>o>cc",
  "alternative_mapped_smiles": [],
  "condition_slot": "o",
  "condition_slot_smiles": ["o"],
  "sources": [
    {
      "route_id": "paroutes-reaction-holdout-n1-n5-000001",
      "step_index": 0,
      "source_id": null
    }
  ]
}
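The `sources` list links each deduplicated reaction back to the route steps it came from. As an illustrative sketch, the function below groups reaction records by source route; it works on plain dictionaries shaped like the row above, and `reactions_by_route` is a hypothetical helper, not a retrocast function.

```python
from collections import defaultdict
from typing import Any, Iterable


def reactions_by_route(records: Iterable[dict[str, Any]]) -> dict[str, list[tuple[int, str]]]:
    """Map each source route_id to its (step_index, reaction id) pairs, sorted by step."""
    index: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for record in records:
        # A deduplicated reaction may list several sources, one per route-step occurrence.
        for source in record.get("sources", []):
            index[source["route_id"]].append((source["step_index"], record["id"]))
    return {route_id: sorted(steps) for route_id, steps in index.items()}
```

Any iterable of record dictionaries works as input, so the function can consume a streaming reader without loading the whole file first.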
reaction_smiles¶
One line from training.rsmi.txt.gz:
Local Layout¶
Default cache root:
Shared cache layout:
~/.cache/retrocast/training-sets/paroutes/<release>/SHA256SUMS
~/.cache/retrocast/training-sets/paroutes/<release>/<artifact>/manifest.json
~/.cache/retrocast/training-sets/paroutes/<release>/<artifact>/<file>
Concrete example:
~/.cache/retrocast/training-sets/paroutes/v2026-05-12/SHA256SUMS
~/.cache/retrocast/training-sets/paroutes/v2026-05-12/reaction-holdout-n1-n5/manifest.json
~/.cache/retrocast/training-sets/paroutes/v2026-05-12/reaction-holdout-n1-n5/training.jsonl.gz
When you pass output_dir=Path("data/datasets/paroutes"), the resulting layout is:
data/datasets/paroutes/v2026-05-12/SHA256SUMS
data/datasets/paroutes/v2026-05-12/reaction-holdout-n1-n5/manifest.json
data/datasets/paroutes/v2026-05-12/reaction-holdout-n1-n5/training.jsonl.gz
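Both layouts follow the same `<dataset root>/<release>/<artifact>/<file>` pattern, which the following sketch spells out. `expected_layout` is a hypothetical helper for illustration; it only joins path components and does not touch the filesystem or resolve releases.

```python
from pathlib import Path


def expected_layout(dataset_root: Path, release: str, artifact: str, filename: str) -> dict[str, Path]:
    """Build the three paths documented above for one downloaded artifact file."""
    release_dir = dataset_root / release
    artifact_dir = release_dir / artifact
    return {
        "checksums": release_dir / "SHA256SUMS",      # one per release
        "manifest": artifact_dir / "manifest.json",   # one per artifact
        "artifact": artifact_dir / filename,          # the requested split file
    }


layout = expected_layout(
    Path("data/datasets/paroutes"),
    release="v2026-05-12",
    artifact="reaction-holdout-n1-n5",
    filename="training.jsonl.gz",
)
```

For the shared cache, the same function applies with the dataset root set to the `paroutes` directory shown in the shared cache layout.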
Override the shared cache with:
- RETROCAST_CACHE_DIR for both the CLI and Python
- cache_dir=... in Python for per-call control
Use cache_dir when you want to relocate the shared cache root but still keep the managed paroutes/<release>/... structure. Use output_dir when you want to own the dataset root yourself.
CLI API¶
The CLI mirrors the Python surface for release downloads.
Download route JSONL:
Download Route JSONL for a pinned release:
Download single-step mapped reaction SMILES:
Materialize into a project directory instead of the default cache:
Conditions¶
PaRoutes condition slots remain metadata, not structured solvents or reagents, because the source labels are not trustworthy enough for that. A slot may be a solvent, reagent, mixed bag, or even something that should have been modeled as a reactant.
Single-step records expose both:
- condition_slot: raw PaRoutes text
- condition_slot_smiles: best-effort canonicalized SMILES tokens
Use condition_slot_smiles when present. Keep condition_slot when you want the original raw signal.
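One way to apply that preference in a data loader is sketched below. It operates on a plain record dictionary and returns a label with the tokens so downstream code can tell canonicalized SMILES from raw text. Falling back to the raw slot is a design choice of this sketch, not documented RetroCast behavior, and `condition_tokens` is a hypothetical name.

```python
from typing import Any


def condition_tokens(record: dict[str, Any]) -> tuple[list[str], str]:
    """Return (tokens, origin) where origin is "smiles", "raw", or "none"."""
    smiles = record.get("condition_slot_smiles") or []
    if smiles:
        # Preferred: best-effort canonicalized SMILES tokens.
        return list(smiles), "smiles"
    raw = record.get("condition_slot")
    if raw:
        # Fallback (assumption of this sketch): keep the raw PaRoutes text as one token.
        return [raw], "raw"
    return [], "none"
```

Because a slot may hold a solvent, a reagent, or a misplaced reactant, treat the returned tokens as untyped metadata and avoid mapping them onto structured solvent or reagent fields.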