Training Sets

RetroCast publishes PaRoutes training artifacts as hosted, versioned datasets. You should be able to do one of two things:

  • download a split from the shell without memorizing URLs
  • load a split in Python as list[Route], route records, reaction records, or mapped reaction SMILES

The sections below document that interface.

Quick Start

Shell:

curl -fsSL https://files.ischemist.com/retrocast/get-training-set.sh | bash -s -- reaction-holdout-n1-n5 --split training

Python:

from retrocast.datasets import load_training_set

train_routes = load_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
    as_="routes",
)

One-step reaction training:

from retrocast.datasets import load_training_set

train_reactions = load_training_set(
    "paroutes",
    artifact="single-step-reaction-holdout-n1-n5",
    split="training",
    as_="reaction_records",
)

Artifacts

Use reaction-holdout-n1-n5 unless you specifically need a route-holdout baseline.

| artifact | use it for | guarantee |
| --- | --- | --- |
| route-holdout-n1-n5 | multistep route models | exact n1 ∪ n5 routes removed |
| reaction-holdout-n1-n5 | multistep route models | exact routes removed, then holdout reactions excised |
| single-step-reaction-holdout-n1-n5 | one-step reaction models | flattened from reaction-holdout-n1-n5 |

Route artifacts expose:

  • all.jsonl.gz
  • training.jsonl.gz
  • validation.jsonl.gz
  • manifest.json

Single-step artifacts expose:

  • all.jsonl.gz
  • training.jsonl.gz
  • validation.jsonl.gz
  • all.rsmi.txt.gz
  • training.rsmi.txt.gz
  • validation.rsmi.txt.gz
  • manifest.json
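Each split file is gzip-compressed JSONL: one record per line. That means a split can be inspected without any RetroCast helpers. A minimal sketch, where the two-record sample file is a stand-in written on the spot, not real PaRoutes data:

```python
import gzip
import json
import tempfile
from pathlib import Path

def read_jsonl_gz(path: Path) -> list[dict]:
    """Read one JSON record per line from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Write a tiny stand-in split; real files like training.jsonl.gz
# use the same one-record-per-line layout.
sample = Path(tempfile.mkdtemp()) / "training.jsonl.gz"
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    fh.write('{"route_id": "r1"}\n{"route_id": "r2"}\n')

records = read_jsonl_gz(sample)
print(len(records))  # 2
```

The record schema itself (field names, nesting) comes from the artifact's manifest and the typed loaders below; the sketch only shows the container format.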

Python API

load_training_set() is the high-level entrypoint:

from retrocast.datasets import load_training_set

val_records = load_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="validation",
    as_="route_records",
    release="latest",
)

Supported as_ values:

  • routes -> list[Route]
  • route_records -> list[TrainingRouteRecord]
  • reaction_records -> list[TrainingReactionRecord]
  • reaction_smiles -> list[str]
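Assuming the reaction_smiles strings follow the usual reactants>agents>products reaction SMILES convention, each one can be split into its components without chemistry tooling. An illustrative helper; the example reaction below is invented, not taken from the dataset:

```python
def split_reaction_smiles(rxn: str) -> tuple[list[str], list[str], list[str]]:
    """Split 'reactants>agents>products' into component SMILES lists."""
    # '>' never occurs inside a SMILES token, so a plain split is safe.
    reactants, agents, products = (
        part.split(".") if part else [] for part in rxn.split(">")
    )
    return reactants, agents, products

rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O"
reactants, agents, products = split_reaction_smiles(rxn)
print(reactants)  # ['CC(=O)O', 'OCC']
```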

Low-level control stays available:

from retrocast.datasets import download_training_set
from retrocast.io import load_training_route_records, load_training_routes

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
    release="latest",
    as_="routes",
)

records = load_training_route_records(path)
routes = load_training_routes(path)

download_training_set() will:

  • resolve release="latest" to a concrete published release
  • download the requested file into the local cache
  • verify the artifact before returning it
  • return the local Path
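The verification step is straightforward to reproduce for artifacts you mirror yourself. A sketch using a SHA-256 digest comparison; the digest algorithm and manifest field names RetroCast actually uses are not specified here, so treat those as assumptions:

```python
import hashlib
import tempfile
from pathlib import Path

def verify_sha256(path: Path, expected: str) -> bool:
    """Compare a file's SHA-256 hex digest against an expected value."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == expected

# Stand-in artifact; a real check would take `expected` from a manifest.
blob = Path(tempfile.mkdtemp()) / "artifact.bin"
blob.write_bytes(b"retrocast")
expected = hashlib.sha256(b"retrocast").hexdigest()
print(verify_sha256(blob, expected))  # True
```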

Use output_dir=... when you want the artifact materialized into an explicit project-owned location instead of the managed cache:

from pathlib import Path

from retrocast.datasets import download_training_set

path = download_training_set(
    "paroutes",
    artifact="reaction-holdout-n1-n5",
    split="training",
    as_="routes",
    output_dir=Path("data/training"),
)

Shell API

The shell interface mirrors the Python API. get-training-set.sh defaults to release=latest and prints the local path of the downloaded file.

Download route records:

curl -fsSL https://files.ischemist.com/retrocast/get-training-set.sh | bash -s -- reaction-holdout-n1-n5 --split training

Download route JSONL for a pinned release:

curl -fsSL https://files.ischemist.com/retrocast/get-training-set.sh | bash -s -- reaction-holdout-n1-n5 --split validation --release v2026-05-12

Download single-step mapped reaction SMILES:

curl -fsSL https://files.ischemist.com/retrocast/get-training-set.sh | bash -s -- single-step-reaction-holdout-n1-n5 --split training --format reaction-smiles

Materialize into a project directory instead of the default cache:

curl -fsSL https://files.ischemist.com/retrocast/get-training-set.sh | bash -s -- reaction-holdout-n1-n5 --split training --dir data/training

Release Resolution

Users should be able to rely on three things:

  • release="latest" resolves to the newest published release
  • pinned releases like v2026-05-12 remain stable
  • artifact downloads are verified before they are returned to Python or shell callers
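Because pinned releases use date-shaped tags like v2026-05-12, "latest" resolution can be sketched as a lexicographic max over the published tags. This illustrates the contract above, not RetroCast's actual implementation:

```python
def resolve_release(requested: str, published: list[str]) -> str:
    """Resolve 'latest' to the newest vYYYY-MM-DD tag; pass pins through."""
    if requested != "latest":
        return requested
    # Zero-padded date tags sort correctly as plain strings.
    return max(published)

releases = ["v2025-11-03", "v2026-05-12", "v2026-01-20"]
print(resolve_release("latest", releases))       # v2026-05-12
print(resolve_release("v2025-11-03", releases))  # v2025-11-03
```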

Conditions

PaRoutes condition slots remain metadata, not structured solvents/reagents, because the source labels are not reliable enough to support that structure. A slot may be a solvent, a reagent, a mixture of both, or even something that should have been modeled as a reactant.

Single-step records expose both:

  • condition_slot: raw PaRoutes text
  • condition_slot_smiles: best-effort canonicalized SMILES tokens

Use condition_slot_smiles when present. Keep condition_slot when you want the original raw signal.
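A typical consumption pattern is to prefer condition_slot_smiles and fall back to the raw text. The sketch below treats records as plain dicts for illustration, which is an assumption; the library returns typed records:

```python
def condition_tokens(record: dict) -> list[str]:
    """Prefer canonical SMILES tokens; fall back to the raw slot text."""
    smiles = record.get("condition_slot_smiles")
    if smiles:
        return smiles
    raw = record.get("condition_slot")
    return [raw] if raw else []

# Canonicalization succeeded: use the SMILES tokens.
canon = condition_tokens(
    {"condition_slot": "THF", "condition_slot_smiles": ["C1CCOC1"]}
)
# Canonicalization failed: keep the raw signal.
fallback = condition_tokens(
    {"condition_slot": "hydrogen atmosphere", "condition_slot_smiles": None}
)
print(canon, fallback)
```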

Local Cache

Downloaded artifacts live in a RetroCast-managed local cache, so callers do not need to choose download destinations or manually de-duplicate files across notebooks, scripts, and trainers.

Default location:

~/.cache/retrocast/training-sets

Both the shell downloader and the Python dataset API use the same cache layout:

~/.cache/retrocast/training-sets/paroutes/<release>/<artifact>/<file>

Override it with:

  • RETROCAST_TRAINING_SET_CACHE_DIR for both shell and Python
  • cache_dir=... in Python when you want per-call control

If you want the downloaded artifact in an explicit project-owned location rather than the shared cache:

  • shell: --dir PATH
  • Python: output_dir=Path(...)
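The documented layout plus the environment override reduce to a few lines of path arithmetic. A sketch; the helper name cache_path is hypothetical, but the layout it produces matches the one documented above:

```python
import os
from pathlib import Path

def cache_path(release: str, artifact: str, filename: str) -> Path:
    """Build <cache>/paroutes/<release>/<artifact>/<file>, honoring the env override."""
    root = os.environ.get(
        "RETROCAST_TRAINING_SET_CACHE_DIR",
        str(Path.home() / ".cache" / "retrocast" / "training-sets"),
    )
    return Path(root) / "paroutes" / release / artifact / filename

os.environ["RETROCAST_TRAINING_SET_CACHE_DIR"] = "/tmp/rc-cache"
p = cache_path("v2026-05-12", "reaction-holdout-n1-n5", "training.jsonl.gz")
print(p.as_posix())
# /tmp/rc-cache/paroutes/v2026-05-12/reaction-holdout-n1-n5/training.jsonl.gz
```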