Analysis Workflows¶
RetroCast includes a suite of analysis scripts designed for rigorous statistical evaluation.
Prerequisites

- Scripts are located in the `scripts/` directory
- Execute using `uv run scripts/<script-name>.py`
- Requires scored predictions in `data/4-scored/`
Quick Reference¶
| Script | Purpose | Key Output |
|---|---|---|
| `02-compare.py` | Multi-model comparison on one benchmark | Interactive HTML plots |
| `03-compare-paired.py` | Statistical significance testing | Confidence intervals for differences |
| `04-rank.py` | Probabilistic ranking with uncertainty | Rank probability heatmap |
| `05-tournament.py` | Round-robin model comparison | Win/loss matrix |
| `06-check-seed-stability.py` | Benchmark subset variance analysis | Forest plot across seeds |
| `07-create-model-profile.py` | Single model across benchmarks | Cross-benchmark performance |
Directory Structure¶
These scripts rely on the standard RetroCast data directory structure:
data/
├── 1-benchmarks/
│ ├── definitions/ # *.json.gz benchmark definitions
│ └── stocks/ # *.txt stock files (one SMILES per line)
├── 4-scored/ # Scored predictions (output of `retrocast score`)
│ └── <benchmark>/<model>/<stock>/evaluation.json.gz
└── 6-comparisons/ # Output directory for visualizations
Required before running workflows
You must run `retrocast score` before using these analysis scripts. They require `data/4-scored/` to exist.
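As a reference for the layout above, here is a minimal sketch of how a script might locate and load scored evaluation files, assuming only the path pattern shown (the contents of `evaluation.json.gz` depend on what `retrocast score` wrote and are not spelled out here; the helper name is illustrative):

```python
import gzip
import json
from pathlib import Path

def load_evaluation(benchmark: str, model: str, stock: str,
                    root: Path = Path("data/4-scored")) -> dict:
    """Load one scored evaluation following the data/4-scored/ layout."""
    path = root / benchmark / model / stock / "evaluation.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)  # schema is whatever `retrocast score` wrote

# Discover every scored (benchmark, model, stock) combination on disk.
for eval_path in Path("data/4-scored").glob("*/*/*/evaluation.json.gz"):
    benchmark, model, stock = eval_path.parts[-4:-1]
    print(benchmark, model, stock)
```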
General Model Comparison¶
Script: 02-compare.py
Use case: Compare multiple models on a single benchmark
Generates a comprehensive visual comparison with interactive Plotly HTML files for the following metrics (sketched in code after the list):
- Overall Solvability
- Stratified Solvability (by route length/depth)
- Top-K accuracy
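To make the metrics concrete, the following illustrative sketch computes overall solvability and Top-K accuracy from a simplified per-target representation (a boolean solved flag and the 1-based rank of the first prediction matching the reference route, or `None` if none matches). The function names and data layout are assumptions for illustration; the script itself derives these values from the scored predictions:

```python
from typing import Optional, Sequence

def solvability(solved: Sequence[bool]) -> float:
    """Fraction of targets for which at least one valid route was found."""
    return sum(solved) / len(solved)

def top_k_accuracy(first_correct_rank: Sequence[Optional[int]], k: int) -> float:
    """Fraction of targets whose reference route appears within the top-k predictions."""
    hits = sum(1 for rank in first_correct_rank if rank is not None and rank <= k)
    return hits / len(first_correct_rank)

# Toy data for four targets (hypothetical).
solved = [True, True, False, True]
ranks = [1, 4, None, 11]          # 1-based rank of the first matching route
print(solvability(solved))        # 0.75
print(top_k_accuracy(ranks, k=5)) # 0.5
```

Stratified solvability applies the same solvability calculation within each route-length or depth bucket.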
Usage¶
uv run scripts/02-compare.py \
--benchmark stratified-linear-600 \
--models dms-explorer-xl aizynthfinder-mcts retro-star \
--stock n5-stock
Output¶
data/6-comparisons/stratified-linear-600/
- `solvability_overall.html` - Overall performance comparison
- `solvability_stratified.html` - Performance by route length
- `topk_accuracy.html` - Top-1, Top-5, Top-10 comparison
Paired Hypothesis Testing¶
Script: 03-compare-paired.py
Use case: Statistical significance testing between models
Don't trust overlapping CIs
Comparing overlapping confidence intervals is insufficient for determining if one model outperforms another. Use this script instead.
Performs a paired difference test using bootstrap resampling: targets are resampled with replacement, and the difference in the chosen metric is recomputed on each resample to build a 95% confidence interval for

\[\Delta = \text{Metric}_{\text{challenger}} - \text{Metric}_{\text{baseline}}\]
If the 95% CI for \(\Delta\) does not contain zero, the difference is statistically significant.
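A minimal sketch of the paired bootstrap, assuming each model has been reduced to an aligned per-target 0/1 outcome vector (the function name and inputs are illustrative, not the script's actual API):

```python
import numpy as np

def paired_bootstrap_ci(baseline: np.ndarray, challenger: np.ndarray,
                        n_boot: int = 10_000, seed: int = 0):
    """Mean per-target difference (challenger - baseline) with a 95% bootstrap CI."""
    rng = np.random.default_rng(seed)
    delta = challenger.astype(float) - baseline.astype(float)
    n = len(delta)
    # Resample targets with replacement; the pairing between models is preserved
    # because the same resampled indices apply to both outcome vectors.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = delta[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return delta.mean(), lo, hi

# Toy example: per-target 0/1 outcomes for 600 targets.
rng = np.random.default_rng(1)
baseline = rng.random(600) < 0.45
challenger = rng.random(600) < 0.48
mean_d, lo, hi = paired_bootstrap_ci(baseline, challenger)
print(f"Δ = {mean_d:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")  # significant if CI excludes zero
```

Because the same resampled target indices are applied to both models, per-target correlation is preserved, which is what makes the test paired.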
Usage¶
uv run scripts/03-compare-paired.py \
--benchmark stratified-linear-600 \
--baseline dms-deep \
--challengers dms-flash dms-wide \
--n-boot 10000 # (1)!
- Number of bootstrap resamples (10,000 recommended for publication)
Output¶
Terminal: Rich-formatted table showing:
- Mean difference (\(\bar{\Delta}\))
- 95% confidence interval
- Significance flag (✅) for Solvability and Top-K metrics
Example:
┌──────────────┬────────────┬──────────────────┬──────────┐
│ Challenger │ Metric │ Δ [95% CI] │ Sig. │
├──────────────┼────────────┼──────────────────┼──────────┤
│ dms-flash │ Top-1 │ +2.3% [0.8, 3.9] │ ✅ │
│ dms-wide │ Top-1 │ -0.5% [-2.1, 1.2]│ │
└──────────────┴────────────┴──────────────────┴──────────┘
Probabilistic Ranking¶
Script: 04-rank.py
Use case: Quantify uncertainty in model rankings
Is Model A really better than Model B?
Ranking by single scalar values (e.g., "45.2% vs 45.1%") is misleading due to statistical noise. This script calculates the probability each model is the true winner.
Performs a Monte Carlo simulation (see the sketch after this list):
- Resample the dataset 10,000 times
- Rank models in each sample
- Aggregate results to compute rank probabilities
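A minimal sketch of this procedure, again assuming aligned per-target 0/1 outcome vectors per model (illustrative names, not the script's actual implementation):

```python
import numpy as np

def rank_probabilities(outcomes: dict[str, np.ndarray],
                       n_boot: int = 10_000, seed: int = 0) -> dict[str, np.ndarray]:
    """P(model achieves each rank) under bootstrap resampling of the benchmark targets."""
    rng = np.random.default_rng(seed)
    models = list(outcomes)
    data = np.stack([outcomes[m].astype(float) for m in models])  # (n_models, n_targets)
    n_models, n_targets = data.shape
    rank_counts = np.zeros((n_models, n_models))
    for _ in range(n_boot):
        idx = rng.integers(0, n_targets, size=n_targets)  # 1. resample the dataset
        scores = data[:, idx].mean(axis=1)                # metric per model on this resample
        order = np.argsort(-scores)                       # 2. rank models in this sample
        for rank, model_idx in enumerate(order):
            rank_counts[model_idx, rank] += 1             # 3. aggregate rank counts
    return {m: rank_counts[i] / n_boot for i, m in enumerate(models)}

# Toy example: expected rank and P(rank=1) for two models with per-target 0/1 outcomes.
probs = rank_probabilities({
    "dms-flash": np.random.default_rng(1).random(600) < 0.48,
    "dms-wide": np.random.default_rng(2).random(600) < 0.46,
}, n_boot=2_000)
for model, dist in probs.items():
    print(model, "E[rank] =", dist @ np.arange(1, len(dist) + 1), "P(rank=1) =", dist[0])
```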
Usage¶
uv run scripts/04-rank.py \
--benchmark stratified-linear-600 \
--models dms-explorer-xl dms-flash dms-wide \
--metric top-1 # (1)!
- Metric to rank by: `top-1`, `top-5`, `top-10`, or `solvability`
Output¶
Terminal: Table showing:
- Expected rank (E[rank])
- Probability of achieving 1st place (P(rank=1))
File: data/6-comparisons/stratified-linear-600/ranking_heatmap_top-1.html
Interactive heatmap visualizing the full rank probability distribution
Pairwise Tournament¶
Script: 05-tournament.py
Use case: Round-robin comparison to establish model hierarchy
Runs a round-robin tournament where every model is compared against every other model using the paired difference test (see the sketch after this list). Useful for:
- Identifying non-transitive relationships (A > B, B > C, but C > A)
- Establishing hierarchy in a large field of models
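A minimal sketch of how the win/loss matrix could be assembled by repeating the paired bootstrap for every pair of models (illustrative, under the same per-target outcome-vector assumption as above):

```python
import itertools
import numpy as np

def tournament_matrix(outcomes: dict[str, np.ndarray],
                      n_boot: int = 10_000, seed: int = 0) -> dict[tuple[str, str], int]:
    """Cell (a, b) is +1 if model a significantly beats model b, -1 if it loses, 0 otherwise."""
    rng = np.random.default_rng(seed)
    models = list(outcomes)
    matrix = {(a, b): 0 for a, b in itertools.permutations(models, 2)}
    for a, b in itertools.combinations(models, 2):
        delta = outcomes[a].astype(float) - outcomes[b].astype(float)
        n = len(delta)
        idx = rng.integers(0, n, size=(n_boot, n))
        lo, hi = np.percentile(delta[idx].mean(axis=1), [2.5, 97.5])
        if lo > 0:        # a significantly beats b
            matrix[(a, b)], matrix[(b, a)] = +1, -1
        elif hi < 0:      # a significantly loses to b
            matrix[(a, b)], matrix[(b, a)] = -1, +1
    return matrix
```

The +1/-1 cells correspond to the green (significant win) and red (significant loss) cells in the rendered heatmap.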
Usage¶
uv run scripts/05-tournament.py \
--benchmark stratified-linear-600 \
--models dms-explorer-xl dms-flash dms-wide aizynthfinder-mcts \
--metric top-1
Output¶
Terminal: Matrix table where cell \((i, j)\) shows:
- Significant wins (green)
- Significant losses (red)
File: data/6-comparisons/stratified-linear-600/pairwise_matrix_top-1.html
Interactive heatmap of tournament results
Seed Stability Analysis¶
Script: 06-check-seed-stability.py
Use case: Validate benchmark subset representativeness
Why does seed matter?
When creating subsets of benchmarks (e.g., 100-route test set from a larger pool), the random seed can influence difficulty. This script quantifies that variance.
Analyzes model performance across many different random seeds of the same benchmark class and generates a forest plot to visualize the variance in measured performance caused by dataset selection.
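A minimal sketch of the z-score computation reported in the terminal, assuming the metric has already been measured on each seeded subset (the numbers below are placeholders; the script derives them from the scored predictions):

```python
import statistics

# Hypothetical solvability measured for the same model on five seeded subsets.
metric_by_seed = {42: 0.452, 299792458: 0.447, 19910806: 0.461,
                  17760704: 0.439, 20251030: 0.450}

mean = statistics.mean(metric_by_seed.values())
std = statistics.stdev(metric_by_seed.values())

# z-score per seed: how far each subset's measured difficulty sits from the cross-seed mean.
z_scores = {seed: (value - mean) / std for seed, value in metric_by_seed.items()}

# The most representative seed is the one whose metric is closest to the mean (smallest |z|).
most_representative = min(z_scores, key=lambda s: abs(z_scores[s]))
print(z_scores)
print("most representative seed:", most_representative)
```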
Usage¶
uv run scripts/06-check-seed-stability.py \
--model dms-explorer-xl \
--base-benchmark stratified-linear-600 \
--stock n5-stock \
--seeds 42 299792458 19910806 17760704 20251030 # (1)!
- Test multiple random seeds to measure variance
Output¶
Terminal: Z-scores for each seed
- Identifies which seed produces the most representative (closest to mean) difficulty
File: data/7-meta-analysis/seed_stability.html
Forest plot showing metric distribution across seeds with confidence intervals
Model Profiling¶
Script: 07-create-model-profile.py
Use case: Profile a single model across multiple benchmarks
Inverse of script 02
- Script 02: Many models on one benchmark
- Script 07: One model on many benchmarks
Useful for profiling:
- Sensitivity to route type (linear vs. convergent)
- Performance degradation with route length
- Comparison across different stock definitions
Usage¶
uv run scripts/07-create-model-profile.py \
--model dms-explorer-xl \
--benchmarks ref-lin-600 ref-cnv-400 mkt-lin-500 # (1)!
- Compare the same model across different benchmark types
Output¶
data/6-comparisons/<model-name>/
Standard comparison plots, where the x-axis or legend groups represent different benchmarks rather than different models
Best Practices¶
Recommended workflow
- Start broad: Use `02-compare.py` to visualize all models
- Test significance: Use `03-compare-paired.py` for top performers
- Quantify uncertainty: Use `04-rank.py` to get probabilistic rankings
- Deep dive: Use `05-tournament.py` for comprehensive comparisons
- Validate benchmarks: Use `06-check-seed-stability.py` when creating new subsets
- Profile models: Use `07-create-model-profile.py` to understand strengths/weaknesses