Analysis Workflows¶
RetroCast includes a suite of analysis scripts designed for rigorous statistical evaluation.
Prerequisites

- Scripts are located in the `scripts/` directory
- Execute using `uv run scripts/<script-name>.py`
- Requires scored predictions in `data/4-scored/`
Quick Reference¶
| Script | Purpose | Key Output |
|---|---|---|
| `02-compare.py` | Multi-model comparison on one benchmark | Interactive HTML plots |
| `03-compare-paired.py` | Statistical significance testing | Confidence intervals for differences |
| `04-rank.py` | Probabilistic ranking with uncertainty | Rank probability heatmap |
| `05-tournament.py` | Round-robin model comparison | Win/loss matrix |
| `06-check-seed-stability.py` | Benchmark subset variance analysis | Forest plot across seeds |
| `07-create-model-profile.py` | Single model across benchmarks | Cross-benchmark performance |
Directory Structure¶
These scripts rely on the standard RetroCast data directory structure:
data/
├── 1-benchmarks/
│ ├── definitions/ # *.json.gz benchmark definitions
│ └── stocks/ # *.txt stock files (one SMILES per line)
├── 4-scored/ # Scored predictions (output of `retrocast score`)
│ └── <benchmark>/<model>/<stock>/evaluation.json.gz
└── 6-comparisons/ # Output directory for visualizations
Required before running workflows
You must run `retrocast score` before using these analysis scripts. They require `data/4-scored/` to exist.
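As a reference for the layout above, here is a minimal sketch of how a script might locate and load scored evaluation files, assuming only the path pattern shown (the contents of `evaluation.json.gz` depend on what `retrocast score` wrote and are not spelled out here; the helper name is illustrative):

```python
import gzip
import json
from pathlib import Path

def load_evaluation(benchmark: str, model: str, stock: str,
                    root: Path = Path("data/4-scored")) -> dict:
    """Load one scored evaluation following the data/4-scored/ layout."""
    path = root / benchmark / model / stock / "evaluation.json.gz"
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)  # schema is whatever `retrocast score` wrote

# Discover every scored (benchmark, model, stock) combination on disk.
for eval_path in Path("data/4-scored").glob("*/*/*/evaluation.json.gz"):
    benchmark, model, stock = eval_path.parts[-4:-1]
    print(benchmark, model, stock)
```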
General Model Comparison¶
Script: 02-compare.py
Use case: Compare multiple models on a single benchmark
Generates a comprehensive visual comparison with interactive Plotly HTML files for the following metrics (sketched in code after the list):
- Overall Solvability
- Stratified Solvability (by route length/depth)
- Top-K accuracy
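To make the metrics concrete, the following illustrative sketch computes overall solvability and Top-K accuracy from a simplified per-target representation (a boolean solved flag and the 1-based rank of the first prediction matching the reference route, or `None` if none matches). The function names and data layout are assumptions for illustration; the script itself derives these values from the scored predictions:

```python
from typing import Optional, Sequence

def solvability(solved: Sequence[bool]) -> float:
    """Fraction of targets for which at least one valid route was found."""
    return sum(solved) / len(solved)

def top_k_accuracy(first_correct_rank: Sequence[Optional[int]], k: int) -> float:
    """Fraction of targets whose reference route appears within the top-k predictions."""
    hits = sum(1 for rank in first_correct_rank if rank is not None and rank <= k)
    return hits / len(first_correct_rank)

# Toy data for four targets (hypothetical).
solved = [True, True, False, True]
ranks = [1, 4, None, 11]          # 1-based rank of the first matching route
print(solvability(solved))        # 0.75
print(top_k_accuracy(ranks, k=5)) # 0.5
```

Stratified solvability applies the same solvability calculation within each route-length or depth bucket.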
Usage¶
uv run scripts/02-compare.py \
--benchmark stratified-linear-600 \
--models dms-explorer-xl aizynthfinder-mcts retro-star \
--stock n5-stock
Output¶
data/6-comparisons/stratified-linear-600/
- `solvability_overall.html` - Overall performance comparison
- `solvability_stratified.html` - Performance by route length
- `topk_accuracy.html` - Top-1, Top-5, Top-10 comparison
Paired Hypothesis Testing¶
Script: 03-compare-paired.py
Use case: Statistical significance testing between models
Don't trust overlapping CIs
Comparing overlapping confidence intervals is insufficient for determining if one model outperforms another. Use this script instead.
Performs a paired difference test using bootstrap resampling: targets are resampled with replacement, and the difference in the chosen metric is recomputed on each resample to build a 95% confidence interval for

\[\Delta = \text{Metric}_{\text{challenger}} - \text{Metric}_{\text{baseline}}\]
If the 95% CI for \(\Delta\) does not contain zero, the difference is statistically significant.
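A minimal sketch of the paired bootstrap, assuming each model has been reduced to an aligned per-target 0/1 outcome vector (the function name and inputs are illustrative, not the script's actual API):

```python
import numpy as np

def paired_bootstrap_ci(baseline: np.ndarray, challenger: np.ndarray,
                        n_boot: int = 10_000, seed: int = 0):
    """Mean per-target difference (challenger - baseline) with a 95% bootstrap CI."""
    rng = np.random.default_rng(seed)
    delta = challenger.astype(float) - baseline.astype(float)
    n = len(delta)
    # Resample targets with replacement; the pairing between models is preserved
    # because the same resampled indices apply to both outcome vectors.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = delta[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return delta.mean(), lo, hi

# Toy example: per-target 0/1 outcomes for 600 targets.
rng = np.random.default_rng(1)
baseline = rng.random(600) < 0.45
challenger = rng.random(600) < 0.48
mean_d, lo, hi = paired_bootstrap_ci(baseline, challenger)
print(f"Δ = {mean_d:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")  # significant if CI excludes zero
```

Because the same resampled target indices are applied to both models, per-target correlation is preserved, which is what makes the test paired.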
Usage¶
uv run scripts/03-compare-paired.py \
--benchmark stratified-linear-600 \
--baseline dms-deep \
--challengers dms-flash dms-wide \
--n-boot 10000 # (1)!
- Number of bootstrap resamples (10,000 recommended for publication)
Output¶
Terminal: Rich-formatted table showing:
- Mean difference (\(\bar{\Delta}\))
- 95% confidence interval
- Significance flag (✅) for Solvability and Top-K metrics
Example:
┌──────────────┬────────────┬──────────────────┬──────────┐
│ Challenger │ Metric │ Δ [95% CI] │ Sig. │
├──────────────┼────────────┼──────────────────┼──────────┤
│ dms-flash │ Top-1 │ +2.3% [0.8, 3.9] │ ✅ │
│ dms-wide │ Top-1 │ -0.5% [-2.1, 1.2]│ │
└──────────────┴────────────┴──────────────────┴──────────┘
Probabilistic Ranking¶
Script: 04-rank.py
Use case: Quantify uncertainty in model rankings
Is Model A really better than Model B?
Ranking by single scalar values (e.g., "45.2% vs 45.1%") is misleading due to statistical noise. This script calculates the probability each model is the true winner.
Performs a Monte Carlo simulation (see the sketch after this list):
- Resample the dataset 10,000 times
- Rank models in each sample
- Aggregate results to compute rank probabilities
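A minimal sketch of this procedure, again assuming aligned per-target 0/1 outcome vectors per model (illustrative names, not the script's actual implementation):

```python
import numpy as np

def rank_probabilities(outcomes: dict[str, np.ndarray],
                       n_boot: int = 10_000, seed: int = 0) -> dict[str, np.ndarray]:
    """P(model achieves each rank) under bootstrap resampling of the benchmark targets."""
    rng = np.random.default_rng(seed)
    models = list(outcomes)
    data = np.stack([outcomes[m].astype(float) for m in models])  # (n_models, n_targets)
    n_models, n_targets = data.shape
    rank_counts = np.zeros((n_models, n_models))
    for _ in range(n_boot):
        idx = rng.integers(0, n_targets, size=n_targets)  # 1. resample the dataset
        scores = data[:, idx].mean(axis=1)                # metric per model on this resample
        order = np.argsort(-scores)                       # 2. rank models in this sample
        for rank, model_idx in enumerate(order):
            rank_counts[model_idx, rank] += 1             # 3. aggregate rank counts
    return {m: rank_counts[i] / n_boot for i, m in enumerate(models)}

# Toy example: expected rank and P(rank=1) for two models with per-target 0/1 outcomes.
probs = rank_probabilities({
    "dms-flash": np.random.default_rng(1).random(600) < 0.48,
    "dms-wide": np.random.default_rng(2).random(600) < 0.46,
}, n_boot=2_000)
for model, dist in probs.items():
    print(model, "E[rank] =", dist @ np.arange(1, len(dist) + 1), "P(rank=1) =", dist[0])
```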
Usage¶
uv run scripts/04-rank.py \
--benchmark stratified-linear-600 \
--models dms-explorer-xl dms-flash dms-wide \
--metric top-1 # (1)!
- Metric to rank by: `top-1`, `top-5`, `top-10`, or `solvability`
Output¶
Terminal: Table showing:
- Expected rank (E[rank])
- Probability of achieving 1st place (P(rank=1))
File: data/6-comparisons/stratified-linear-600/ranking_heatmap_top-1.html
Interactive heatmap visualizing the full rank probability distribution
Pairwise Tournament¶
Script: 05-tournament.py
Use case: Round-robin comparison to establish model hierarchy
Runs a round-robin tournament where every model is compared against every other model using the paired difference test (see the sketch after this list). Useful for:
- Identifying non-transitive relationships (A > B, B > C, but C > A)
- Establishing hierarchy in a large field of models
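A minimal sketch of how the win/loss matrix could be assembled by repeating the paired bootstrap for every pair of models (illustrative, under the same per-target outcome-vector assumption as above):

```python
import itertools
import numpy as np

def tournament_matrix(outcomes: dict[str, np.ndarray],
                      n_boot: int = 10_000, seed: int = 0) -> dict[tuple[str, str], int]:
    """Cell (a, b) is +1 if model a significantly beats model b, -1 if it loses, 0 otherwise."""
    rng = np.random.default_rng(seed)
    models = list(outcomes)
    matrix = {(a, b): 0 for a, b in itertools.permutations(models, 2)}
    for a, b in itertools.combinations(models, 2):
        delta = outcomes[a].astype(float) - outcomes[b].astype(float)
        n = len(delta)
        idx = rng.integers(0, n, size=(n_boot, n))
        lo, hi = np.percentile(delta[idx].mean(axis=1), [2.5, 97.5])
        if lo > 0:        # a significantly beats b
            matrix[(a, b)], matrix[(b, a)] = +1, -1
        elif hi < 0:      # a significantly loses to b
            matrix[(a, b)], matrix[(b, a)] = -1, +1
    return matrix
```

The +1/-1 cells correspond to the green (significant win) and red (significant loss) cells in the rendered heatmap.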
Usage¶
uv run scripts/05-tournament.py \
--benchmark stratified-linear-600 \
--models dms-explorer-xl dms-flash dms-wide aizynthfinder-mcts \
--metric top-1
Output¶
Terminal: Matrix table where cell \((i, j)\) shows:
- Significant wins (green)
- Significant losses (red)
File: data/6-comparisons/stratified-linear-600/pairwise_matrix_top-1.html
Interactive heatmap of tournament results
Seed Stability Analysis¶
Script: 06-check-seed-stability.py
Use case: Validate benchmark subset representativeness
Why does seed matter?
When creating subsets of benchmarks (e.g., 100-route test set from a larger pool), the random seed can influence difficulty. This script quantifies that variance.
Analyzes model performance across many different random seeds of the same benchmark class and generates a forest plot to visualize the variance in measured performance caused by dataset selection.
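A minimal sketch of the z-score computation reported in the terminal, assuming the metric has already been measured on each seeded subset (the numbers below are placeholders; the script derives them from the scored predictions):

```python
import statistics

# Hypothetical solvability measured for the same model on five seeded subsets.
metric_by_seed = {42: 0.452, 299792458: 0.447, 19910806: 0.461,
                  17760704: 0.439, 20251030: 0.450}

mean = statistics.mean(metric_by_seed.values())
std = statistics.stdev(metric_by_seed.values())

# z-score per seed: how far each subset's measured difficulty sits from the cross-seed mean.
z_scores = {seed: (value - mean) / std for seed, value in metric_by_seed.items()}

# The most representative seed is the one whose metric is closest to the mean (smallest |z|).
most_representative = min(z_scores, key=lambda s: abs(z_scores[s]))
print(z_scores)
print("most representative seed:", most_representative)
```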
Usage¶
uv run scripts/06-check-seed-stability.py \
--model dms-explorer-xl \
--base-benchmark stratified-linear-600 \
--stock n5-stock \
--seeds 42 299792458 19910806 17760704 20251030 # (1)!
- Test multiple random seeds to measure variance
Output¶
Terminal: Z-scores for each seed
- Identifies which seed produces the most representative (closest to mean) difficulty
File: data/7-meta-analysis/seed_stability.html
Forest plot showing metric distribution across seeds with confidence intervals
Model Profiling¶
Script: 07-create-model-profile.py
Use case: Profile a single model across multiple benchmarks
Inverse of script 02
- Script 02: Many models on one benchmark
- Script 07: One model on many benchmarks
Useful for profiling:
- Sensitivity to route type (linear vs. convergent)
- Performance degradation with route length
- Comparison across different stock definitions
Usage¶
uv run scripts/07-create-model-profile.py \
--model dms-explorer-xl \
--benchmarks ref-lin-600 ref-cnv-400 mkt-lin-500 # (1)!
- Compare the same model across different benchmark types
Output¶
data/6-comparisons/<model-name>/
Standard comparison plots, where the x-axis or legend groups represent different benchmarks rather than different models
Best Practices¶
Recommended workflow
- Start broad: Use `02-compare.py` to visualize all models
- Test significance: Use `03-compare-paired.py` for top performers
- Quantify uncertainty: Use `04-rank.py` to get probabilistic rankings
- Deep dive: Use `05-tournament.py` for comprehensive comparisons
- Validate benchmarks: Use `06-check-seed-stability.py` when creating new subsets
- Profile models: Use `07-create-model-profile.py` to understand strengths/weaknesses