# Evaluation

Evaluation scores predicted routes, keyed by benchmark target ID, against a stock file of purchasable starting materials. If you are starting from raw planner output, adapt and collect the routes first (see the complete sketch at the end of this page).
## Tracking Runtime

```python title="Measure inference time"
from retrocast.utils import ExecutionTimer

timer = ExecutionTimer()

# Time each target's prediction individually, keyed by target ID.
for target in benchmark.targets.values():
    with timer.measure(target.id):
        raw_output = model.predict(target.smiles)
        # ... adapt/store results ...

# Collect the recorded timings into a stats model.
exec_stats = timer.to_model()
```
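The essence of the pattern is a context manager that records wall-clock time per key. Here is a minimal stand-in using only the standard library — a sketch of the idea, not retrocast's actual `ExecutionTimer`:

```python
import time
from contextlib import contextmanager


class SimpleTimer:
    """Records elapsed wall-clock seconds per key (illustration only)."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def measure(self, key: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.durations[key] = time.perf_counter() - start
```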
## Score Predictions

```python title="Evaluate routes against stock"
from retrocast.api import load_benchmark, load_stock_file, score_predictions

benchmark = load_benchmark("data/1-benchmarks/definitions/mkt-cnv-160.json.gz")
stock = load_stock_file("data/1-benchmarks/stocks/buyables-stock.txt")

# Predictions are keyed by benchmark target ID: dict[target_id, list[Route]]
predictions = {"target-001": [route1, route2], "target-002": [route3]}

results = score_predictions(
    benchmark=benchmark,
    predictions=predictions,
    stock=stock,
    model_name="Experimental-Model-V1",
)

# Inspect the per-target evaluation results.
for target_id, evaluation in results.results.items():
    print(f"\nTarget: {target_id}")
    print(f"  Is solvable: {evaluation.is_solvable}")
    print(f"  Top-1 solved: {evaluation.top_1_is_solved}")
    print(f"  GT rank: {evaluation.gt_rank}")
    print(f"  Best route length: {evaluation.best_route_length}")
```
Predictions must be keyed by benchmark target ID. Each route is evaluated on two criteria: whether all of its leaf molecules (starting materials) are present in the stock, and whether the route matches the benchmark ground truth.
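Conceptually, the per-route solvability check looks like the sketch below. The `Route.leaves` attribute and the SMILES membership test are assumptions for illustration, not retrocast's exact API:

```python
def route_is_solved(route, stock) -> bool:
    """A route is 'solved' when every leaf (starting material) is in stock.

    Hypothetical sketch: assumes route.leaves yields leaf molecules with a
    .smiles attribute and that stock supports SMILES membership tests.
    """
    return all(leaf.smiles in stock for leaf in route.leaves)
```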
## Complete Evaluation Sketch

```python title="Adapt, collect, score"
from retrocast import adapt_provider_output, collect_benchmark_predictions, get_adapter, load_benchmark
from retrocast.api import load_stock_file, score_predictions

benchmark = load_benchmark("benchmark.json.gz")
stock = load_stock_file("stock.txt")

# Convert raw planner output (here: AiZynthFinder) into retrocast routes.
adapter = get_adapter("aizynth")
routes = adapt_provider_output(raw_provider_output, adapter)

# Group the adapted routes under their benchmark target IDs.
collected = collect_benchmark_predictions(routes, benchmark)

results = score_predictions(
    benchmark=benchmark,
    predictions=collected.routes_by_target,
    stock=stock,
    model_name="my-model",
)
```
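From `results` you can derive aggregate metrics. For example, a top-1 solve rate over all evaluated targets, using only the fields shown above:

```python
# Fraction of targets whose top-ranked route is fully buyable.
solved = sum(1 for ev in results.results.values() if ev.top_1_is_solved)
print(f"Top-1 solve rate: {solved / len(results.results):.1%}")
```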