BioMysteryBench
eval
Bioinformatics research benchmark from Anthropic with 99 problems derived, in anonymized form, from public-archive submissions. Each entry has a question, an answer rubric, allowed-domain markers, and a human-solvability flag. The benchmark is intended to evaluate models on real bioinformatics reasoning tasks rather than static QA.
Problem statements, rubrics, and task formulation are released under CC BY 4.0; the bundled data archives (~159 GB) remain subject to their source repositories' policies. Evaluation-only use is permitted; training or distilling models against the benchmark is explicitly prohibited. Released alongside Anthropic's Frontier Safety work on dual-use biology.
Evaluation Details
Questions 99
Domains 3
Scoring Final-answer accuracy against objective ground-truth answers (rubric-graded on the final answer only, method-agnostic); accuracy averaged over 5 trials per problem
Human baseline 76 of 99 problems are 'human-solvable' (answered correctly by at least one of up to 5 domain experts working from scratch); the remaining 23 are 'human-difficult' (not solved by any expert). These counts reflect the set after 4 malformed questions were removed during QC.
Domains: DNA/RNA sequencing, proteomics, metabolomics
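The scoring rule above (per-problem accuracy averaged over 5 trials, then averaged across problems) can be illustrated with a minimal Python sketch. It is not the benchmark's official harness: the `ProblemResult` class, its field names, and the helper functions are hypothetical, and the rubric grading that produces each per-trial verdict is assumed to have already happened.

```python
from dataclasses import dataclass
from statistics import mean

TRIALS_PER_PROBLEM = 5  # from the card: accuracy is averaged over 5 trials per problem


@dataclass
class ProblemResult:
    """Hypothetical container for one problem's graded trials."""
    problem_id: str             # illustrative identifier, not an official field name
    human_solvable: bool        # the card's human-solvability flag
    trial_correct: list[bool]   # one rubric verdict per trial (final answer only)


def problem_accuracy(result: ProblemResult) -> float:
    """Fraction of trials whose final answer was graded correct."""
    if len(result.trial_correct) != TRIALS_PER_PROBLEM:
        raise ValueError(
            f"{result.problem_id}: expected {TRIALS_PER_PROBLEM} trials, "
            f"got {len(result.trial_correct)}"
        )
    return sum(result.trial_correct) / TRIALS_PER_PROBLEM


def benchmark_accuracy(results: list[ProblemResult]) -> float:
    """Mean of per-problem accuracies across all problems."""
    return mean(problem_accuracy(r) for r in results)


def accuracy_by_solvability(results: list[ProblemResult]) -> dict[str, float]:
    """Report accuracy separately for human-solvable and human-difficult problems."""
    groups: dict[str, list[float]] = {"human-solvable": [], "human-difficult": []}
    for r in results:
        key = "human-solvable" if r.human_solvable else "human-difficult"
        groups[key].append(problem_accuracy(r))
    # Skip empty groups so mean() is never called on an empty list.
    return {name: mean(scores) for name, scores in groups.items() if scores}


if __name__ == "__main__":
    demo = [
        ProblemResult("demo-1", True, [True, True, True, True, True]),
        ProblemResult("demo-2", False, [True, False, False, False, False]),
    ]
    print(benchmark_accuracy(demo))
    print(accuracy_by_solvability(demo))
```

Splitting the report by the human-solvability flag is a design choice of this sketch; it mirrors the 76/23 split in the human baseline but is not a stated requirement of the benchmark.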