BioMysteryBench

Bioinformatics research benchmark from Anthropic with 99 problems built from anonymized derivatives of public-archive submissions. Each entry has a question, an answer rubric, allowed-domain markers, and a human-solvability flag. The benchmark is intended to evaluate models on real bioinformatics reasoning tasks rather than static QA.

Problem statements, rubrics, and task formulation are released under CC BY 4.0; the bundled data archives (~159 GB) remain subject to their source repositories' policies. Evaluation-only use is permitted; training or distilling models against the benchmark is explicitly prohibited. Released alongside Anthropic's Frontier Safety work on dual-use biology.

HuggingFace (full set) | HuggingFace (preview set)

Evaluation Details

Questions: 99

Domains: 3

Scoring: Final-answer accuracy against objective ground-truth answers. Grading applies the rubric to the final answer only and is method-agnostic. Accuracy is averaged over 5 trials per problem.

Human baseline: 76 of 99 problems are 'human-solvable' (answered correctly by at least one of up to 5 domain experts working from scratch); the remaining 23 are 'human-difficult' (unsolved by any expert). These counts are after 4 malformed questions were removed during QC.

Domains: DNA/RNA sequencing, proteomics, metabolomics
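The scoring rule above (per-problem accuracy averaged over trials, then reported overall and by human-solvability) can be sketched as follows. This is a minimal illustration, not the benchmark's official harness: the function names, the problem IDs, the dictionary layout, and the choice to average per-problem scores with equal weight are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def per_problem_accuracy(trials: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of trials graded correct for each problem.

    `trials` maps a (hypothetical) problem ID to one boolean per trial,
    where True means the final answer satisfied the rubric.
    """
    return {pid: sum(outcomes) / len(outcomes) for pid, outcomes in trials.items()}


def _average(values: Iterable[float]) -> float:
    """Plain mean; returns NaN for an empty group."""
    values = list(values)
    return sum(values) / len(values) if values else float("nan")


def summarize(trials: Dict[str, List[bool]],
              human_solvable: Dict[str, bool]) -> Dict[str, float]:
    """Average per-problem accuracy overall and by human-solvability flag.

    Assumption: every problem carries equal weight in each average.
    """
    acc = per_problem_accuracy(trials)
    return {
        "overall": _average(acc.values()),
        "human_solvable": _average(a for p, a in acc.items() if human_solvable[p]),
        "human_difficult": _average(a for p, a in acc.items() if not human_solvable[p]),
    }


# Toy data with 5 trials per problem (made-up outcomes, not real results).
example_trials = {
    "q1": [True, True, True, True, True],
    "q2": [True, False, True, False, False],
    "q3": [False, False, False, False, False],
}
example_flags = {"q1": True, "q2": True, "q3": False}
example_summary = summarize(example_trials, example_flags)
```

On the full benchmark the same computation would run over 99 problems, with 76 in the human-solvable group and 23 in the human-difficult group.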

eval, benchmark, science, safety

