OPI-Struc (STELLA)
datasetOpen Protein Instructions for Structures — the structure-conditioned companion to BAAI's earlier sequence-only OPI dataset. Built for training multimodal LLMs that ground protein-functional reasoning in both sequence and 3D structure, with each sample formatted as a multi-turn conversation containing a <structure> token where the protein structure embedding is inserted.
351,183 training samples and 40,993 test samples (~599 GB) drawn from UniProtKB/Swiss-Prot + AlphaFold DB (function tasks) and the Enzyme Commission dataset + RCSB PDB (enzyme tasks). Two task families: Functional Description Prediction (free-text + multiple choice) and Enzyme-catalyzed Reaction Prediction.
Released as the data backbone for STELLA (arXiv 2506.03800, ACL 2026 Findings) — a multimodal LLM for protein functional annotation via unified sequence-structure encoding. CC BY-NC 4.0.