Foundation Models for Biology

Overview

Foundation models for biology are large neural networks pre-trained on biological data (genomics, transcriptomics, imaging, patient records, or combinations thereof) and then fine-tuned for downstream scientific and clinical tasks. Unlike text-only large language models, they learn from raw biological data modalities that have no direct natural-language analogue, which calls for purpose-built architectures and training objectives. Early reports are consistent with the scaling behavior observed in language modeling carrying over to biological domains: NOETIK observed no performance plateau when scaling from 20M to 4B parameters, and Xaira reported R²=0.971 on the Replogle-Nadig perturbation prediction dataset.
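
For readers unfamiliar with the two quantities mentioned above, the following is a minimal sketch of how a coefficient of determination (R²) and a power-law scaling trend are commonly computed. It is not the evaluation code used by NOETIK or Xaira. The function names, the power-law form loss ≈ a · N^(-b), and all numbers are illustrative assumptions; the data is synthetic.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


def fit_power_law(n_params, losses):
    """Fit loss ~= a * N**(-b) by least squares in log-log space.

    Returns (a, b). The power-law form is an assumption made for
    illustration, not a claim about any specific model family.
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic example only: model sizes spanning 20M to 4B parameters,
# with losses generated from a made-up power law (a=5.0, b=0.1).
n_params = np.array([2e7, 1e8, 5e8, 4e9])
losses = 5.0 * n_params ** -0.1

a, b = fit_power_law(n_params, losses)
fitted = a * n_params ** -b
fit_quality = r_squared(losses, fitted)
```

An R² close to 1 means the predictions explain nearly all of the variance in the observed values. A log-log fit with no flattening at the largest model sizes is what is usually meant by "no performance plateau".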