New from @arcinstitute: Evo 2, the largest (by training compute) biology ML model ever, and one of the largest-ever open-source ML models in any category. Evo 2 is a foundation model trained on 9T DNA base pairs that learns a lot of fundamental details about life.
A few examples of why it's cool:
Evo 2 seems to learn about pathogenic human mutations. Despite having just a single human genome in the training set, the model is zero-shot SOTA at predicting harmful BRCA1 mutations. (See attached figure.) I find this to be pretty amazing -- there's some kind of emergent understanding of cross-species biological function, despite fully unsupervised training.
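To make "zero-shot" concrete, here is a minimal sketch of one common way to score a mutation with an autoregressive sequence model: compare the log-likelihood of the mutated sequence against the reference. This assumes likelihood-based scoring, which the post itself does not spell out. The order-2 Markov model below is only a toy stand-in for a real genomic model, and `train_markov`, `sequence_log_likelihood`, and `variant_score` are hypothetical names, not part of any Evo 2 API.

```python
import math

def train_markov(seqs, k=2, alphabet="ACGT", pseudocount=1.0):
    """Toy stand-in for a genomic language model: order-k Markov counts."""
    counts = {}
    for s in seqs:
        for i in range(k, len(s)):
            ctx = counts.setdefault(s[i - k:i], {})
            ctx[s[i]] = ctx.get(s[i], 0.0) + 1.0

    def log_prob(context, base):
        # Smoothed log P(base | previous k bases).
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob

def sequence_log_likelihood(log_prob, seq, k=2):
    """Sum of per-base log-probabilities, skipping the first k bases."""
    return sum(log_prob(seq[i - k:i], seq[i]) for i in range(k, len(seq)))

def variant_score(log_prob, ref, pos, alt, k=2):
    """Log-likelihood of the variant minus that of the reference.

    Negative means the model finds the mutated sequence less likely.
    """
    var = ref[:pos] + alt + ref[pos + 1:]
    return (sequence_log_likelihood(log_prob, var, k)
            - sequence_log_likelihood(log_prob, ref, k))

# Train the toy model on a repetitive sequence, then score a substitution.
model = train_markov(["ACGT" * 10])
reference = "ACGTACGTACGT"
score = variant_score(model, reference, pos=5, alt="T")  # C -> T at index 5
```

No labels about which mutations are harmful are used anywhere: the score comes purely from what the model learned about which sequences look "normal".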
Training a sparse autoencoder on Evo 2 results in a model that learns higher-level features across species, setting the stage for biological mechanistic interpretability. For example, the model appears to learn a general concept of coding regions (parts of the genome that are turned into proteins). This feature activates across human, bacterial, and woolly mammoth DNA (the last of which wasn't included in the training set), even though the coding regions are represented quite differently across different parts of the tree of life. We're quite excited about where this can go.
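For anyone unfamiliar with sparse autoencoders, here is a minimal sketch of the generic idea: encode a model activation into a wider, mostly-zero feature vector, decode it back, and penalize both reconstruction error and feature magnitude. It is the textbook ReLU-plus-L1 formulation, not necessarily the exact setup used for Evo 2; the weights are hand-picked for illustration rather than trained, and all names are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation into sparse features, then reconstruct it."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    x_hat = [a + b for a, b in zip(matvec(W_dec, features), b_dec)]
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(f) for f in features)
    return recon + l1_coeff * sparsity

# A 2-dimensional "activation" expanded into 4 candidate features.
x = [1.0, -2.0]
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, features, l1_coeff=0.1)
```

In practice the weights are learned by minimizing this loss over many activations, and individual features are then inspected to see what they respond to.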
Since it's an autoregressive model, Evo 2 can also help with genomic generation. For example, Evo 2 can be used to generate DNA with tailored chromatin accessibility (i.e. controlling how much of it is likely to be transcribed), unlocking new possibilities in synthetic biology.
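As a rough illustration of what "autoregressive" buys you for generation, here is a minimal sketch of the sampling loop: ask the model for a distribution over the next base, sample one, append it, repeat. The `toy_next_base_probs` function is a made-up stand-in for a real model, and the sketch covers only plain sampling, not the guided design needed to hit a chromatin accessibility target.

```python
import random

def toy_next_base_probs(seq):
    """Hypothetical stand-in for a model: GC-biased after G or C, else uniform."""
    if seq and seq[-1] in "GC":
        return {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
    return {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sample_sequence(next_base_probs, prompt, length, rng):
    """Extend `prompt` by `length` bases, one sampled base at a time."""
    seq = prompt
    for _ in range(length):
        probs = next_base_probs(seq)
        bases, weights = zip(*sorted(probs.items()))
        seq += rng.choices(bases, weights=weights, k=1)[0]
    return seq

# Seeded generator so the run is reproducible.
generated = sample_sequence(toy_next_base_probs, "ACG", 20, random.Random(0))
```

Each new base is conditioned on everything generated so far, which is why the same model that scores sequences can also write them.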