An AI model trained on more than 100,000 species can understand genetic code as never before, offering vast potential for precision medicine.
Evo 2, published in Nature, could help pinpoint disease-causing genetic variants, inform therapeutic design, and potentially even reshape what is defined as actionable genetic risk.
The DNA foundation model makes sense of the imprint left by evolution on biological sequences, enabling it to read, write and think in genetic code.
Trained on 9.3 trillion nucleotides from more than 128,000 species, it is the largest fully open-source AI model of its kind so far.
The resource has been made freely available by its co-creators through the GitHub repository of the non-profit research organization Arc Institute and via U.S. tech company Nvidia’s BioNeMo Framework.
The 40-billion-parameter model offers the potential to understand the impact of genetic variants by spotting patterns across disparate organisms that would take researchers years to discover.
This makes it an immensely flexible tool, with applications that range from predicting disease-causing mutations to designing artificial life.
It has already been used to create functional, synthetic bacteriophages—specialist viruses that can infect and replicate only in bacteria—which offer the potential to treat antibiotic-resistant microbes.
In addition, Evo 2 has achieved over 90% accuracy in predicting harmful BRCA1 gene variants and could help classify many variants of uncertain significance.
The model, which processes up to one million base pairs in a single context window, could also shorten early-stage therapeutic development timelines.
Incorporating the logic of gene regulation across thousands of species, it gives a new starting point for designing modulatory elements, potentially identifying the most promising from a bank of candidates.
“If you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells,” explained researcher Hani Goodarzi, PhD, from the University of California, San Francisco.
“This precise control could help develop more targeted treatments with fewer side effects.”
Human pathogens were excluded from the training data as a biosafety measure, so that the model would not return productive answers to queries about them.
Ultimately, the research team sees Evo 2 as a foundation onto which more specific AI tools could be built.
“In a loose way, you can think of the model almost like an operating system kernel—you can have all of these different applications that are built on top of it,” said co-author David Burke, PhD, from the Arc Institute.
“From predicting how single DNA mutations affect a protein’s function to designing genetic elements that behave differently in different cell types, as we continue to refine the model and researchers begin using it in creative ways, we expect to see beneficial uses for Evo 2 we haven’t even imagined yet.”
