Small variations in DNA sequence can alter an organism’s response to the environment or its susceptibility to disease. Interpreting the impact of genome sequence variation remains a research challenge. Non-coding variants, which lie outside protein-coding regions, are particularly difficult to interpret because of the diversity of their possible molecular consequences.
The Google DeepMind team behind the Nobel Prize-winning AlphaFold has now published AlphaGenome in Nature. The DNA sequence model advances regulatory variant-effect prediction, with the aim of better understanding genome function.
AlphaGenome’s applications include the identification of new therapeutic targets and the design of synthetic DNA with specific regulatory functions. The authors state that AlphaGenome is particularly suitable for studying rare variants with potentially large effects, such as those causing rare Mendelian disorders.
“Determining the relevance of different non-coding variants can be extremely challenging, particularly to do at scale,” said Marc Mansour, PhD, a professor at University College London who focuses on hematological malignancies. “This tool will provide a crucial piece of the puzzle, allowing us to make better connections to understand diseases like cancer.”
AlphaGenome takes a long DNA sequence of up to one million base pairs as input and predicts thousands of molecular properties that characterize regulatory activity. These include the locations where genes begin and end in different cell types and tissues, splice sites, RNA production, and DNA accessibility.
This long sequence context is crucial for covering regulatory regions that act on genes from far away. Previous models had to trade off sequence length against resolution, which limited the range of modalities that could be modeled.
AlphaGenome is also reported to be efficient: it scores the impact of a genetic variant on a range of molecular properties in one second by comparing its predictions for the mutated and unmutated sequences.
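The comparison of mutated and unmutated sequences can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not AlphaGenome's code or API: `toy_predict` is a hypothetical stand-in that derives two made-up "tracks" from simple sequence statistics, and `apply_snv` and `score_variant` are illustrative helper names. Only the overall pattern (predict for both sequences, then take the difference per track) reflects what the article describes.

```python
# Toy illustration only: `toy_predict` is a hypothetical stand-in for a
# sequence-to-track model. It is NOT the AlphaGenome API.

def toy_predict(sequence: str) -> dict[str, float]:
    """Return made-up 'track' values computed from simple sequence statistics."""
    n = len(sequence)
    gc_fraction = sum(base in "GC" for base in sequence) / n
    cpg_density = sequence.count("CG") / n
    return {"accessibility": gc_fraction, "expression": cpg_density}


def apply_snv(reference: str, position: int, alt: str) -> str:
    """Return `reference` with the base at 0-based `position` replaced by `alt`."""
    if not 0 <= position < len(reference):
        raise ValueError("position outside sequence")
    return reference[:position] + alt + reference[position + 1:]


def score_variant(reference: str, position: int, alt: str) -> dict[str, float]:
    """Score a variant as (alternate prediction - reference prediction) per track."""
    ref_pred = toy_predict(reference)
    alt_pred = toy_predict(apply_snv(reference, position, alt))
    return {track: alt_pred[track] - ref_pred[track] for track in ref_pred}


reference = "ATCGATTACA"
scores = score_variant(reference, position=4, alt="G")
print(scores)
```

In this toy, swapping A for G at position 4 raises the made-up "accessibility" value and leaves "expression" unchanged. A real model would replace `toy_predict` with learned predictions across thousands of tracks and a much longer sequence window.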
Many rare genetic diseases, such as spinal muscular atrophy and some forms of cystic fibrosis, can be caused by errors in RNA splicing. Notably, AlphaGenome can explicitly model the location and expression level of splice junctions directly from sequence, offering deeper insights into the consequences of genetic variants for RNA splicing.
“For the first time, we have a single model that unifies long-range context, base-level precision and state-of-the-art performance across a whole spectrum of genomic tasks,” said Caleb Lareau, PhD, principal investigator at Memorial Sloan Kettering Cancer Center.
Training data was sourced from large public consortia, including ENCODE, GTEx, 4D Nucleome, and FANTOM5, which experimentally measured gene regulation properties across hundreds of human and mouse cell types and tissues.
The authors note that accurately capturing the influence of regulatory elements at large distances, such as those that are 100,000 DNA letters away, remains a limitation for AlphaGenome. Another priority for future work is to improve the model’s ability to capture cell- and tissue-specific patterns.
In addition, AlphaGenome’s performance was characterized on individual genetic variants, and the model has not been validated for personal genome prediction. And while AlphaGenome predicts molecular outcomes, how genetic variations lead to complex traits or diseases involves broader biological processes, such as developmental and environmental factors, that are beyond the direct scope of the model.
DeepMind has made AlphaGenome available for non-commercial research.
