Tahoe Therapeutics, Arc Institute, and Biohub have each made a multimillion-dollar commitment to fill the massive data gap for virtual cell models. The teams exclusively told GEN Edge that more than 120 million single-cell data points across 225,000 perturbations will be generated using Tahoe’s Mosaic technology for mapping how drug molecules interact with biology.
All three organizations lead a field that builds AI models trained on transcriptome data to predict how cell gene expression changes with cell states. In therapeutics, these virtual cells could yield insight into new drugs capable of shifting cells from “diseased” to “healthy” with fewer off-target effects.
Notably, Tahoe, Arc, and Biohub have individually contributed key single-cell datasets, including Tahoe-100M, scBaseCount, and CELLxGENE, which have fueled models such as Tahoe-x1, STATE, and TranscriptFormer.
Tahoe-100M holds the current title of largest perturbation dataset to date and has achieved over 250,000 downloads since its open-source release last February.
The trio’s new dataset will be over four times more perturbation-rich than Tahoe-100M and span approximately 50 cell lines with 1,400 diverse chemical scaffolds, at three doses each, and 100 cytokine perturbations. The open-source release will also include metadata on patient-relevant contexts.
Data will be shared among the partners for an exclusive period before being made open-source for both commercial and non-commercial use. The timeline for dataset release has yet to be made public.
Next inflection point
While today’s virtual cells have made strides predicting cell behavior when generalizing to new biological contexts, the teams assert that large-scale perturbation data will be critical to bringing about the next technological inflection point. In the short term, that means uncovering new scaling laws to guide decisions on where to invest time and resources. In the long term, the field strives to predict clinical outcomes directly.
When asked about the origins of the collaboration, Nima Alidoust, PhD, CEO of Tahoe, said the teams had known each other for some time.
“We all talk to each other regularly and have worked for years to build the foundations of the field,” Alidoust told GEN Edge. “Somewhere along the line, particularly after the success and reception of Tahoe-100M, we realized that starting this initiative would be a landmark move to push the data-starved and nascent field of virtual cells forward.”
Patricia Brennan, PhD, vice president of technology and general manager for science at Biohub, emphasizes that advances in technology and plummeting sequencing costs have significantly increased the volume of biological datasets over the last several years. Yet, the complexity of biology warrants more data generation at larger scale and deeper representation than currently achievable by the average lab.
“This is why we are excited to partner with Tahoe and Arc on this new dataset to expand perturbative diversity, cell-type representation, and patient-relevant context,” Brennan told GEN Edge.
Once released, the dataset will be fair game for model training in Arc’s Virtual Cell Challenge, the annual benchmarking competition that recognizes models that can best predict how cells will respond to perturbations.
The inaugural 2025 challenge, sponsored by Nvidia, 10x Genomics, and Ultima Genomics, wrapped up in December, and surprisingly awarded two $100,000 grand prizes in response to mid-competition challenges in defining robust metrics for biological relevance. Altos Labs took home the Generalist Prize for the most “well-rounded model.”
Ron Alfa, MD, PhD, CEO of Noetik, concurs that it’s great to see more efforts focused on large-scale data generation for virtual cells given that the field is data-limited. He cautions that while scale matters, spatial cellular context remains key for training models that learn relevant biological representations.
“Large scale, controlled perturbation data is a good community resource for model development and can complement the type of virtual cell work we’re doing at Noetik, building models directly from spatial patient data,” Alfa told GEN Edge.
Noetik, a San Francisco-based company, tackles clinical translation for cancer with a human data-first approach for AI models. The company’s proprietary multimodal datasets span spatial proteomics, spatial transcriptomics, H&E (hematoxylin and eosin) pathology, DNA genotyping, and clinical metadata from cancer patient-derived tumors.
Just last week, Noetik announced a five-year licensing partnership with GSK, which gives the pharma giant access to Noetik’s non-small cell lung cancer and colorectal cancer foundation models. The deal includes a $50 million upfront payment and will follow a subscription-based framework. The partnership is also among the first and largest transactions monetizing a biological foundation model as a scalable enterprise asset.
Mosaic data
Launched in December 2022, Tahoe originates from the lab of Hani Goodarzi, PhD, core investigator at Arc, and is based on early work on the Mosaic platform by Johnny Yu, PhD, the current CSO of Tahoe, in collaboration with Kevin Shokat, PhD, professor of cellular and molecular pharmacology at the University of California, San Francisco (UCSF). The platform was primarily built for scaling drug perturbations in patient-derived in vivo and organoid models.
As brute-force approaches to capture chemical perturbations at single-cell resolution in diverse disease models require an infeasible amount of time and cost, Mosaic dramatically reduces the number of required experiments by multiplexing cells from different models into one tumor. The platform deconvolves the effects on each individual cell at scale, lowering the cost of single-cell sequencing by 100-fold.
“To build a company with the aim of virtual cell models, Mosaic is key,” asserted Alidoust. “It addresses the main bottleneck, lack of perturbative data on cells from diverse biological contexts.”
Mosaic unlocks the chemical counterpart to pooled genetic screens using Perturb-seq, a method that evaluates the effect of CRISPR-induced knockouts on single-cell gene expression. While genetic perturbations drive biological target discovery, chemical perturbations enable both target and drug discovery.
Last summer, AI unicorn Xaira Therapeutics released the largest Perturb-seq dataset, named X-Atlas/Orion, for non-commercial use to power virtual cell models. X-Atlas/Orion describes dose-dependent genetic effects covering eight million cells across two cancer cell lines. While the Tahoe, Arc, and Biohub partnership asserts a much larger data scale, Alidoust says the combination of the two types of perturbations will be valuable for the field.
Looking at model architecture, Goodarzi highlights that hybrid models that bake in prior domain knowledge tend to perform better when training data is too scarce to capture patterns from scratch.
“End-to-end models struggle not necessarily because they lack capacity, but because they have not been trained on the right data,” Goodarzi told GEN Edge. By unlocking a new level of diversity and statistical power, he affirms that the data initiative between Tahoe, Arc, and Biohub will be influential for defining benchmarks and enabling models to learn fundamentals of gene regulation.
All told, cells are profoundly complex. While each massive dataset makes a leap for novel biology, accurately predicting patient outcomes is still a reach. Yet, the continued momentum of open science is narrowing the gap, bringing biological insight ever closer to clinical impact.
