Joined March 2016
Pinned Tweet
14 Jun 2023
🧬🔮 Single cell foundation models have been a recent hot topic in bio-ML! A few of the recent methods and some thoughts 🧬🔮 1) Geneformer 2) scGPT 3) scFoundation 4) Exceiver
Will Connell retweeted
No scaling laws for single-cell foundation models: when bigger atlases stop teaching the model anything

In language and vision, the recipe has been simple: more data, bigger models, better performance. Single-cell biology borrowed that playbook. Foundation models for transcriptomics jumped from 1 million cells to atlases of over 100 million, on the assumption that scale would unlock the same gains.

Alan DenAdel and coauthors put that assumption to the test, and the result is sobering. Working from a 22.2-million-cell corpus, they pretrained 400 models across five architectures (from PCA and a variational autoencoder up to the Geneformer transformer) and ran 6,400 evaluation experiments. They varied not just dataset size (1% to 75%) but also diversity, using cell-type re-weighting and geometric sketching to deliberately enrich rare cell types and transcriptional states.

The finding: performance saturates almost immediately. On cell-type classification, batch integration, and perturbation prediction, most models hit their ceiling at roughly 1% of the corpus, about 200,000 cells. Beyond that, adding millions more cells changed essentially nothing. More diversity didn't help. Even spiking in genome-scale Perturb-seq data, to give the models perturbed phenotypes rather than just healthy ones, failed to move the needle. Larger models did score better overall, but they too plateaued early on data.

Two points stood out. Simple baselines (PCA, logistic regression) often matched or beat the transformers. And the strongest model, SCimilarity, won not because of size but because its contrastive training objective is aligned with the downstream task. For single-cell data, what you train on and how you frame the objective matters far more than how much you collect.

This reframes a quiet but expensive habit. In drug discovery, biotech, and any pipeline leaning on cell atlases, the instinct to keep scaling pretraining corpora may be burning compute for no return. The real leverage sits elsewhere: curating high-quality, task-relevant data and matching the training objective to the actual question you're trying to answer.

Paper: DenAdel et al., journal license | doi.org/10.1038/s41592-026-0…
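The "cell-type re-weighting" idea mentioned in the post above can be illustrated with a toy sketch. This is not the paper's procedure: it assumes a simple inverse-frequency scheme in which every cell type gets the same total sampling mass, and the function name `reweighted_subsample` and the toy labels are hypothetical.

```python
import numpy as np

def reweighted_subsample(cell_types, n_samples, seed=0):
    """Draw cell indices so that each cell type has equal total sampling mass.

    Assumption for illustration: inverse-frequency weights, which enrich
    rare cell types relative to uniform subsampling.
    """
    labels = np.asarray(cell_types)
    # counts[inverse] gives, for each cell, the size of its own cell type
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(labels), size=n_samples, replace=False, p=probs)

# Toy corpus: 990 common cells and 10 rare cells (1% rare)
labels = np.array(["T cell"] * 990 + ["rare progenitor"] * 10)
picked = reweighted_subsample(labels, n_samples=20)
rare_fraction = float(np.mean(labels[picked] == "rare progenitor"))
print(f"rare fraction in subsample: {rare_fraction:.2f} (corpus: 0.01)")
```

Under this scheme the rare type is heavily over-represented in the subsample, which is the kind of diversity enrichment the post says did not improve downstream performance.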
Not sure it’s that binary. Previously impossible information synthesis is unlocking many new interpretations …but do you have a framework to verify?
Alex on why AI drug discovery companies need to generate novel data to succeed: "AI models based on the research that's available is a lot of garbage in and garbage out." "A lot of the recorded literature is actually incorrect. There's been tons of studies that show if you go try to replicate the experiments that are in the literature, you don't even get the same results." "The AI companies that I believe are gonna be most set up for success are the companies with a novel way to generate science tokens that don't exist in the public domain."
There's a broader pattern here we're also finding success with @transcriptabio: provide structured data as context to elicit prior bio knowledge from an LLM. Here, there are steps of info restructure / distill via probes but worth asking - which are useful? Are they req? 🧐
We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic. We're now releasing an open source database for all variants in the NIH's ClinVar database. 🧵(1/8)
Will Connell retweeted
New Post: Quantitative Look at Biotech Platforms. Plenty has been written about bio platform strategy, but no one's put numbers around it. We used a whole lotta tokens to compile clinical, partnership, and financial data on the 100 most successful public platform biotechs of all time.
Will Connell retweeted
Auto-research for ML training models is all the rage now, but underrated is: auto-research for data! Sure, you can squeeze out a bit of model performance by optimizing hyperparameters, but code agents can effortlessly do data work that has been very labour intensive and required a lot of attention to a lot of details:
> download data from many different data sources
> bring all the data sources into uniform format
> do detailed EDA: find patterns and outliers
> look at 100s of samples and take detailed notes
> make beautiful infographics rather than mpl plots
> iterate on data filtering by looking at more samples
> make simple pipelines robust and scalable
It's now possible to write data pipelines for dozens of data sources in hours that would have taken weeks of reading many docs, debugging APIs and data formats, wrangling outliers and missing data. A few weeks ago we gave Claude access to the CPU partition of our cluster and it iteratively refined filters to retrieve a domain subset of FineWeb. This would have taken me 2-3 days to work through while it took Claude just a few hours with almost no babysitting and with a nice logbook. Thus the long tail of small, niche data sources becomes more accessible and can be aggregated to even larger high quality datasets for cool applications. Data has been fuelling LLM progress more than model architecture innovations, so I am very excited about this!
Great to see this out 👏🏼 plz read for a really big idea Also, I wrote an overview of Variational Synthesis in 2024 open.substack.com/pub/behind…

Manufacturing-aware generative models enable petascale synthesis of designed DNA go.nature.com/3NxXt1I
"We find that intra-complex interactions are largely conserved, whereas inter-complex relationships are extensively rewired, revealing new context-dependent genetic dependencies." 👏 💡rich resource for virtual cell benchmarking to disambig contextual-modeling vs coexpr-modeling
We mapped gene interactions across different environmental conditions (GxGxE) at scale for the first time in human cells. These maps lead to the realization that many genes function in a context dependent manner which provides insight into how humans have relatively few genes but many cell types. Congratulations Ben! Paper: cell.com/molecular-cell/full…
I built Scaling Biology 🧬 — a dashboard that live-tracks the volume and growth of key biological data sources across genomics, transcriptomics, and proteomics. The project is open to community contributions, check out the repo linked in footer wconnell.github.io/scaling-b…
Will Connell retweeted
1/13 Excited to share our (@anna_spiro @ChikinaLab @sara_mostafavi) latest preprint! 🧬💻 Personal Genome Prediction isn't just a downstream task—it’s the ultimate end-to-end benchmark for Variant Effect Prediction. We put the new SOTA AlphaGenome to the test and uncovered a striking "Modality Gap" between gene expression and chromatin accessibility. 📄 Link: biorxiv.org/content/10.64898… 🧵👇

Will Connell retweeted
Together with Emma Dann, we are thrilled to present a massive new Perturb-seq atlas of 22M primary CD4 T cells, from 4 donors, across 3 timepoints – the result of a decade-long collaboration between the Marson (@MarsonLab) and Pritchard (@jkpritch) labs. 🧵👇
Will Connell retweeted
The recent breakthroughs from @nablabio & @chaidiscovery emphasize a split in early biotech strategy. For the specific range of problems that antibodies address, making the binder is becoming trivial. This forces a choice between 'fast but competitive' and 'AI intractable'. 🧵
13 Nov 2025
This is a major reason I joined @transcriptabio. We've proved our platform in rare disease – an area that uniquely allows you to: 1) realize the mission of helping people, immediately 2) receive the gold-standard of clinical feedback, immediately 📄: researchsquare.com/article/r…
I will not stop tweeting this until every drug discovery company gets human evidence in 3 years or less
Will Connell retweeted
Most current drug discovery efforts are structure-based, e.g. creating small molecules or antibodies that best bind X. However, a drug may not derive its efficacy from the target it binds most strongly. Taking a step away from the structure paradigm, we reason that if a CRISPR knockout of a gene mimics a drug's effects across cancer cell lines, that gene is likely the drug's target. This was done in @EytanRuppin in collaboration with @anideshpandelab and @BenDavidLab.
Using this principle, we integrated drug and CRISPR profiles from 1000s of drugs to find their context-specific targets (in different cancers, or when the known target is not expressed but the drug still kills cancer cells). We call this tool DeepTarget.
We show that this approach outperforms current structure-based methods (AF3, RF, Chai) at finding a drug's target in a genome-wide search, when we had no information on what the target might be. We benchmarked on eight gold-standard drug-target pairs. It took us months to build this benchmark (we hope it helps the field).
We present two experimentally validated cases; please see the paper for these (link at the end). An intriguing observation: in cases where many small molecules target the same gene (e.g. EGFR), we found that small molecules with higher predicted target specificity show greater clinical advancement.
Very happy to hear your feedback. Here's the free access link: nature.com/articles/s41698-0…
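The core principle in the post above (a gene whose knockout mimics the drug's effect across cell lines is a likely target) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DeepTarget implementation: it assumes a plain Pearson correlation between per-cell-line drug response and per-gene knockout effect, and the function name `rank_candidate_targets`, the gene labels, and all numbers are made up.

```python
import numpy as np

def rank_candidate_targets(drug_response, knockout_effects, gene_names):
    """Rank genes by how closely their knockout profile tracks the drug profile.

    drug_response:    viability effect of the drug, one value per cell line
    knockout_effects: array of shape (n_genes, n_cell_lines) with CRISPR
                      knockout viability effects for the same cell lines
    Assumption: similarity is Pearson correlation across cell lines.
    """
    drug = np.asarray(drug_response, dtype=float)
    ko = np.asarray(knockout_effects, dtype=float)
    drug_c = drug - drug.mean()
    ko_c = ko - ko.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(ko_c, axis=1) * np.linalg.norm(drug_c)
    # Genes with a constant knockout profile get correlation 0 instead of NaN
    corr = np.divide(ko_c @ drug_c, denom, out=np.zeros(len(ko)), where=denom > 0)
    order = np.argsort(-corr)
    return [(gene_names[i], float(corr[i])) for i in order]

# Toy data for five cell lines (more negative = stronger loss of viability)
drug = [-1.0, -0.8, -0.1, 0.0, -0.9]
knockouts = [
    [-0.9, -0.7, 0.0, 0.1, -1.0],    # GENE_A: knockout mimics the drug
    [0.0, -0.1, -0.9, -1.0, 0.1],    # GENE_B: opposite pattern
    [-0.5, -0.5, -0.5, -0.5, -0.5],  # GENE_C: uniform effect, uninformative
]
ranked = rank_candidate_targets(drug, knockouts, ["GENE_A", "GENE_B", "GENE_C"])
for gene, score in ranked:
    print(f"{gene}: {score:+.2f}")
```

In this toy setup GENE_A ranks first because its knockout profile follows the drug's profile across the five cell lines. A context-specific analysis of the kind the post describes would presumably restrict the cell lines to a subset (for example, one cancer type) before computing the same score.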
7 Nov 2025
“this discussion on the challenges of evaluating a Foundation model is more interesting than the challenge itself.” Agreed!
My second post on the Arc Virtual Cell Challenge. The challenge’s Discord forums are in turmoil. Some participants have discovered a trick to get to the top of the leaderboard. gmdbioinformatics.substack.c… #arc_virtual_cell_challenge #foundation_models
Will Connell retweeted
We're excited to present LeaVS, a method to scale up learning for protein function models. It is based on the co-design of wet lab experiments and in silico training.
Will Connell retweeted
Arc is hiring a unique role to lead the Virtual Cell Challenge. In its first year the Challenge has already attracted participation from thousands of top bio AI researchers and support from sponsors like NVIDIA. We need someone to help us make this annual competition historically impactful. The job is a mix of product, program, and community management, with collaboration across a wonderful and talented internal team at Arc. The Virtual Cell Challenge is modeled after CASP, which led to AlphaFold and a Nobel prize, and inspired by our board member Nat Friedman's Scroll Prize, which pushed the boundaries of applied machine learning. The person we hire for this role has an opportunity to make the Challenge something really special.
30 Sep 2025
Massive, clean Perturb-seq dataset. 8M cells, 2 cell types, deeply sequenced. 🧬🪩 👏
30 Sep 2025
Virtual Cell community - this one's for you! X-Atlas/Orion is now live on Hugging Face. Train your own models with streamlined workflows built into the Hugging Face API. 🔗 HuggingFace: huggingface.co/datasets/Xair… 📜 License: cc-by-nc-sa-4.0
17 Sep 2025
Awesome 👏🏼👏🏼
17 Sep 2025
Welcome to the age of generative genome design! In 1977, Sanger et al. sequenced the first genome—of phage ΦX174. Today, led by @samuelhking, we report the first AI-generated genomes. Using ΦX174 as a template, we made novel, high-fitness phages with genome language models. 🧵
3 Sep 2025
👏👏 Last year, I used some metrics from Logan to help me analyze the volume and growth rate of genomics data. You can find "Scaling biology: genomics" in the reply.
🌎👩‍🔬 For 15 years biology has accumulated petabytes (million gigabytes) of🧬DNA sequencing data🧬 from the far reaches of our planet.🦠🍄🌵 Logan now democratizes efficient access to the world’s most comprehensive genetics dataset. Free and open. doi.org/10.1101/2024.07.30.6…