Will Connell


Tweets

Pinned Tweet

Will Connell @wilstc

14 Jun 2023

🧬🔮 Single cell foundation models have been a recent hot topic in bio-ML! A few of the recent methods and some thoughts 🧬🔮
1) Geneformer
2) scGPT
3) scFoundation
4) Exceiver


Will Connell retweeted

Jorge Bravo Abad

@bravo_abad

Jun 9

No scaling laws for single-cell foundation models: when bigger atlases stop teaching the model anything

In language and vision, the recipe has been simple: more data, bigger models, better performance. Single-cell biology borrowed that playbook. Foundation models for transcriptomics jumped from 1 million cells to atlases of over 100 million, on the assumption that scale would unlock the same gains.

Alan DenAdel and coauthors put that assumption to the test, and the result is sobering. Working from a 22.2-million-cell corpus, they pretrained 400 models across five architectures (from PCA and a variational autoencoder up to the Geneformer transformer) and ran 6,400 evaluation experiments. They varied not just dataset size (1% to 75%) but also diversity, using cell-type re-weighting and geometric sketching to deliberately enrich rare cell types and transcriptional states.

The finding: performance saturates almost immediately. On cell-type classification, batch integration, and perturbation prediction, most models hit their ceiling at roughly 1% of the corpus, about 200,000 cells. Beyond that, adding millions more cells changed essentially nothing. More diversity didn't help. Even spiking in genome-scale Perturb-seq data, to give the models perturbed phenotypes rather than just healthy ones, failed to move the needle. Larger models did score better overall, but they too plateaued early on data.

Two points stood out. Simple baselines (PCA, logistic regression) often matched or beat the transformers. And the strongest model, SCimilarity, won not because of size but because its contrastive training objective is aligned with the downstream task. For single-cell data, what you train on and how you frame the objective matters far more than how much you collect.

This reframes a quiet but expensive habit. In drug discovery, biotech, and any pipeline leaning on cell atlases, the instinct to keep scaling pretraining corpora may be burning compute for no return. The real leverage sits elsewhere: curating high-quality, task-relevant data and matching the training objective to the actual question you're trying to answer.

Paper: DenAdel et al., journal license | doi.org/10.1038/s41592-026-0…
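One way to make "most models hit their ceiling at roughly 1% of the corpus" concrete is to ask for the smallest data fraction whose score is within a tolerance of the best score seen. The sketch below is only an illustration of that idea, not the paper's analysis: the function name, the tolerance, and the scores are all hypothetical.

```python
def saturation_fraction(scores_by_fraction, tolerance=0.01):
    """Return the smallest data fraction whose score is within
    `tolerance` of the best score across all fractions.

    `scores_by_fraction` maps a pretraining data fraction (e.g. 0.01
    for 1%) to a downstream metric where higher is better.
    """
    best = max(scores_by_fraction.values())
    for fraction in sorted(scores_by_fraction):
        if best - scores_by_fraction[fraction] <= tolerance:
            return fraction


# Hypothetical scores, invented for illustration (not results from the paper).
flat_curve = {0.01: 0.900, 0.10: 0.905, 0.25: 0.903, 0.50: 0.906, 0.75: 0.904}
rising_curve = {0.01: 0.50, 0.10: 0.70, 0.50: 0.90}

flat_result = saturation_fraction(flat_curve)      # curve is already flat at 1%
rising_result = saturation_fraction(rising_curve)  # curve keeps improving
```

A flat curve saturates at the smallest fraction, while a curve that keeps improving only "saturates" at the largest fraction tested, which is the pattern a real scaling law would produce.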


Will Connell @wilstc

Apr 21

Not sure it’s that binary. Previously impossible information synthesis is unlocking many new interpretations …but do you have a framework to verify?

Patrick OShaughnessy

@patrick_oshag

Apr 21

Alex on why AI drug discovery companies need to generate novel data to succeed: "AI models based on the research that's available is a lot of garbage in and garbage out." "A lot of the recorded literature is actually incorrect. There's been tons of studies that show if you go try to replicate the experiments that are in the literature, you don't even get the same results." "The AI companies that I believe are gonna be most set up for success are the companies with a novel way to generate science tokens that don't exist in the public domain."


Will Connell @wilstc

Apr 14

There's a broader pattern here we're also finding success with @transcriptabio: provide structured data as context to elicit prior bio knowledge from an LLM. Here, there are steps of info restructure / distill via probes but worth asking - which are useful? Are they req? 🧐

Goodfire

@GoodfireAI

Apr 14

We achieved state-of-the-art performance in predicting which of 4.2 million genetic variants cause diseases by interpreting a genomics model, in a new preprint with @MayoClinic. We're now releasing an open source database for all variants in the NIH's clinvar database. 🧵(1/8)


Will Connell retweeted

Mackenzie Morehead

@mackenziejem

Mar 26

New Post: Quantitative Look at Biotech Platforms

Plenty has been written about bio platform strategy but no one's put numbers around it.

We used a whole lotta tokens to compile clinical, partnership financial data on the 100 most successful public platform biotechs of all time


Will Connell retweeted

Leandro von Werra

@lvwerra

Mar 24

Auto-research for ML training models is all the rage now, but underrated is: auto-research for data!

Sure, you can squeeze out a bit of model performance by optimizing hyperparameters, but code agents can effortlessly do data work that has been very labour intensive and required a lot of attention to a lot of details:
> download data from many different data sources
> bring all the data sources into uniform format
> do detailed EDA: find patterns and outliers
> look at 100s of samples and take detailed notes
> make beautiful infographics rather than mpl plots
> iterate on data filtering by looking at more samples
> make simple pipelines robust and scalable

It's now possible to write data pipelines for dozens of data sources in hours that would have taken weeks of reading many docs, debugging APIs and data formats, wrangling outliers and missing data.

A few weeks ago we gave Claude access to the CPU partition of our cluster and it iteratively refined filters to retrieve a domain subset of FineWeb. This would have taken me 2-3 days to work through while it took Claude just a few hours with almost no babysitting and with a nice logbook.

Thus the long tail of small, niche data sources becomes more accessible and can be aggregated to even larger high quality datasets for cool applications. Data has been fuelling LLM progress more than model architecture innovations, so I am very excited about this!


Will Connell @wilstc

Mar 18

Great to see this out 👏🏼 plz read for a really big idea

Also, I wrote an overview of Variational Synthesis in 2024 open.substack.com/pub/behind…

Nature Biotechnology

@NatureBiotech

Mar 17

Manufacturing-aware generative models enable petascale synthesis of designed DNA go.nature.com/3NxXt1I


Will Connell @wilstc

Feb 19

"We find that intra-complex interactions are largely conserved, whereas inter-complex relationships are extensively rewired, revealing new context-dependent genetic dependencies." 👏 💡rich resource for virtual cell benchmarking to disambig contextual-modeling vs coexpr-modeling

LukeGilbert

@LukeGilbertSF

Feb 19

We mapped gene interactions across different environmental conditions (GxGxE) at scale for the first time in human cells. These maps lead to the realization that many genes function in a context dependent manner which provides insight into how humans have relatively few genes but many cell types. Congratulations Ben! Paper: cell.com/molecular-cell/full…


Will Connell @wilstc

Feb 18

I built Scaling Biology 🧬 — a dashboard that live-tracks the volume and growth of key biological data sources across genomics, transcriptomics, and proteomics. The project is open to community contributions, check out the repo linked in footer wconnell.github.io/scaling-b…

Scaling Biology

Tracking the growth of biological data across major databases.

wconnell.github.io


Will Connell retweeted

Xinming Tu

@TuXinming

Feb 5

1/13 Excited to share our (@anna_spiro @ChikinaLab @sara_mostafavi) latest preprint! 🧬💻 Personal Genome Prediction isn't just a downstream task—it’s the ultimate end-to-end benchmark for Variant Effect Prediction. We put the new SOTA AlphaGenome to the test and uncovered a striking "Modality Gap" between gene expression and chromatin accessibility. 📄 Link: biorxiv.org/content/10.64898… 🧵👇


Will Connell retweeted

Ronghui (Ron) Zhu @RonZhu2015

Jan 5

Together with Emma Dann, we are thrilled to present a massive new Perturb-seq atlas of 22M primary CD4 T cells, from 4 donors, across 3 timepoints – the result of a decade-long collaboration between the Marson (@MarsonLab) and Pritchard (@jkpritch) labs. 🧵👇


Will Connell retweeted

Martin Borch Jensen

@MartinBJensen

20 Nov 2025

The recent breakthroughs from @nablabio & @chaidiscovery emphasize a split in early biotech strategy. For the specific range of problems that antibodies address, making the binder is becoming trivial. This forces a choice between 'fast but competitive' and 'AI intractable'. 🧵


Will Connell @wilstc

13 Nov 2025

This is a major reason I joined @transcriptabio

We've proved our platform in rare disease – an area that uniquely allows you to:
1) realize the mission of helping people, immediately
2) receive the gold-standard of clinical feedback, immediately

📄: researchsquare.com/article/r…

High-Throughput Drug Discovery for a Rare Neurological Disorder: Uncovering a Novel Therapeutic...

Discovering new and viable therapies for genetic diseases is a time consuming and cost intensive process. This is even more challenging for rare disorders that affect a small fraction of the popula...

researchsquare.com

Dr. Shelby

@shelbynewsad

12 Nov 2025

I will not stop tweeting this until every drug discovery company gets human evidence in 3 years or less


Will Connell retweeted

Sanju Sinha @Sanjusinha7

10 Nov 2025

Most current drug discovery efforts are structure-based, e.g. creating small molecules or antibodies that best bind X. However, a drug may not derive its efficacy from its strongest binder.

Taking a step away from the structure paradigm, we reason that if a CRISPR knockout of a gene mimics a drug's effects across cancer cell lines, that gene is likely the drug's target. This was done in the @EytanRuppin lab in collaboration with @anideshpandelab and @BenDavidLab.

Using this principle, we integrated drug and CRISPR profiles from 1000s of drugs to find their context-specific targets (in different cancers, or when the known target is not expressed but the drug still kills cancer cells). We call this tool DeepTarget.

We show that this approach outperforms current structure-based methods (AF3, RF, Chai) at finding a drug's target in a genome-wide search, when we had no information on what the target might be. We benchmarked on eight gold-standard drug-target pairs. It took us months to build these benchmarks (we hope they help the field).

We present two experimentally validated cases; please see the paper for these (link at the end).

An intriguing observation: in the many cases where several small molecules target the same gene (e.g. EGFR), we found that small molecules with higher predicted target specificity show greater clinical advancement.

Very happy to hear your feedback. Here's the free access link: nature.com/articles/s41698-0…
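The core principle in the post above (a gene whose knockout mimics the drug across cell lines is a candidate target) can be sketched as a similarity ranking. This is a minimal illustration and not the DeepTarget implementation: using Pearson correlation as the similarity measure is an assumption, and the gene names and viability values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_candidate_targets(drug_response, knockout_effects):
    """Rank genes by how closely their knockout profile tracks the drug profile.

    drug_response: viability effect of the drug, one value per cell line.
    knockout_effects: gene -> viability effect of its CRISPR knockout,
        measured in the same cell lines and in the same order.
    Returns (gene, correlation) pairs, best match first.
    """
    scores = {gene: pearson(drug_response, profile)
              for gene, profile in knockout_effects.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy profiles across five cell lines (more negative = more killing).
drug_response = [-1.2, -0.1, -0.9, 0.0, -1.5]
knockout_effects = {
    "GENE_A": [-1.0, 0.0, -0.8, 0.1, -1.4],       # tracks the drug closely
    "GENE_B": [0.0, -1.1, 0.1, -0.9, 0.2],        # opposite pattern
    "GENE_C": [-0.5, -0.5, -0.5, -0.5, -0.5],     # uniform effect, uninformative
}

ranking = rank_candidate_targets(drug_response, knockout_effects)
top_gene = ranking[0][0]
```

Note that a gene with the same knockout effect in every cell line scores zero here: it carries no information about which cell lines the drug affects, however strong the effect is.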


Will Connell @wilstc

7 Nov 2025

“this discussion on the challenges of evaluating a Foundation model is more interesting than the challenge itself.” Agreed!

dalloliogm @dalloliogm

6 Nov 2025

My second post on the Arc Virtual Cell Challenge. The challenge’s Discord forums are in turmoil. Some participants have discovered a trick to get to the top of the leaderboard. gmdbioinformatics.substack.c… #arc_virtual_cell_challenge #foundation_models


Will Connell retweeted

Eli Weinstein @EliWeinstein6

21 Oct 2025

We're excited to present LeaVS, a method to scale up learning for protein function models. It is based on the co-design of wet lab experiments and in silico training.


Will Connell retweeted

Hani Goodarzi

@genophoria

6 Oct 2025

Arc is hiring a unique role to lead the Virtual Cell Challenge. In its first year the Challenge has already attracted participation from thousands of top bio AI researchers and support from sponsors like NVIDIA. We need someone to help us make this annual competition historically impactful. The job is a mix of product, program, and community management, with collaboration across a wonderful and talented internal team at Arc. The Virtual Cell Challenge is modeled after CASP, which led to AlphaFold and a Nobel prize, and inspired by our board member Nat Friedman's Scroll Prize, which pushed the boundaries of applied machine learning. The person we hire for this role has an opportunity to make the Challenge something really special.


Will Connell @wilstc

30 Sep 2025

Massive, clean Perturb-seq dataset. 8M cells, 2 cell types, deeply sequenced. 🧬🪩 👏

Ann Huang @_annhuang

30 Sep 2025

Virtual Cell community - this one's for you! X-Atlas/Orion is now live on Hugging Face. Train your own models with streamlined workflows built into the Hugging Face API. 🔗 HuggingFace: huggingface.co/datasets/Xair… 📜 License: cc-by-nc-sa-4.0


Will Connell @wilstc

17 Sep 2025

Awesome 👏🏼👏🏼

Brian Hie @BrianHie

17 Sep 2025

Welcome to the age of generative genome design! In 1977, Sanger et al. sequenced the first genome—of phage ΦX174. Today, led by @samuelhking, we report the first AI-generated genomes. Using ΦX174 as a template, we made novel, high-fitness phages with genome language models. 🧵


Will Connell @wilstc

3 Sep 2025

👏👏 last year, I used some metrics from Logan to help me analyze the volume and growth rate of genomics data

you can find "Scaling biology: genomics" in the reply

Rayan Chikhi @RayanChikhi

3 Sep 2025

🌎👩‍🔬 For 15 years biology has accumulated petabytes (million gigabytes) of🧬DNA sequencing data🧬 from the far reaches of our planet.🦠🍄🌵 Logan now democratizes efficient access to the world’s most comprehensive genetics dataset. Free and open. doi.org/10.1101/2024.07.30.6…


Will Connell @wilstc

3 Sep 2025

behindbioml.substack.com/p/s…