In AI-guided discovery, models often turn huge candidate pools into shortlists for costly validation.
We ask: can we put an error budget on AI-generated shortlists before running the experiment? For example:
• Can we keep failed hits below 10%?
• How many candidates should we test to get enough true positives?
• How far down the list can we go before expecting too many false positives?
• If we already have a fixed top-K list, how many are likely wrong?
📢 Excited to share TxConformal, a framework that turns AI scores into shortlists with controlled or estimated false positives, even in tasks where new candidates differ from past experimental data. This is joint work with the amazing @KexinHuang5, @jure, and @EmmanuelCandes, in collaboration with Genentech (@nate_diamant, @gabo_scalia).
We test it across proteins, genetic perturbations, regulatory DNA, clinical trials, ADMET, and antibacterial virtual screening. In a prospective A. baumannii screen at Genentech, TxConformal estimated 80.3 false positives before wet-lab validation; the experiment found 91, within the 90% CI.
Preprint: biorxiv.org/content/10.64898…
Code: github.com/ying531/TxConform…
🧵[1/n] 👇