AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
1. AUTOSCIENTISTS is a decentralized “AI lab team” designed for long-running computational experiments: agents maintain multiple competing hypotheses, run parallel experiments, and keep track of both successes and failures so the search can continue even after early ideas plateau.
2. Core design shift vs prior agentic systems: no central planner and no fixed search-space decomposition. Instead, agents coordinate through a shared experimental state (current champion, full experiment log, shared forum, team queues, and dead-end registries) and self-organize into teams that can be created/merged/split/retired as evidence changes.
3. The workflow alternates between discussion and execution phases. In discussion, agents propose research directions and critique each other’s proposals before spending compute. In execution, teams run experiments in parallel and write results back to shared state; when progress stagnates, agents trigger a new discussion and reorganize.
4. Two persistent roles per team: analyst agents and experiment agents. Analysts audit what has/hasn’t been tried, rank directions using empirical effect sizes from prior runs, enforce diversity/ambition constraints on new proposals, and maintain hypothesis documents. Experiment agents implement diffs, train/evaluate candidates, and log outcomes.
5. To prevent “champion pollution” from stochastic metrics, AUTOSCIENTISTS uses a noise-aware promotion gate: large improvements are accepted directly; small improvements within a measured noise band require confirmation on a second seed; failures and near-misses are still recorded to reduce repeated dead ends.
6. BioML-Bench (24 end-to-end biomedical ML tasks across imaging, drug discovery, protein engineering, single-cell omics): under matched experimental budgets and the same coding-agent backend, AUTOSCIENTISTS reaches 74.4% mean leaderboard percentile, outperforming Autoresearch by 8.33 points; the largest gains are in drug discovery (64.52% vs 46.16%).
7. GPT nanochat training optimization: from the same baseline, AUTOSCIENTISTS reaches a target validation bits-per-byte (val_bpb) in about 1.9× fewer experiments (34 vs 65). Starting from an AUTOSCIENTISTS-discovered champion, AUTOSCIENTISTS keeps improving (7 accepted changes, reaching 0.9730 val_bpb), whereas the single-agent baseline accepts no changes over 100 experiments.
8. ProteinGym supervised fitness prediction: starting from the strong Kermut method, AUTOSCIENTISTS discovers an extension that improves ACE2–Spike binding Spearman correlation from 0.747 to 0.840. Freezing the discovered recipe and applying it across all 217 ProteinGym assays improves the official average Spearman correlation from 0.657 to 0.700 (a 6.5% relative gain).
9. Ablations indicate the gains come from multiple complementary mechanisms rather than one trick: removing analysts, cross-agent feedback, self-organization, or shared state each causes major degradation, and which removal hurts most depends on the task (e.g., self-organization matters most when the productive direction shifts mid-run; shared records matter when avoiding duplicated failures is critical).
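To make the noise-aware promotion gate from point 5 concrete, here is a minimal Python sketch. It is not taken from the AUTOSCIENTISTS codebase: `SharedState`, `promotion_gate`, `rerun_with_second_seed`, and `noise_band` are hypothetical names. The thread also does not say how a second-seed run confirms a candidate or which score the new champion carries, so the sketch assumes the second seed must also beat the champion and that the promoted score is the mean of the runs. Lower scores are treated as better, as with val_bpb.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedState:
    """Hypothetical slice of the shared experimental state."""
    champion_score: float  # lower is better, e.g. val_bpb
    log: List[Dict] = field(default_factory=list)


def promotion_gate(
    state: SharedState,
    name: str,
    first_score: float,
    rerun_with_second_seed: Callable[[], float],
    noise_band: float,
) -> str:
    """Decide whether a candidate replaces the champion; always log the outcome."""
    gain = state.champion_score - first_score
    scores = [first_score]

    if gain <= 0:
        # No improvement at all: keep the champion.
        decision = "rejected"
    elif gain > noise_band:
        # Improvement clearly larger than the measured noise: accept directly.
        decision = "promoted"
    else:
        # Small gain inside the noise band: spend one more run to confirm.
        scores.append(rerun_with_second_seed())
        # Assumption: confirmed only if the second seed also beats the champion.
        confirmed = scores[1] < state.champion_score
        decision = "promoted" if confirmed else "unconfirmed"

    # Failures and near-misses are recorded too, so they are not retried blindly.
    state.log.append(
        {
            "name": name,
            "scores": scores,
            "decision": decision,
            "champion_before": state.champion_score,
        }
    )

    if decision == "promoted":
        # Assumption: the new champion score is the mean over the runs made.
        state.champion_score = sum(scores) / len(scores)
    return decision


if __name__ == "__main__":
    state = SharedState(champion_score=1.000)
    print(promotion_gate(state, "large-gain", 0.950, lambda: 0.950, noise_band=0.01))
    print(promotion_gate(state, "near-miss", 0.945, lambda: 0.955, noise_band=0.01))
    print(state.log)
```

Passing the second-seed run as a callable means the extra compute is spent only when the first result falls inside the noise band. Clear wins and clear losses cost one run each.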
💻Code:
github.com/mims-harvard/Auto…
📜Paper:
arxiv.org/abs/2605.28655
#AIAgents #MultiAgentSystems #AutoML #ScientificDiscovery #BiomedicalAI #ProteinEngineering #LLM #MachineLearning #Reproducibility