CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
1. CellFluxRL addresses a practical failure mode in image-based “virtual cell” generators: samples can look realistic yet violate basic biology (e.g., nuclei appearing outside cytoplasm), limiting downstream use in drug discovery workflows.
2. The core idea is RL post-training of a pretrained perturbation model (CellFlux, a flow-matching approach) using biologically meaningful, mostly non-differentiable evaluators as reward functions. This explicitly aligns generation with physical and biological constraints rather than with pixel-level objectives alone.
3. The paper designs 7 rewards in 3 categories: biological function (mode-of-action consistency), structural validity (nucleus-in-cytoplasm containment; nuclear roundness), and morphology statistics (nucleus/cytoplasm size and counts), combined as a weighted sum with a KL constraint to stay close to the pretrained model and reduce reward hacking.
4. The biological-function reward uses a pretrained MoA classifier: the reward is the predicted probability of the ground-truth MoA for the applied perturbation, turning “does this look like the right drug class?” into a trainable alignment signal.
5. Structural rewards explicitly enforce cross-channel spatial consistency via segmentation (Cellpose): the nucleus-in-cytoplasm reward penalizes spatial incoherence between the nucleus and cytoplasm channels, while the roundness reward matches MoA-conditioned nuclear shape distributions (mean/variance per MoA).
6. Morphology-statistic rewards match MoA-conditioned population statistics: maximum nucleus size, maximum cytoplasm size, nucleus count, and cytoplasm count are scored by normalized deviation from real-image distributions, encouraging correct scale and density rather than only local texture.
7. Optimization uses an online RL method in the DiffusionNFT style adapted to source-to-target flow matching: generate groups of rollouts per (control image x0, perturbation c), normalize rewards within the group into an “optimality probability,” then apply contrastive updates that increase likelihood of high-reward samples and decrease likelihood of low-reward ones.
8. Results on BBBC021 (98K three-channel 96×96 images; 26 perturbations; 12 MoAs) show consistent improvements over CellFlux and prior baselines (PhenDiff, IMPA) across all reward metrics; MoA reward increases (0.26 to 0.34), nucleus containment improves (0.88 to 0.96), and the combined overall reward flips from negative to positive after RL plus selection.
9. The same reward suite enables test-time scaling: best-of-N sampling selects the highest-reward candidate, yielding monotonic gains as N increases; with N=4, MoA reward rises further (0.34 to 0.56) and overall reward improves substantially, illustrating a compute-quality tradeoff at inference.
10. Ablations show that single-reward RL improves the targeted metric but transfers poorly to the others; the combined multi-reward objective yields balanced improvements across biological, structural, and morphological criteria. Sensitivity analysis on the KL weight indicates a tradeoff between allowing larger model shifts (better structure/morphology) and staying closer to the base model (slightly better biological-function reward).
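A rough sketch of the nucleus-in-cytoplasm idea from item 5, assuming the reward is the fraction of nucleus pixels that fall inside the cytoplasm mask. The function name `containment_reward` and this exact definition are illustrative assumptions, not the paper's formula; the boolean masks are assumed to come from a segmenter such as Cellpose, which is not called here.

```python
import numpy as np

def containment_reward(nucleus_mask, cytoplasm_mask):
    """Fraction of nucleus pixels lying inside the cytoplasm mask (assumed definition)."""
    nucleus = np.asarray(nucleus_mask, dtype=bool)
    cytoplasm = np.asarray(cytoplasm_mask, dtype=bool)
    n_pixels = nucleus.sum()
    if n_pixels == 0:
        # No nucleus detected: return the lowest score (an assumption of this sketch).
        return 0.0
    return float((nucleus & cytoplasm).sum() / n_pixels)

# Toy example: a 2x2 nucleus, half of which sits outside the cytoplasm.
nucleus = np.zeros((4, 4), dtype=bool)
nucleus[1:3, 1:3] = True
cytoplasm = np.zeros((4, 4), dtype=bool)
cytoplasm[:, :2] = True
half_inside = containment_reward(nucleus, cytoplasm)
```

A score of 1.0 means every nucleus pixel is covered by cytoplasm; lower values flag the "nucleus outside the cell" failure mode from item 1.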
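The "normalized deviation" scoring in item 6 and the weighted sum in item 3 could look roughly like this. It is a minimal sketch: the negative absolute z-score, the function names, and the example weights are assumptions chosen for illustration, and the KL constraint is not modeled.

```python
def deviation_reward(value, real_mean, real_std, eps=1e-8):
    """Negative absolute z-score: 0 when the generated statistic matches the
    real-image mean for that MoA, more negative the further it drifts."""
    return -abs(value - real_mean) / (real_std + eps)

def combined_reward(rewards, weights):
    """Weighted sum over named reward terms."""
    return sum(weights[name] * rewards[name] for name in rewards)

# Hypothetical numbers: a generated image has 14 nuclei, while real images
# for this MoA average 10 with a standard deviation of 2.
rewards = {
    "moa": 0.5,                                      # classifier probability
    "nucleus_count": deviation_reward(14, 10.0, 2.0),
}
weights = {"moa": 1.0, "nucleus_count": 0.25}        # illustrative weights
total = combined_reward(rewards, weights)
```

Because deviation terms are at most zero under this definition, a combined score can be negative even when the MoA term is positive.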
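For the group normalization in item 7, one plausible way to turn the rewards of a rollout group into an "optimality probability" in [0, 1] is a clipped z-score. The exact normalization used by the paper or by DiffusionNFT is not given in this summary, so treat the formula and the name `optimality_probabilities` as assumptions.

```python
import numpy as np

def optimality_probabilities(group_rewards, eps=1e-8):
    """Map rewards from one (x0, c) rollout group to values in [0, 1].

    Rewards are z-scored within the group, clipped to [-1, 1], and shifted so
    that an average sample gets 0.5 (assumed scheme, for illustration only).
    """
    r = np.asarray(group_rewards, dtype=float)
    z = (r - r.mean()) / (r.std() + eps)
    return 0.5 + 0.5 * np.clip(z, -1.0, 1.0)

# Four rollouts for the same control image and perturbation.
probs = optimality_probabilities([0.1, 0.2, 0.3, 0.4])
```

Samples above the group mean land above 0.5 and would be pushed up by the contrastive update, while those below 0.5 would be pushed down.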
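Best-of-N selection from item 9 is simple enough to sketch directly. Here `generate` and `score` are hypothetical stand-ins for the generator and the combined reward; the toy versions below just draw and score random numbers.

```python
import random

def best_of_n(generate, score, n, seed=None):
    """Draw n candidates and return the one with the highest score."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [score(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

# Toy stand-ins: a "sample" is a random float and its reward is the value itself.
def toy_generate(rng):
    return rng.random()

def toy_score(sample):
    return sample

_, score_n1 = best_of_n(toy_generate, toy_score, n=1, seed=0)
_, score_n4 = best_of_n(toy_generate, toy_score, n=4, seed=0)
```

With a fixed seed, the N=4 pool contains the N=1 candidate, so the selected score can only stay the same or improve. The cost is N forward passes plus N reward evaluations per output.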
📜Paper:
arxiv.org/abs/2603.21743
#ComputationalBiology #BioimageAnalysis #GenerativeModels #ReinforcementLearning #FlowMatching #VirtualCells #DrugDiscovery #Microscopy #MLforScience