Biology AI Daily

Jun 10

Flexible Kernels for Protein Property Prediction

1. The paper introduces LOCK-GP: Gaussian processes with a new protein sequence kernel that combines evolutionary substitution matrices (e.g., BLOSUM) with an explicit “local linearity” inductive bias to model protein property landscapes from sparse experimental data.
2. Key kernel idea (LOCK: Locally Linear Correlation Kernel): replace one-hot “same/different” comparisons with amino-acid similarity from substitution matrices, and learn landscape-specific Hadamard-power exponents to tune how strongly similarities are amplified/attenuated while preserving kernel validity.
3. A central technical observation: many BLOSUM matrices are not only PSD but also infinitely divisible, so elementwise exponentiation by any positive power preserves PSD. This enables learnable exponents inside the GP kernel without breaking positive semidefiniteness.
4. LOCK is built from (i) an additive “linear” correlation kernel and (ii) a multiplicative “RBF-like” correlation kernel, then combined so predictions are nuanced and non-linear near training data but revert to a robust linear predictor farther away (avoiding both aggressive linear extrapolation and mean-reversion to the prior mean).
5. Benchmarking is extensive: 30 predictors evaluated across 21 protein property datasets (thermostability, binding affinity, fluorescence, capsid viability, etc.) under three regimes: i.i.d. CV, Hamming-distance extrapolation, and an “unseen mutations” OOD regime where test sequences include mutations absent from training.
6. Results highlight data efficiency and uncertainty quality: across datasets and training sizes (e.g., 48–1536 points), LOCK-GP is typically best or near-best on correlation and error metrics, and shows strong calibration via proper scoring rules like CRPS; uncertainty improves notably when local linearity is included.
7. A notable empirical takeaway: a sequence-only LOCK-GP that relies on a small substitution-matrix prior can frequently outperform or match baselines that depend on large foundation models (e.g., ESM-2 embeddings, structure features, ProteinMPNN-derived features), especially in extrapolation and OOD “unseen mutation” settings where high-dimensional embeddings can be fragile.
8. The paper generalizes LOCK to CLOCK (structure-conditioned LOCK): positional structure embeddings from a foundation model are mapped to position-specific amino-acid correlation matrices (parameterized as exp(-||z_a - z_a'||^2)), effectively learning structure-aware substitution behavior that can be used “zero-shot” as a kernel prior and then refined by GP training.
9. Multi-task learning: CLOCK-GP is trained across 371 thermostability landscapes (Tsuboyama et al.), showing that learning a shared, structure-conditioned kernel across landscapes yields strong performance; CLOCK-GP is especially competitive in low-landscape regimes (e.g., training on 10 landscapes), and learned correlations are interpretable (e.g., proline preferences near helix N-termini; arginine favored on surfaces vs cores).
10. Additional demonstrations: LOCK-GP supports GP-based Bayesian optimization via Thompson sampling to control exploration/diversity in design, and extends to binary classification (e.g., quantized fluorescence) with strong accuracy scaling with dataset size.

💻Code: github.com/generatebio/lock_…
📜Paper: arxiv.org/abs/2606.11057

#ComputationalBiology #ProteinEngineering #GaussianProcesses #MachineLearning #Kernels #ProteinDesign #UncertaintyQuantification #MultiTaskLearning #FoundationModels #Bioinformatics
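
To make items 2, 3, and 4 more concrete, here is a minimal NumPy sketch of the Hadamard-power idea. It is not the paper's LOCK kernel. As a stand-in for a BLOSUM-derived similarity matrix it uses a toy matrix S[a, b] = exp(-||z_a - z_b||^2) built from random embeddings (the same functional form item 8 quotes for CLOCK), which is infinitely divisible because S ** t is again a Gaussian kernel matrix. The names `toy_similarity`, `additive_kernel`, and `multiplicative_kernel`, and the simple mean/product forms, are assumptions made for illustration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def toy_similarity(dim=4, seed=0):
    """Toy stand-in for a substitution matrix (NOT real BLOSUM values).

    S[a, b] = exp(-||z_a - z_b||^2) for random embeddings z, so every
    elementwise power S ** t with t > 0 is still a Gaussian kernel matrix.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(len(AMINO_ACIDS), dim))
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist)


def encode(seqs):
    """Map equal-length sequences to an integer array of shape (n, L)."""
    return np.array([[AA_INDEX[aa] for aa in s] for s in seqs])


def site_similarity(seqs_a, seqs_b, S, power):
    """Per-position similarities S[a_i, b_i] ** power, shape (n_a, n_b, L)."""
    xa, xb = encode(seqs_a), encode(seqs_b)
    return S[xa[:, None, :], xb[None, :, :]] ** power


def additive_kernel(seqs_a, seqs_b, S, power):
    """Hypothetical additive ("linear"-style) kernel: mean over positions."""
    return site_similarity(seqs_a, seqs_b, S, power).mean(axis=-1)


def multiplicative_kernel(seqs_a, seqs_b, S, power):
    """Hypothetical multiplicative ("RBF-like") kernel: product over positions."""
    return site_similarity(seqs_a, seqs_b, S, power).prod(axis=-1)


S = toy_similarity()
# The smallest eigenvalue stays non-negative (up to round-off) for each power.
for power in (0.05, 0.5, 1.0, 3.0):
    print(power, np.linalg.eigvalsh(S ** power).min())

seqs = ["ACDE", "ACDF", "WCDE", "WYYY"]
print(multiplicative_kernel(seqs, seqs, S, power=0.5))
```

In a GP, `power` would be a learnable hyperparameter; because the base matrix is infinitely divisible, any positive value keeps both kernels valid, which is the property item 3 relies on.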

Zhengzhong Tu

@_vztu

May 27

🎉 Excited to share our new work accepted to #CVPR2026: “𝗡𝗲𝘅𝘂𝘀𝗙𝗹𝗼𝘄: 𝗨𝗻𝗶𝗳𝘆𝗶𝗻𝗴 𝗗𝗶𝘀𝗽𝗮𝗿𝗮𝘁𝗲 𝗧𝗮𝘀𝗸𝘀 𝘂𝗻𝗱𝗲𝗿 𝗣𝗮𝗿𝘁𝗶𝗮𝗹 𝗦𝘂𝗽𝗲𝗿𝘃𝗶𝘀𝗶𝗼𝗻 𝘃𝗶𝗮 𝗜𝗻𝘃𝗲𝗿𝘁𝗶𝗯𝗹𝗲 𝗙𝗹𝗼𝘄 𝗡𝗲𝘁𝘄𝗼𝗿𝗸𝘀”

In textbooks and benchmarks, datasets are often neatly annotated for every task. In the real world, they rarely are. Data is collected at different times, in different places, and for different purposes. One dataset may contain labels for mapping, another for tracking, another for depth or segmentation. Does that mean fragmented data has to be discarded?

💪 𝗢𝘂𝗿 𝗮𝗻𝘀𝘄𝗲𝗿: 𝗻𝗼. We show that partially supervised, heterogeneous data can still be highly valuable, and in some cases can even outperform fully annotated data. How do we learn across structurally different tasks when labels are only partially available?

💡 𝗢𝘂𝗿 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻: 𝗡𝗲𝘅𝘂𝘀𝗙𝗹𝗼𝘄
NexusFlow is a lightweight, plug-and-play framework that aligns disparate tasks in a shared latent space. What makes it work:
• 🔄 𝗜𝗻𝘃𝗲𝗿𝘁𝗶𝗯𝗹𝗲 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁. Invertible coupling layers map task features into a unified canonical space. Since the mapping is bijective, task information is preserved, helping avoid the representational collapse often seen in vanilla alignment methods.
• 🔌 𝗣𝗹𝘂𝗴-𝗮𝗻𝗱-𝗽𝗹𝗮𝘆 𝗱𝗲𝘀𝗶𝗴𝗻. No need to modify task heads or losses. NexusFlow can be added to BEV-based backbones with a simple alignment loss.
• 📈 𝗦𝗰𝗮𝗹𝗮𝗯𝗹𝗲 𝘁𝗼 𝗺𝘂𝗹𝘁𝗶𝗽𝗹𝗲 𝘁𝗮𝘀𝗸𝘀. The method scales as O(N) with one surrogate branch per task, making extension to 3 tasks straightforward.
• 📐 𝗧𝗵𝗲𝗼𝗿𝗲𝘁𝗶𝗰𝗮𝗹 𝗴𝗿𝗼𝘂𝗻𝗱𝗶𝗻𝗴. Invertibility provides a provable bound that connects the alignment loss to cross-task knowledge transfer.

🏆 𝗥𝗲𝘀𝘂𝗹𝘁𝘀
NexusFlow sets a new state of the art on nuScenes for domain-partitioned autonomous driving, where online map reconstruction and multi-object tracking are supervised in different geographic regions. It also delivers consistent gains across all three NYUv2 tasks: semantic segmentation, depth estimation, and surface normal prediction.

📎 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 𝗽𝗮𝗴𝗲: ark1234.github.io/nexusflow_…

🤝 This work was conducted in collaboration across Worcester Polytechnic Institute, Texas A&M University, Tohoku University, University of Michigan, and Bosch Research. Huge thanks to collaborators: Fangzhou Lin, Yuping Wang, Yuliang Guo, Zixun Huang, Xinyu Huang, Haichong Zhang, Kazunori Yamada, Zhengzhong Tu, Liu Ren, and Ziming Zhang.

#CVPR2026 #ComputerVision #MultiTaskLearning #AI #GenAI #AutonomousDriving #DeepLearning #RepresentationLearning
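
For readers unfamiliar with the "invertible coupling layers" mentioned in the post, here is a minimal NumPy sketch of a generic affine coupling layer in the RealNVP style. It is not NexusFlow's architecture: the class name `AffineCoupling`, the random linear scale/shift maps, and the feature sizes are all assumptions chosen for illustration. It only shows why such a mapping is bijective, so no information in the input features is lost.

```python
import numpy as np


class AffineCoupling:
    """Generic affine coupling layer (illustrative, not NexusFlow's code).

    The first half of the features passes through unchanged and
    parameterizes an elementwise scale and shift for the second half.
    """

    def __init__(self, dim, seed=0):
        if dim % 2 != 0:
            raise ValueError("dim must be even")
        rng = np.random.default_rng(seed)
        half = dim // 2
        # Hypothetical tiny "networks": a single random linear map each.
        self.w_scale = rng.normal(scale=0.1, size=(half, half))
        self.w_shift = rng.normal(scale=0.1, size=(half, half))

    def _scale_shift(self, x1):
        log_scale = np.tanh(x1 @ self.w_scale)  # bounded for numerical stability
        shift = x1 @ self.w_shift
        return log_scale, shift

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=-1)
        log_scale, shift = self._scale_shift(x1)
        return np.concatenate([x1, x2 * np.exp(log_scale) + shift], axis=-1)

    def inverse(self, y):
        y1, y2 = np.split(y, 2, axis=-1)
        # y1 equals x1, so the same scale and shift can be recomputed exactly.
        log_scale, shift = self._scale_shift(y1)
        return np.concatenate([y1, (y2 - shift) * np.exp(-log_scale)], axis=-1)


layer = AffineCoupling(dim=8, seed=1)
features = np.random.default_rng(2).normal(size=(5, 8))
mapped = layer.forward(features)
recovered = layer.inverse(mapped)
print(np.allclose(recovered, features))
```

Because the inverse is available in closed form, an alignment loss applied to `mapped` cannot collapse distinct inputs onto the same point, which is the intuition behind the post's claim about avoiding representational collapse.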
