Maybe the data-efficiency gap is not a scaling problem. Maybe it is an objective problem.

A striking preprint by Daniel J. Korchinski, Alessandro Favero, and Matthieu Wyart offers a sample-complexity theory for this shift: "Learn from your own latents and not from tokens."

The core problem is familiar: modern generative models are extraordinary, but brutally data-hungry. LLMs train on 10¹³–10¹⁴ tokens. Children do not.

So the question is not only "How do we scale models?" It is "What are we asking them to predict?"

Most of modern AI trains on the visible surface: next tokens, masked tokens, pixels, noise. That works. But it may be statistically inefficient for learning hierarchy.

The authors study a tractable hierarchical grammar where visible tokens are generated from a hidden latent tree of depth L, a stylized model for the compositional structure of language and images.

The result reframes the debate:
• Token-level learning requires samples exponential in L to recover the hidden tree.
• Latent prediction recovers it with sample complexity essentially constant in L, up to logarithmic factors.

In plain English:
• Predicting tokens forces the model to infer the hierarchy through the leaves.
• Predicting latents lets the model climb the tree.

Once one abstraction level is learned, it becomes the substrate for learning the next.

This is why data2vec and JEPA-style objectives are so interesting. They do not merely reconstruct the input. They train a network to predict its own latent representation of another view or masked region. The target is no longer the surface. The target is the model’s own emerging abstraction.

The paper validates the theory three ways:
• a hierarchical clustering algorithm
• an end-to-end neural architecture trained by gradient descent
• a sample-complexity analysis of data2vec, showing it implicitly performs hierarchical latent prediction

One implication is provocative: if data2vec already discovers hierarchy implicitly, explicit stacking schemes such as H-JEPA may be partly redundant.

This is not “next-token prediction is dead.” Next-token prediction built the current era. But if the goal is biological-level data efficiency, surface reconstruction may be the expensive path. The strategic frontier may be latent self-prediction: models learning not only from what they see, but from the abstractions they are forming.

Full credit to the authors: Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart.

Paper: Learn from your own latents and not from tokens: A sample-complexity theory
arxiv.org/abs/2605.27734

I’m attaching the first page because the abstract is worth reading closely.

The future of data-efficient AI may not be more tokens. It may be better targets.

#AIResearch #MachineLearning #SelfSupervisedLearning #RepresentationLearning #DataEfficiency #LLM
The data-efficiency gap between machines and children may not be solved by “more tokens.” It may be solved by changing what the model is asked to predict.

A beautiful new paper by Daniel J. Korchinski, Alessandro Favero, and Matthieu Wyart gives a sample-complexity theory for a major alternative to token-level learning: "Learn from your own latents and not from tokens."

The premise is striking. Modern generative models learn by predicting raw surface fragments:
• next tokens
• masked tokens
• pixels
• noise
• patches

This works spectacularly. But it is brutally data-hungry. Biological learners do not see 10¹³–10¹⁴ tokens before acquiring rich language competence.

So perhaps the bottleneck is not only architecture, scale, or optimization. Perhaps it is the prediction target.

Instead of predicting tokens, methods like data2vec and JEPA train networks to predict their own latent representations of related views or masked regions. The model is not asked: “Can you reconstruct the surface?” It is asked: “Can you predict the abstraction your own system would form?” That difference may be enormous.

The authors study a tractable hierarchical grammar that generates visible tokens from hidden latent trees of depth L, a stylized model of the compositional structure of language and images. For this data, supervised learning and token-level self-supervised learning require samples exponential in L to recover the hidden hierarchy. But latent prediction recovers the hierarchy with sample complexity essentially constant in L, up to logarithmic factors.

That is the whole paper in one line: predicting tokens makes hierarchy expensive; predicting latents makes hierarchy recursive.

Why? Because token-level objectives keep forcing supervision through the visible surface. The deeper the hidden structure, the weaker and more indirect the signal becomes. Latent prediction removes that bottleneck. Once one level of abstraction is recovered, the model can use its own learned latents as both context and target for the next level. Every level becomes statistically like the first.

The paper confirms this three ways:
• a hierarchical clustering algorithm
• an end-to-end neural architecture trained by gradient descent
• a sample-complexity analysis of data2vec, showing that it implicitly performs hierarchical latent prediction

The last point is especially interesting. If data2vec already discovers hierarchy implicitly, then explicitly stacking methods like H-JEPA may be partly redundant.

This is not “tokens are dead.” Token prediction remains one of the most productive ideas in AI. But this paper gives a precise reason why token-level learning may be an inefficient path to latent structure.

The deeper lesson: the model should not only learn from the world’s surface. It should learn from the abstractions it is already beginning to form.

Full credit to the authors: Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart.

Paper: Learn from your own latents and not from tokens: A sample-complexity theory
arxiv.org/abs/2605.27734

I’m attaching the first page because the abstract is worth reading closely.

The future of data-efficient AI may not be more reconstruction. It may be recursive self-prediction in latent space.

#AIResearch #MachineLearning #SelfSupervisedLearning #RepresentationLearning #LLM #DataEfficiency
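To make the "predict your own latents" idea concrete, here is a minimal toy sketch of the training signal: a student encodes a masked view and regresses onto a teacher's latent of the full view, with the teacher kept as an exponential moving average (EMA) of the student, as in data2vec. The elementwise `encode` function and all variable names are hypothetical stand-ins, not the paper's architecture, and the sketch does not reproduce the hierarchical grammar or the sample-complexity result.

```python
# Toy sketch of latent self-prediction with an EMA teacher.
# The "encoder" is a hypothetical elementwise scaling, not the paper's network.

def encode(weights, view):
    """Stand-in encoder: scale each input coordinate by a weight."""
    return [w * x for w, x in zip(weights, view)]

def latent_loss(student, teacher, masked_view, full_view):
    """MSE between the student's latent of the masked view and the
    teacher's latent of the full view. The target is a latent, not a token."""
    pred = encode(student, masked_view)
    target = encode(teacher, full_view)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def ema_update(teacher, student, tau=0.99):
    """Teacher weights track the student as an exponential moving average."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

student = [1.0, 2.0, 3.0]
teacher = [1.0, 2.0, 3.0]
full_view = [0.5, -1.0, 2.0]
masked_view = [0.5, 0.0, 2.0]  # second coordinate masked out

loss = latent_loss(student, teacher, masked_view, full_view)
new_teacher = ema_update(teacher, [2.0, 2.0, 3.0], tau=0.9)
print(loss, new_teacher)
```

The only point of the sketch is where the target comes from: the regression target is produced by the model's own (slowly moving) encoder rather than by the raw input.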
FragmentNet: Adaptive graph fragmentation for graph-to-sequence molecular representation learning

1. FragmentNet reframes molecular pretraining around learned, chemically valid fragments (not atoms): it serializes a molecular graph into a fragment sequence, masks an entire fragment, and trains a Transformer to reconstruct that missing substructure (Masked Fragment Modeling, MFM).
2. The core technical piece is an adaptive graph tokenizer that starts from atoms and iteratively merges connected pairs using a corpus-driven score (pair frequency normalized by node frequencies), storing a merge history so granularity can be changed at inference/fine-tune time by choosing how many merges to apply.
3. Unlike rigid rule-based fragmentation (e.g., BRICS) or SMILES subword tokenization, the tokenizer preserves graph connectivity and chemical validity, and explicitly represents cut points via dummy atoms (atomic number 0), so “the same” fragment with different dangling-bond environments becomes distinct tokens.
4. To uniquely index fragments (including stereochemical variants and dangling-bond context), FragmentNet builds its token dictionary using Weisfeiler–Lehman (WL) hashing with atom labels (Z, hybridization, radicals, H count) and bond labels (type, conjugation, stereo, ring membership), avoiding SMILES non-uniqueness.
5. The model is a hybrid graph-to-sequence pipeline: a VQ-VAE encodes discrete atom-level attributes into codebooks, a GCN captures intra-fragment structure, and the two are combined into fragment embeddings that are then processed by a BERT-style Transformer.
6. A key challenge in graph-to-sequence is preserving topology after serialization; FragmentNet adds “chemically aware” spatial positional encodings by summing (i) hop-based global distance summaries, (ii) WL absolute/role encodings, and (iii) Coulomb-matrix-inspired charge/interaction encodings.
7. It also replaces the standard CLS token with a learnable molecular-descriptor vector (computed from RDKit descriptors and refined through attention), aiming to provide a global summary channel alongside fragment-context modeling.
8. Empirically, with MFM pretraining on 2M molecules, fragment-level tokenization (100 merges; ~7 fragments per molecule, ~10 atoms per token on average) beats atom-level tokenization (0 merges) on 5/7 scaffold-split benchmarks (MoleculeNet and Malaria); without pretraining, atom-level often does better, highlighting that granularity interacts strongly with pretraining.
9. Beyond prediction, the learned fragment vocabulary enables a fragment-swapping module for targeted analogue generation: by matching dummy-atom bond environments and sanitizing with RDKit, it can substitute fragments while preserving the core scaffold (demonstrated on ibuprofen, aspirin, diazepam) without expensive substructure search.

📜Paper: arxiv.org/abs/2502.01184

#ComputationalChemistry #Cheminformatics #MolecularML #GraphML #Transformers #SelfSupervisedLearning #DrugDiscovery #RepresentationLearning
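The merge-scoring step of the adaptive tokenizer (point 2) can be sketched on a toy corpus. This is a minimal illustration under stated assumptions: graphs are plain dictionaries of atom labels plus edge lists, and the score is taken to be the pair count divided by the product of the two node counts. The thread only says "pair frequency normalized by node frequencies", so the exact normalization, the function name `best_merge`, and the data layout are all hypothetical.

```python
from collections import Counter

def best_merge(graphs):
    """Pick the connected label pair with the highest normalized frequency.

    graphs: list of (labels, edges), where labels maps node id -> token
    and edges is a list of (u, v) node-id pairs.
    """
    node_freq, pair_freq = Counter(), Counter()
    for labels, edges in graphs:
        node_freq.update(labels.values())
        for u, v in edges:
            pair = tuple(sorted((labels[u], labels[v])))
            pair_freq[pair] += 1
    # Assumed normalization: pair count / (count of left token * count of right token).
    scores = {p: c / (node_freq[p[0]] * node_freq[p[1]]) for p, c in pair_freq.items()}
    best = max(sorted(scores), key=scores.get)
    return best, scores

ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
dimethyl_ether = ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)])
methylamine = ({0: "C", 1: "N"}, [(0, 1)])

best, scores = best_merge([ethanol, ethanol, dimethyl_ether, methylamine])
print(best, scores[best])
```

A full tokenizer would apply the chosen merge to every graph, append it to a merge history, and repeat; replaying only the first k entries of that history is what would let granularity be chosen later.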

🧬 New paper: "MoCL-GAT: Molecular Contrastive Learning with Graph Attention Network for Enhanced Molecular Representation" is out!

One of the core bottlenecks in computational drug discovery is the scarcity of labelled experimental data. You can have millions of compounds, but validated bioactivity labels are expensive and rare. This limits how far supervised models can generalise across chemical space.

Self-supervised learning (SSL) offers a compelling way out: learn rich molecular representations from unlabelled data first, then fine-tune on small labelled sets.

🔍 Most existing SSL methods capture either local structural patterns or global molecular properties, but rarely both at once. MoCL-GAT closes this gap with a dual-objective framework powered by a Graph Attention Network (GAT):
🟣 A local contrastive objective, contrasting augmented K-hop subgraph views to capture fine-grained atomic environments, functional groups, and pharmacophoric patterns
🟠 A global descriptor prediction objective, regressing over 78 curated RDKit physicochemical descriptors (solubility, lipophilicity, SA score…) to encode holistic molecular behaviour

The attention mechanism in GAT proves key here, dynamically weighting atomic neighbours to efficiently serve both learning signals simultaneously.

⚡ Pre-trained on 1.9 million ChEMBL compounds and fine-tuned across diverse MoleculeNet benchmarks, MoCL-GAT achieves strong and competitive results on blood-brain barrier penetration, side effect prediction, solubility, and hydration free energy tasks. Ablation studies confirm that neither objective alone matches the performance of the combined dual approach.

The code is implemented in PyTorch, PyTorch Geometric, and RDKit, and pre-training runs on a single NVIDIA RTX 3090 in ~16 hours.

💚 A huge congratulations to our first author, Alperen Dalkiran @alprndalkiran, who took ownership of the implementation, experiments, and the heavy lifting of bringing this work to publication, executing with great care and technical depth throughout. Well done, Alperen! 🙌

The full team behind MoCL-GAT: Alperen Dalkiran · Ahmet Süreyya Rifaioğlu · Rengül Çetin-Atalay · Aybar C. Acar · Tunca Doğan · M. Volkan Atalay

📄 Open access @BMC_series BMC Bioinformatics:
👉 link.springer.com/article/10…

🔓 Fully open-source code, pre-trained weights, and datasets are all publicly available on Zenodo:
👉 zenodo.org/records/16927286

#DrugDiscovery #GraphNeuralNetworks #Cheminformatics #MolecularRepresentationLearning #ContrastiveLearning #SelfSupervisedLearning #ComputationalBiology #Bioinformatics
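A dual objective of the kind described above (local contrastive term plus global descriptor regression) might be combined roughly as follows. This is only a sketch: the post does not specify the loss forms, so the InfoNCE contrastive term, the mean-squared-error regression term, the weight `alpha`, and the temperature are assumptions, and the vectors are toy values rather than GAT outputs.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive term (assumed InfoNCE): pull the positive view close,
    push negatives away. Uses a max-shift for numerical stability."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def mse(pred, target):
    """Regression term for descriptor prediction (assumed MSE)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dual_loss(anchor, positive, negatives, pred_desc, true_desc, alpha=1.0):
    """Weighted sum of the local and global terms; alpha is hypothetical."""
    return info_nce(anchor, positive, negatives) + alpha * mse(pred_desc, true_desc)

aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
total = dual_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], [1.0, 2.0], [1.0, 4.0])
print(aligned, misaligned, total)
```

The contrastive term is small when the two views of the same subgraph agree and large when a negative is closer than the positive, which is the behaviour the local objective relies on.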
A 37-million-particle dataset from over 250 experiments to accelerate data-driven cryo-EM analysis

1. The paper introduces cryoPANDA (cryo-EM Particles ANnotated DAtaset): 37,623,123 curated experimental particle images from 252 cryo-EM experiments, designed to remove the main bottleneck for particle-level foundation models in cryo-EM: lack of large, diverse, richly annotated real data.
2. Scale and diversity are key: cryoPANDA spans 16 function-based protein classes and broad molecular-weight ranges (mean ~600 kDa; min 21 kDa; max 200,000 kDa), aiming to support models that generalize across targets and imaging conditions rather than being retrained per experiment.
3. Rich per-particle annotations go far beyond picking coordinates, covering acquisition parameters (e.g., voltage, dose, Cs), CTF estimates (defocus U/V, astigmatism angle), 2D classification statistics (class, alignment resolution, ESS, ECA), and 3D reconstruction metadata (Euler angles, translations, alignment error), plus links to EMDB maps and (when available) PDB models.
4. Dataset construction is not a simple scrape: the authors examined 495 EMPIAR entries, used sequence similarity (>30%) to cluster entries and reduce redundancy, then selected up to four representatives per cluster with manual curation for data quality and documentation, yielding 252 final experiments (mostly EMPIAR, plus 5 in-house).
5. A standardized cryoSPARC v4.6 processing pipeline is used to curate particles and attempt reconstructions: CTF estimation (when starting from micrographs), picking (blob picker or author coordinates), multiple rounds of 2D classification/selection with recovery of mistakenly rejected classes, duplicate removal using estimated particle diameter, and typical ab initio refinement steps for 3D maps.
6. Reconstruction quality is validated against published EMDB maps (for cases with reported reconstructions): among 214 experiments with cryoPANDA reconstructions, 75 (35%) achieve better reported resolution than the published map and 139 (65%) are worse; differences are often explained by cryoPANDA using smaller particle subsets, with results becoming broadly comparable when particle fractions match.
7. A major contribution is demonstrating foundation-model readiness: the authors train a DINOv2 ViT-L/16 model from scratch on ~32M particles (215 experiments) and test generalization on 37 held-out experiments (~5M particles), using an experiment-level split to avoid leakage across near-identical acquisition settings or targets.
8. Without task-specific fine-tuning, the pretrained model yields micrograph-level representations that separate particle regions from background via sliding-window feature extraction and PCA-to-RGB visualization, despite the model being trained only on cropped particle images (not full micrographs).
9. The paper also shows a fully unsupervised particle-picking pipeline built on frozen DINOv2 features, evaluated on held-out EMPIAR-10017 with Henderson’s manual annotations: 91.5% recall, 45.5% precision (F1 60.8%). After downstream cryoSPARC cleanup, the picked particles support a 3D reconstruction at 4.38 Å, close to the published 4.20 Å and the cryoPANDA pipeline’s 4.29 Å for the same dataset.
10. Using cryoPANDA’s metadata, linear probes on frozen DINOv2 features can predict multiple particle properties (symmetry, pixel size, molecular weight, max diameter, EMDB resolution, defocus). Cross-experiment performance drops vs in-distribution, and the authors quantify that part of this gap comes from acquisition-parameter entanglement; regressing out acquisition parameters improves OOD accuracy across tasks, illustrating how the dataset enables mechanistic analysis of generalization failures.

💻Code: github.com/azamanos/cryoPAND…
📜Paper: biorxiv.org/content/10.64898…

#cryoEM #StructuralBiology #DeepLearning #FoundationModels #SelfSupervisedLearning #Datasets #Bioinformatics #ComputationalBiology #EMPIAR #EMDB
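The headline numbers in this thread can be cross-checked with a few lines of arithmetic. The sketch below recomputes the F1 score from the reported precision and recall, and checks that the reported counts and percentages are mutually consistent; the helper name `f1_score` is just an illustrative choice, and the inputs are the figures quoted in the thread.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Particle picking on EMPIAR-10017: 45.5% precision, 91.5% recall.
f1 = f1_score(0.455, 0.915)
f1_percent = round(100 * f1, 1)

# Reconstructions: 75 better and 139 worse out of 214 experiments.
better_percent = round(100 * 75 / 214)
worse_percent = round(100 * 139 / 214)

# Train/held-out experiment split: 215 + 37 experiments.
total_experiments = 215 + 37

print(f1_percent, better_percent, worse_percent, total_experiments)
```

The low precision alongside high recall is why the downstream cryoSPARC cleanup step matters: the picker keeps nearly all true particles but also many false positives.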
The broader message is that scientific foundation models may need the right **pre-training space**, not just more data. By lifting geometry into a space that reflects downstream physical structure, GeoPT offers a scalable path toward physics-aware world models. Huge thanks to co-authors Haixu Wu @Haixu_Wu_1998, Zongyi Li @zongyili_nyu, Zhiyang Dou @frankzydou, Mingsheng Long, Kaiming He, and Wojciech Matusik @wojmatusik. (6/6) Paper: arxiv.org/abs/2602.20399 Workshop: fm-science.github.io/ #GeoPT #NeuralSimulation #PhysicsSimulation #ScientificML #FoundationModels #AI4Science #Pretraining #SelfSupervisedLearning #NeuralOperators #CFD #DigitalTwins
Meta-encoder: a unified integration framework for multiple pathological foundation models in cancer detection

1. The paper introduces Meta-Encoder, a plug-and-play feature-fusion framework that integrates representations from multiple pathology foundation models (patch-level: CHIEF, GigaPath, UNI; WSI-level: TITAN, PRISM) without re-pretraining or access to the original pretraining data.
2. Core motivation: different pathology foundation models excel on different downstream tasks, but privacy constraints and heterogeneous architectures make centralized retraining impractical; Meta-Encoder instead fuses frozen/available embeddings during downstream fine-tuning to reduce model-selection burden and improve robustness.
3. Four fusion strategies are benchmarked: (i) raw feature concatenation (no normalization/alignment), (ii) concatenation + self-attention (as a lightweight non-linear “soft-gating” recalibration), (iii) cross-attention between model embeddings, and (iv) contrastive-loss regularization to align multi-model views during supervised fine-tuning.
4. Key empirical pattern across many oncology tasks: for low-complexity, single-output tasks (tumor subtyping, survival risk scoring), Meta-Encoder usually matches the best single model; concatenation is often sufficient and acts as a practical “safe default” when the best encoder is unknown.
5. For WSI-level subtyping, self-attention can slightly reduce AUC but improves probability calibration (lower expected calibration error), suggesting a tradeoff where fused models may provide more reliable confidence estimates even when ranking metrics change minimally.
6. For survival prediction (TCGA-BRCA), Meta-Encoder matches the strongest single models while improving stability of risk stratification across repeated CV splits (e.g., higher fraction of significant log-rank separations vs single encoders), reinforcing its role as a robust alternative to manual encoder selection.
7. The strongest gains appear in structured, high-dimensional molecular prediction: multi-label biomarkers (e.g., RAS/BRAF/MSI), multiplex protein quantification (15 markers, ORION-CRC), and spatial/bulk gene expression prediction. Here, attention-based fusion (especially self-attention) is consistently favored for performance-efficiency balance.
8. External generalization highlight (CRC biomarkers): trained on TCGA-CRC and tested on the independent SurGen-CRC cohort, self-attention fusion improves mean AUC (e.g., from 0.6560 with best single model PRISM to 0.7367) and markedly boosts sensitivity at a fixed 90% specificity (35.95% to 60.81%), emphasizing robustness under domain shift.
9. Mechanistic interpretability: SHAP analyses show Meta-Encoder dynamically reweights which encoder contributes most depending on the molecular target (e.g., UNI dominates many proteins, while GigaPath dominates others; PRISM often dominates many genes but TITAN adds complementary signal for specific targets), supporting the “complementary strengths” hypothesis rather than mere parameter scaling.
10. Practical deployment guidance from compute profiling: concatenation has minimal overhead; self-/cross-attention remain lightweight (controlled memory/time overhead), while contrastive-loss is often prohibitively expensive (large increases in GPU memory and training time). The authors recommend concatenation for simpler tasks and self-attention for complex/high-dimensional outputs.

📜Paper: doi.org/10.1038/s41467-026-7…

#ComputationalPathology #DigitalPathology #FoundationModels #SelfSupervisedLearning #CancerDetection #PrecisionOncology #MIL #SpatialTranscriptomics #BiomarkerPrediction #WSI
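Point 8 reports "sensitivity at a fixed 90% specificity". For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute it: choose the score threshold so that at least 90% of negatives fall at or below it, then measure the fraction of positives above it. The function name, the thresholding convention, and the toy scores are assumptions for illustration; the paper's exact procedure is not described in the thread.

```python
import math

def sensitivity_at_specificity(scores, labels, target_spec=0.90):
    """Return (sensitivity, specificity) at the lowest threshold that
    keeps specificity >= target_spec. Predict positive when score > threshold."""
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that must be classified as negative (small epsilon
    # guards against floating-point overshoot in the product).
    k = math.ceil(target_spec * len(neg) - 1e-9)
    threshold = neg[k - 1]
    specificity = sum(s <= threshold for s in neg) / len(neg)
    sensitivity = sum(s > threshold for s in pos) / len(pos)
    return sensitivity, specificity

neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
pos_scores = [0.30, 0.50, 0.60, 0.95]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

sens, spec = sensitivity_at_specificity(scores, labels)
print(sens, spec)
```

Fixing specificity makes models comparable at the same false-positive budget, which is why a jump from 35.95% to 60.81% sensitivity is a meaningful gain even though AUC moves less dramatically.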
From Nucleotides to Semantics: Genomic Representation Learning via Joint-Embedding Predictive Architecture

1. GenoJEPA reframes genomic pretraining away from nucleotide-level reconstruction (MLM/NTP) and toward latent-space semantic alignment, motivated by the idea that DNA lacks explicit “word boundaries” and contains substantial evolutionary noise.
2. The core pretraining objective is a JEPA-style multi-view alignment (adapted from LeJEPA): multiple local/global crops of the same sequence are embedded and aligned to the mean global-view representation, using an invariance loss plus SIGReg to prevent representation collapse.
3. A key design choice is continuous patching for DNA: sequences are split into non-overlapping nucleotide patches (patch size 16), embedded, flattened, and linearly projected, compressing effective sequence length and avoiding BPE/k-mer vocabulary inflation and mutation sensitivity.
4. The backbone is a ModernBERT-style bidirectional Transformer (RoPE, pre-norm, largely bias-free). Two scales are reported: GenoJEPA-T (6M params) and GenoJEPA-B (52M params), targeting practical accessibility.
5. Pretraining uses a large multi-species corpus (850 representative species; ~193B nucleotides after filtering). Sequences are segmented into 4096 bp windows with 512 bp overlap; fragments with >15% unknown bases are removed.
6. The most application-oriented result is strong frozen-feature performance: across 55 downstream tasks, a frozen GenoJEPA encoder plus simple GPU-free logistic regression probing achieves top overall performance (MCC), often surpassing much larger baselines.
7. Under probing, GenoJEPA-B wins the most tasks in pairwise comparisons (Wilcoxon signed-rank at p=0.05) against HyenaDNA, CaduceusPh, GROVER, DNABERT-2, and NT-v2; GenoJEPA-T remains competitive with models up to ~100x larger.
8. Under full finetuning, performance gaps narrow but GenoJEPA-B still ranks best on average; notably it improves average finetuning performance versus NT-v2 despite using ~10x fewer parameters (both pretrained on the same 850-species corpus).
9. Efficiency analysis (runtime/memory vs nucleotide length) shows GenoJEPA’s patching yields stable memory and favorable speed across many practical lengths; some “sub-quadratic” long-sequence baselines do not realize their theoretical advantages in these controlled measurements until extremely long contexts.
10. Few-shot probing (10%–50% labeled data) indicates strong data efficiency: GenoJEPA maintains high MCC with limited labels, and its embeddings work well with simple linear decision boundaries; average pooling tends to be the most stable aggregation choice across tasks.

📜Paper: biorxiv.org/content/10.64898…

#ComputationalBiology #Genomics #SelfSupervisedLearning #RepresentationLearning #Transformers #FoundationModels #Bioinformatics #MachineLearning
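The preprocessing in points 3 and 5 (4096 bp windows with 512 bp overlap, a 15% unknown-base filter, and non-overlapping patches of 16 nucleotides) can be sketched as below. This is an illustrative reading of the summary, not the authors' code: the function names are hypothetical, unknown bases are assumed to be encoded as `N`, and incomplete trailing windows or patches are assumed to be dropped.

```python
def windows(seq, size=4096, overlap=512, max_unknown=0.15):
    """Slide a fixed-size window with the given overlap and keep only
    full-length windows whose fraction of 'N' bases is at most max_unknown."""
    step = size - overlap
    kept = []
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        window = seq[start:start + size]
        if len(window) == size and window.count("N") / size <= max_unknown:
            kept.append(window)
    return kept

def patches(window, patch_size=16):
    """Split a window into non-overlapping patches (tail shorter than a patch is dropped)."""
    return [window[i:i + patch_size]
            for i in range(0, len(window) - patch_size + 1, patch_size)]

# Small-scale checks with toy sizes.
overlapping = windows("ACGT" * 5, size=8, overlap=2)
filtered = windows("NNNNACGTACGTACGT", size=8, overlap=0)
n_patches = len(patches("A" * 4096))
print(overlapping, filtered, n_patches)
```

With the reported settings, each 4096 bp window becomes 256 patches, which is the sequence-length compression the thread refers to.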
🌾🤖 PhD Opportunity | Foundation Models for Agricultural Sciences – Wageningen University & Research 🇳🇱 A fully funded PhD position is available in the Artificial Intelligence group at Wageningen University & Research — the world's leading university for life sciences — to develop cutting-edge AI foundation models for agricultural applications, as part of the EU-funded AgriscienceFM project. 🔬 What you'll research: Current AI models frequently fail to generalise in agricultural settings. You'll investigate how to develop, design, and evaluate domain-specific foundation models that learn from multi-modal heterogeneous data — including text, location, and imagery — for applications such as: • Crop type classification & yield forecasting • Field boundary delineation & crop disease detection • Earth observation, climate modelling & phenotyping Your research will build on self-supervised, contrastive, physics-informed and knowledge-guided machine learning, designing multi-modal architectures for image and time series data. Large-scale model training will be conducted on HPC infrastructure. 📋 Requirements: • MSc in AI, computer science, engineering, or related field • Demonstrated experience in applied machine learning (remote sensing or agricultural applications preferred) • Proficient in Python; experience with PyTorch, Scikit-Learn or similar frameworks • Strong scientific writing skills (C1 English level) 💰 Salary: €3,059 – €3,881/month (fully funded, 4-year contract) 📍 Location: Wageningen, Netherlands (greenest campus in the Netherlands) 📅 Application deadline: 4 May 2026 🗓️ First interviews: 15 May 2026 ⚠️ Apply with CV, motivation letter & writing sample (max 3 pages each) — via WUR website only 👨‍🔬 Supervisors: • Prof. Ioannis Athanasiadis (PI) → ioannis.athanasiadis@wur.nl • Prof. Ricardo Torres & Dr. 
Taniya Kapoor (co-supervisors) 📧 Recruitment enquiries: Noorien Abbas → noorien.abbas@wur.nl 🔗 Full details & apply: phdscanner.com/opportunities… ♻️ Share with anyone interested in AI, machine learning & sustainable agriculture! #PhD #PhDOpportunities #ArtificialIntelligence #FoundationModels #MachineLearning #DeepLearning #RemoteSensing #Agriculture #FoodSecurity #SelfSupervisedLearning #WageningenUniversity #Netherlands #AgriscienceFM #DoctoralResearch #AcademicJobs #ResearchJobs #EUFunded
Vinith Kishore et al.: ICECREAM: high-fidelity equivariant cryo-electron tomography #CryogenicElectronTomography #SelfSupervisedLearning #MachineLearning @insadelyon... #IUCr journals.iucr.org/paper?S205…

SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction

1. SMILES-Mamba is a two-stage foundation model for small-molecule ADMET prediction that first learns from large-scale unlabeled SMILES (self-supervised next-token prediction) and then fine-tunes on small labeled ADMET datasets, aiming to reduce reliance on expensive wet-lab labels while improving generalization.
2. The key design choice is using Mamba (structured state space sequence modeling) as the SMILES backbone, motivated by efficiency on long sequences and strong capacity to capture both local and long-range dependencies in tokenized chemical strings compared with attention-heavy sequence models.
3. Pretraining is property-agnostic: the model is trained autoregressively on a 250K sampled subset of ZINC SMILES (no ADMET labels), simultaneously establishing a SMILES token vocabulary (atoms, bonds, ring indices, brackets, etc.) and learning transferable chemical sequence representations.
4. Fine-tuning is property-specific: the pretrained SMILES-Mamba is adapted to each downstream endpoint (classification or regression), using scaffold splits (train/val/test = 7/1/2) to better mimic real discovery where test molecules have unseen scaffolds.
5. The evaluation spans 22 ADMET tasks across Absorption, Distribution, Metabolism, Excretion, and Toxicity, including endpoints such as Caco2 permeability, HIA, Pgp inhibition, AqSol solubility, BBB penetration, PPBR, VDss, multiple CYP inhibition/substrate tasks, half-life, clearance, hERG, AMES, DILI, and LD50.
6. Against common baselines (Morgan fingerprints + MLP, SMILES + 1D-CNN, GCN, and self-supervised NeuralFP), SMILES-Mamba reports the strongest overall results: best in 14/22 tasks and top-2 in 17/22, suggesting that sequence pretraining on unlabeled SMILES can be highly competitive for ADMET.
7. Notable wins include large gains on metabolism tasks (e.g., PR-AUC improvements for CYP inhibition/substrate prediction) and strong performance on clinically relevant safety endpoints like DILI (ROC-AUC 0.928), while results also show that no single representation dominates every endpoint.
8. The paper highlights a practical takeaway: different model families capture complementary signals (graph models emphasize local substructures; SMILES sequence models capture broader string-level patterns), motivating future hybrid/ensemble feature integration for further ADMET gains.
9. Implementation details: PyTorch-based training on an RTX 3090; pretraining up to 100 epochs and fine-tuning up to 50 with early stopping; 8-layer model with hidden size 300; Adam optimizer with 1e-3 learning rate (the paper notes “attention heads,” though the backbone is Mamba rather than a standard Transformer).

📜Paper: arxiv.org/abs/2408.05696

#ComputationalBiology #Cheminformatics #DrugDiscovery #ADMET #FoundationModels #SelfSupervisedLearning #Mamba #SMILES #MachineLearning #Bioinformatics #ToxicityPrediction #Pharmacokinetics
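The scaffold split mentioned in point 4 (train/val/test = 7/1/2) can be illustrated with a small grouping routine. This sketch assumes scaffold identifiers have already been computed for each molecule (for example with a cheminformatics toolkit) and uses a common greedy recipe that assigns the largest scaffold groups first; the function name and the assignment rule are assumptions, not necessarily what the paper used.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.7, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test so that no scaffold
    is shared between splits. `scaffolds[i]` is the scaffold key of molecule i."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first molecule index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cap = round(frac_train * n)
    valid_cap = round(frac_valid * n)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train += group
        elif len(valid) + len(group) <= valid_cap:
            valid += group
        else:
            test += group
    return train, valid, test

# Hypothetical scaffold keys for 10 molecules.
keys = ["a"] * 4 + ["b"] * 3 + ["c"] * 1 + ["d"] * 2
train, valid, test = scaffold_split(keys)
print(train, valid, test)
```

Because entire groups move together, every test molecule has a scaffold the model never saw during training, which is what makes this split harder than a random one.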
Self-supervised learning for a gene program-centric view of cell states

1. Tripso is a self-supervised transformer framework that represents each cell with multiple gene program (GP)-specific embeddings (rather than a single entangled cell embedding), enabling program-resolved comparisons across development, disease, and experimental systems.
2. Architecture in brief: a gene encoder learns contextualized gene embeddings within each cell; genes are routed into dedicated GP-specific transformer blocks (each with a CLS token summarizing the program); a global cell block then attends over GP embeddings and is trained to reconstruct counts with a negative binomial loss.
3. Beyond curated programs, Tripso includes a data-driven GP discovery mode: attention patterns among genes are used to (i) rank genes relevant to specific states and (ii) cluster genes with similar attention profiles into novel, context-specific programs.
4. Interpretability is built in at two levels: gene-to-GP importance via cosine similarity between gene embeddings and the GP CLS token; and GP-to-cell importance via systematic ablation of each GP embedding and measuring the induced change in the cell representation.
5. Benchmarks on a large Perturb-seq resource (623k cells; TNFα/TGFβ stimulation + 98 perturbations) show GP embeddings that better separate pathway stimulations and stronger genetic perturbations than Spectra (NMF) and Expimap (interpretable VAE), and outperform non-ML baselines (gene-set scoring; concatenated expression of GP genes). Tripso also improves robustness to batch effects compared with raw expression in GP space.
6. In human hematopoiesis across the lifespan (~499k in vivo cells from 98 donors + in vitro corpora), Tripso recovers expected lineage programs (e.g., GATA1 in erythropoiesis; RUNX1 in myeloid/megakaryocyte differentiation) and exposes age-specific GP shifts, including elevated pediatric JAK-STAT importance in HSC/MPP populations with gene-level signals enriched for type I interferon response.
7. Tripso resolves developmental changes in early B-lineage states specifically within the IKZF1 GP embedding: Milo differential abundance in IKZF1 space separates prenatal vs postnatal pro-B neighborhoods, while an unrelated control GP (WNT) does not. Gene-level importance suggests a shift from prenatal proliferative/IL7R-linked programs toward postnatal pre-BCR diversification (e.g., IGLL1/VPREB1, DNTT).
8. For in vivo vs in vitro mapping, Tripso supports GP-anchored alignment using unbalanced optimal transport (e.g., in GATA1 GP space it recapitulates a truncated in vitro erythroid trajectory missing terminal erythroblasts without using prior knowledge of the protocol).
9. Tripso enables actionable perturbation prioritization for HSC culture: focusing on PI3K as an in vivo HSC-distinctive GP, distributional comparisons in PI3K GP space (Sinkhorn divergence) indicate 3a culture is closest to adult BM LT-HSCs. Gene-level importance within PI3K nominates ER translocon components (SSR1; and SEC61G in a related GP) as higher in less stem-like states.
10. Experimental validation: inhibiting the SEC61 translocon (SEC61-IN-1) increases the frequency of immunophenotypic HSCs (CD34+ CD45RA− CD90+ EPCR+) in UM171 and SR-1 cultures (but not in 3a, consistent with the prioritization setup), illustrating how GP-resolved signals can identify candidates that would be hard to rank by small-effect differential expression alone.
11. In inflammatory skin (1.7M cells across 338 biopsies, 14 diseases), Tripso GP discovery (restricted to a spatial panel for direct validation) yields programs with limited one-to-one overlap with PROGENy/MSigDB, capturing novel gene combinations. A lymphoid program (GP23) shows an atopic-dermatitis-selective profile, elevated in IL13+ TRM cells, and is enriched for inflammatory signaling, metabolic adaptation, and trafficking/turnover genes not well covered by canonical immune annotations.
12. Spatial validation with matched Xenium transcriptomics and spatial proteomics links GP23-high regions to discrete immune-dense niches adjacent to sebaceous glands and inflamed epidermis, frequently co-localizing with high CD45 protein signal and proximity to T cell aggregates; GP23 remains elevated in relapsed AD after treatment withdrawal, consistent with niche-associated TRM persistence.

📜Paper: biorxiv.org/content/10.64898…

#SingleCell #scRNAseq #Transformers #SelfSupervisedLearning #GenePrograms #Interpretability #Hematopoiesis #StemCells #SpatialTranscriptomics #AtopicDermatitis
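The gene-to-GP importance score in point 4 is described as a cosine similarity between each gene embedding and the GP's CLS token. A minimal sketch of that ranking step is shown below; the two-dimensional embeddings and their values are made-up toy numbers chosen only to illustrate the computation, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_genes(gene_embeddings, gp_cls):
    """Rank genes by similarity of their embedding to the GP CLS embedding."""
    scored = [(gene, cosine(vec, gp_cls)) for gene, vec in gene_embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy embeddings (illustrative values only, not model outputs).
gene_embeddings = {
    "GATA1": [1.0, 0.0],
    "KLF1": [1.0, 1.0],
    "CD19": [0.0, 1.0],
}
gp_cls = [1.0, 0.0]

ranking = rank_genes(gene_embeddings, gp_cls)
print(ranking)
```

The complementary GP-to-cell score in the same point would instead remove one GP embedding at a time and measure how much the cell representation changes, which needs the trained model and is not sketched here.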
Multi-view contrastive learning boosts cotton boll detection by 14% with minimal labels—accelerating field robot phenotyping! #AgRobotics #SelfSupervisedLearning #PrecisionAg Details: doi.org/10.1016/j.plaphe.202…
By leveraging self-supervised learning, companies can minimize the amount of manually labeled data required, thereby enhancing the efficiency of the AI model training process. #SelfSupervisedLearning #AITraining #DataScience
🌍 No labeled/registered images? No problem! 🔥 S3FCD revolutionizes remote sensing change detection with single-temporal self-supervised learning: no expert annotations needed! 🚀 1.4k views can’t be wrong. #RemoteSensing #AIInnovation #ChangeDetection #SelfSupervisedLearning #TechBreakthrough Link: doi.org/10.1080/10095020.202…
🚀 Our paper at ICLR26 shows that powerful priors can be learned entirely from object instances (no labels, no language, no manual annotations) -> test-time iterative inference and true OOD generalization. Read here: arxiv.org/pdf/2410.03858 #ICLR2026 #SelfSupervisedLearning #AI
🤖 A new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe #AI @amilabs #AIIntelligence #SelfSupervisedLearning
Advanced Machine Intelligence (AMI) is building a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe. We’ve raised a $1.03B (~€890M) round from global investors who believe in our vision of universally intelligent systems centered on world models. This round is co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, along with other investors and angels across the world. We are a growing team of researchers and builders, operating in Paris, New York, Montreal and Singapore from day one. Read more: amilabs.xyz/ AMI - Real world. Real intelligence.
Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles
1. Introducing RigidSSL, a two-phase geometric pretraining framework that front-loads protein geometry learning before generative finetuning, addressing the challenge of jointly learning geometry and generation in protein design.
2. Phase I (RigidSSL-Perturb) learns geometric priors from 432K AlphaFold structures using simulated SE(3) perturbations, treating each residue as a rigid body with independent translational and rotational noise.
3. Phase II (RigidSSL-MD) refines representations on 1.3K molecular dynamics trajectories to capture physically realistic conformational transitions, bridging the gap between static structures and dynamic flexibility.
4. The method employs a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics via LERP in R^3 and SLERP in SO(3), maximizing mutual information between paired conformations.
5. RigidSSL-Perturb improves designability by up to 43% in unconditional generation while maintaining novelty and diversity, and achieves 5.8% higher success rate in zero-shot motif scaffolding compared to baselines.
6. RigidSSL-MD captures more biophysically realistic conformational ensembles in GPCR modeling, outperforming on 7 out of 9 evaluation metrics including weak contacts and exposed residue predictions.
7. The framework enables generation of ultra-long proteins (700-800 residues) with superior stereochemical quality, achieving the lowest Clashscore and MolProbity scores among all pretraining methods.
8. Key innovation: explicit rigidity constraints through residue-level SE(3) frames reduce degrees of freedom and enforce physical plausibility, unlike prior methods relying on local non-rigid atomic representations.
💻Code: github.com/this 📜Paper: arxiv.org/abs/2603.02406 #proteindesign #geometricdeeplearning #selfsupervisedlearning #alphafold #moleculardynamics #flowmatching #computationalbiology #structuralbiology #machinelearning #iclr2026
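The LERP/SLERP interpolation named in point 4 can be sketched in a few lines of NumPy. This is a generic illustration under stated assumptions, not the RigidSSL code: each residue frame is assumed to be a pair of a unit quaternion in (w, x, y, z) order and a translation vector, and the names `lerp`, `slerp`, and `interpolate_frame` are hypothetical.

```python
import numpy as np

def lerp(t0, t1, s):
    """Linear interpolation between two translation vectors in R^3."""
    t0, t1 = np.asarray(t0, float), np.asarray(t1, float)
    return (1.0 - s) * t0 + s * t1

def slerp(q0, q1, s):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, float) / np.linalg.norm(q0)
    q1 = np.asarray(q1, float) / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1, dot = -q1, -dot
    if dot > 1.0 - 1e-8:
        # Nearly identical rotations: normalized linear interpolation is stable.
        q = (1.0 - s) * q0 + s * q1
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)  # angle between the two quaternions on the 3-sphere
    return (np.sin((1.0 - s) * theta) * q0 + np.sin(s * theta) * q1) / np.sin(theta)

def interpolate_frame(frame0, frame1, s):
    """Interpolate a rigid frame (quaternion, translation) at time s in [0, 1]."""
    (q0, t0), (q1, t1) = frame0, frame1
    return slerp(q0, q1, s), lerp(t0, t1, s)
```

Calling `interpolate_frame(frame0, frame1, s)` at `s = 0` returns the first frame and at `s = 1` the second. Handling rotation and translation separately keeps every intermediate frame a valid rigid-body transform, which is the property a flow matching path over residue frames would need.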