Annals of Computer Science and Information Systems


Annals of Computer Science and Information Systems @annals_csis

Apr 23

Exploring the performance of #GBDT under a #FSL schema in order to provide strong baselines: “Gradient Boosting Trees and #LargeLanguageModels for #TabularData Few-Shot Learning” by Carlos Huertas. ACSIS Vol. 41 p. 53–59; tinyurl.com/2h84f7fz


Biology AI Daily @BiologyAIDaily

Apr 21

Tabular foundation models for in-context prediction of molecular properties
1. The paper evaluates a training-free workflow for molecular property prediction: compute fixed molecular features/embeddings, then use tabular foundation models (TFMs) for in-context learning (no task-specific fine-tuning).
2. Key result on public low-to-medium data benchmarks (58 tasks total from Polaris and MoleculeACE): TabPFN with frozen CheMeleon embeddings (TabPFN-CheMeleonFP) achieves 50/58 wins (86.2% win rate) with average rank 4.52, outperforming classical ML baselines and several fine-tuned deep molecular models under matched splits/metrics.
3. On the activity-cliff focused MoleculeACE suite (30 tasks), TabPFN-CheMeleonFP is best or statistically tied for best on all 30 tasks (100% win rate; average rank 2.10). This suggests TFMs can be particularly strong in difficult generalization regimes when paired with the right representation.
4. The study’s framing is notable: TFMs (TabPFN, TabICL) are pretrained only on synthetic tabular tasks (SCM/function-sampled), yet transfer effectively to chemistry once molecules are converted into tabular vectors (descriptors or frozen foundation-model embeddings).
5. Representation choice is a major driver of performance (contrary to some prior claims of representation invariance for TabPFN in drug discovery). CheMeleon embeddings and 2D descriptors (RDKit2d, Mordred) are consistently strong; Morgan fingerprints are substantially weaker across many tasks.
6. Descriptor-based alternatives remain competitive: TabPFN-RDKit2d and TabPFN-Mordred deliver strong aggregate results (e.g., 56.9% and 67.2% win rates respectively across the 58 tasks), offering practical options when foundation-model embeddings are unavailable or costly.
7. Compute efficiency is a central advantage: in a runtime case study vs fine-tuned CheMeleon, TabPFN-CheMeleonFP is faster on both CPU and GPU, with speedups up to 27× (CPU) and 46× (GPU), while also improving accuracy.
8. Beyond pharma benchmarks, the paper tests 11 chemical engineering datasets (fuels, polymer properties, polymer–solvent interactions). TFM pipelines (especially with Mordred or RDKit2d) match or exceed strong literature baselines on multiple targets, and remain competitive even when specialized models lead (e.g., PolySolv).
9. Limitations and open directions: scaling to larger datasets can become memory/disk constrained for high-dimensional features; most experiments are single-molecule (mixtures/multi-component systems may be harder); TFMs are single-task by design, motivating future in-context multitask TFMs and uncertainty-aware workflows (active learning/Bayesian optimization).
💻 Code: git.rwth-aachen.de/avt-svt/p…
📜 Paper: arxiv.org/abs/2604.16123
#ComputationalChemistry #Cheminformatics #MolecularML #FoundationModels #InContextLearning #TabPFN #TabularData #DrugDiscovery #ChemicalEngineering #MaterialsInformatics
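The win-rate and average-rank figures quoted in the post above can be reproduced with simple arithmetic. The sketch below is only an illustration: `win_rate`, `ranks_for_task`, and `summarize` are hypothetical helper names, the toy error values are invented, and a "win" here means strictly lowest error, whereas the post counts "best or statistically tied for best".

```python
from statistics import mean


def win_rate(wins: int, tasks: int) -> float:
    """Percentage of tasks on which a method counts as a winner."""
    return 100.0 * wins / tasks


def ranks_for_task(errors: dict) -> dict:
    """Rank methods on one task by error (1 = best). Ties are not handled."""
    ordered = sorted(errors, key=errors.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


def summarize(per_task_errors: list, method: str) -> tuple:
    """Return (win rate in %, average rank) for one method across tasks."""
    ranks = [ranks_for_task(errors)[method] for errors in per_task_errors]
    wins = sum(1 for r in ranks if r == 1)
    return win_rate(wins, len(ranks)), mean(ranks)


# Figures quoted in the post: 50 of 58 tasks, and 30 of 30 tasks.
print(f"{win_rate(50, 58):.1f}%")
print(f"{win_rate(30, 30):.1f}%")

# Hypothetical per-task errors (lower is better), for illustration only.
toy_errors = [
    {"tfm": 0.40, "gbdt": 0.45, "gnn": 0.50},
    {"tfm": 0.30, "gbdt": 0.28, "gnn": 0.35},
    {"tfm": 0.20, "gbdt": 0.25, "gnn": 0.22},
]
rate, avg_rank = summarize(toy_errors, "tfm")
print(f"toy win rate {rate:.1f}%, toy average rank {avg_rank:.2f}")
```

Win rate and average rank answer different questions: a method can win most tasks yet rank poorly on the rest, which is why the post reports both.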

H2O.ai

@h2oai

Apr 16

On TALENT, an independent benchmark of 300 classification and regression tasks, TabH2O v1 outranked tuned CatBoost and LightGBM across the board, with zero hyperparameter tuning. Read the blog to learn more: h2o.ai/blog/2026/introducing… #AI #MachineLearning #TabularData #FoundationModels

Kabir Oberai

@kabiroberai

Apr 13

x.com/i/article/204375245148…

Valeriy M., PhD, MBA, CQF

@predict_addict

Feb 18

Tabular data is where the money 💰 is.
✅ Banks don’t run on poems.
✅ Insurance doesn’t price with selfies.
✅ Hospitals don’t triage with cat pictures.
They run on tables.
And yet the AI hype cycle spent years pretending “foundation models” = text + images. Meanwhile tabular ML was stuck in XGBoost worship + leaderboard cosplay. Until now.
A new species is here: tabular foundation models. In-context learning for structured data. TabPFN. TabICL. And the wave that follows.
But here’s the problem: 🚨 Most of the “wow SOTA” claims are built on sand. Because:
❌ one split ≠ evaluation
❌ weak CV = free performance
❌ ensembling changes rankings
❌ leakage is everywhere
❌ “beats GBDTs” often means “beat a bad baseline”
So I’m writing the book I wish existed:
📘 Foundation Models for Tabular Data: In-Context Learning, Benchmarking, and Advanced ML
❌ It’s not a tutorial. ✅ It’s a rigor upgrade.
You’ll learn:
✅ what a “tabular foundation model” actually is (and what’s marketing)
✅ when these models genuinely beat gradient boosting
✅ how to benchmark without fooling yourself
✅ calibration & reliability (the part that matters in risk)
✅ decision frameworks you can defend to a skeptical team lead
Early access preorder is live. Price will climb past $100 as content lands.
Core: 👉 valeman.gumroad.com/l/tabula…
Pro 🔥: 👉 valeman.gumroad.com/l/tabula…
If you work with tabular data and you’re tired of leaderboard theatre, welcome.
#tabulardata #machinelearning #datascience

Valeriy M., PhD, MBA, CQF

@predict_addict

Feb 7

‼️🚨 Tabular LLMs don’t “generalize”; they game the eval. A new arXiv paper (2602.04031) re-checks the claims and the headline result collapses: Compare to basic baselines and much of the gain vanishes. “Wins” are driven by task formatting (quartile/binning) more than by learning tabular structure. A plain instruction-tuned model (no tabular training) gets ~90% of the reported classification performance. That’s template familiarity, not intelligence. Contamination/leakage is a real risk, meaning some “generalization” can be memorization in disguise. Call it what it is: benchmark theater. If your tabular “foundation model” can’t reliably beat strong non-LLM baselines under clean splits and proper controls, it’s not a breakthrough, it’s a demo. #MachineLearning #TabularData #ModelEvaluation

Valeriy M., PhD, MBA, CQF

@predict_addict

Feb 4

Production latency isn’t a “nice to have” – it’s a hard constraint. That’s where CatBoost’s symmetric (oblivious) trees quietly change the game.
Because every level of a CatBoost tree uses the same split, you get:
* Perfectly balanced trees
* Very small, cache-friendly models
* Branchless, vectorized, highly parallelizable inference
In practice, this means CatBoost models are often 2–15× faster at inference than equivalent irregular trees.
When you’re serving tens of thousands of predictions per second, that’s not just a benchmark detail – it’s the difference between:
* Needing 20 servers vs 5
* Hitting your p95 latency SLO vs constantly firefighting
* Being able to ship a more complex model without blowing up infra costs
So if you’re in a setting where every millisecond matters (ads, search ranking, real-time pricing, fraud detection, large-scale recommendations), CatBoost isn’t just “another GBDT.” It’s a latency-optimized, production-friendly workhorse that lets you keep accuracy and speed instead of trading one for the other.
'Mastering CatBoost' -> valeman.gumroad.com/l/Master…
#catboost #datascience #tabulardata
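To make the "same split at every level" idea concrete, here is a minimal sketch of how an oblivious tree can be evaluated. It is not CatBoost's actual implementation: the `ObliviousTree` class, its field names, and the bit ordering are assumptions for illustration, and plain Python is of course not branchless or vectorized. It only shows the indexing scheme, where one comparison per level contributes one bit of the leaf index.

```python
from dataclasses import dataclass


@dataclass
class ObliviousTree:
    """Toy symmetric tree: one (feature, threshold) pair per level."""
    feature_idx: list   # feature index tested at each level
    thresholds: list    # threshold used at each level
    leaf_values: list   # 2 ** depth leaf values

    def predict(self, x):
        leaf = 0
        # Every row answers the same question at a given level, so each
        # comparison simply sets one bit of the leaf index.
        for level, (f, t) in enumerate(zip(self.feature_idx, self.thresholds)):
            leaf |= int(x[f] > t) << level
        return self.leaf_values[leaf]


# Hypothetical depth-2 tree with four leaves.
tree = ObliviousTree(
    feature_idx=[0, 1],
    thresholds=[0.5, 10.0],
    leaf_values=[0.0, 1.0, 2.0, 3.0],
)
print(tree.predict([0.7, 3.0]))   # bits: level 0 set, level 1 clear
print(tree.predict([0.2, 12.0]))  # bits: level 0 clear, level 1 set
```

Because the loop has a fixed length and no data-dependent branching on tree structure, the same computation maps naturally onto vectorized comparisons over a batch of rows.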

Valeriy M., PhD, MBA, CQF

@predict_addict

Jan 31

🚀 New Chapter Released in Mastering CatBoost 🔥🔥🔥
A new chapter just dropped in Mastering CatBoost: The Hidden Gem of Tabular AI, and this one is a major milestone for the book. This chapter opens CatBoost as a system, not just a library or a collection of tricks. If you’ve ever felt that CatBoost “behaves differently” from XGBoost or LightGBM but couldn’t quite articulate why, this chapter finally explains it.
What this new chapter delivers (and why it’s different)
This is not documentation and not a rehash of the original papers. It connects theory, engineering, and hardware in one coherent pipeline. You’ll learn:
• The No-Peeking Contract: How CatBoost enforces leakage-free learning by design, and how this single rule shapes data storage, statistics, training loops, and memory layout.
• The Two-Brain Architecture: Why CatBoost is best understood as:
  – a Statistics Engine (permutation-aware, ordered, unbiased)
  – a Tree / Hardware Engine (symmetric trees, SIMD-friendly, branchless inference)
• Permutation Machinery Done Right: How ordered target statistics and ordered gradients are implemented without training N models, using sliding prefixes and supporting models.
• Why Symmetric (Oblivious) Trees Pay Twice: One structural choice gives:
  – built-in regularization on noisy / categorical data
  – extreme inference speed via bitwise scoring
• Inference as Bitwise Computation: How CatBoost turns tree evaluation into comparisons → bits → leaf index, enabling production-grade throughput.
• Architecture → Parameters Mapping: Parameters finally make sense once you see which subsystem they actually control.
This chapter explains why CatBoost “just works” on real-world tabular data, not by magic but by architecture.
📘 Get the book
Standard edition: 👉 valeman.gumroad.com/l/Master…
Pro edition (early access, updates, deeper material): 👉 valeman.gumroad.com/l/Master…
If you work seriously with tabular data, this new chapter alone is worth it.
#catboost #machinelearning #tabulardata
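The "no peeking" idea behind ordered target statistics can be sketched in a few lines. This is a simplified single-pass illustration, not the book's or the library's code: the function name `ordered_target_stats` and the parameters `prior` and `a` are assumptions, the rows are assumed to be already in a random permutation order, and CatBoost itself uses several permutations. Each row's category is encoded from the targets of earlier rows only, using the smoothed form (sum + a * prior) / (count + a).

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each category value using only rows that came before it."""
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # The current row's own target is not used for its own encoding.
        encoded.append((s + a * prior) / (n + a))
        # Only after encoding does this row join the running prefix.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


# Hypothetical toy data: a binary target and one categorical column.
cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
encoded = ordered_target_stats(cats, ys, prior=0.5)
print(encoded)
```

The running dictionaries play the role of the "sliding prefix": one pass over the data yields a leakage-free encoding for every row without fitting a separate model per row.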

Frank Hutter

@FrankRHutter

Jan 13

This is a special moment for us at @prior_labs. I’ve followed @ylecun’s work since the start of my career. His relentless focus on "what’s next" in AI has always been an inspiration. Today, I’m proud to say that he is joining us as a scientific advisor to help build the future of Tabular Foundation Models. Yann understands that while LLMs have captured the world's attention, they have severe limitations. The vast majority of the world's data is tabular and LLMs are terrible with statistics and numbers. Our model, #TabPFN, fills this void, and the progress we are seeing is exponential. A warm welcome to the team, Yann. It’s an honor to work with you alongside @bschoelkopf, Madelon Hulsebos, and @SamuelMullr on this powerhouse board. #DeepLearning #DataScience #TabularData #AI

Prior Labs

@prior_labs

Jan 13

Honored to announce that Yann LeCun @ylecun is joining Prior Labs’ Scientific Advisory Board.

Valeriy M., PhD, MBA, CQF

@predict_addict

Jan 7

The Era of "Manual" Gradient Boosting is Over.
Remember when we used to manually initialize weights in neural networks? That’s what manual categorical encoding looks like in 2026. For years, XGBoost was the naked "king" of Kaggle. But as datasets grew more complex and categorical-heavy (e-commerce, finance, fraud), the "old guard" of GBDTs started showing their age. They require extensive preprocessing pipelines to handle what modern architectures handle natively.
The Reality Check:
XGBoost: Requires you to decide how to handle high-cardinality features.
CatBoost: Handles them automatically using ordered target statistics, which turn category values into numeric signals computed only from preceding rows and so limit target leakage and overfitting.
If your training script is 60% preprocessing and 40% modeling, you are using the wrong tool. Upgrade your stack.
valeman.gumroad.com/l/Master… valeman.gumroad.com/l/Master…
#AI #MachineLearning #TabularData #TechDebt #CatBoost

Mastering CatBoost - Pro Edition

🔥 Pro Edition: Mastering CatBoost: The Hidden Gem of Tabular AI (Early Access). The elite version of the book, trusted by data science leaders in 100 countries. Unlock the premium toolkit behind...

valeman.gumroad.com

Céline Coustaut

@coustautc

16 Dec 2025

AI agent Ana wants to speak complex data with natural language @datanews_nl bit.ly/4rRnUyM @TextQL #MultiCloud #AI #Agent #MCP #SQL #Automation #DataIntegration #TabularData #ITPT @ITPressTour 64th Edition in New York

Ofir Lindenbaum

@Ofirlin

2 Dec 2025

🚀 Heading to #NeurIPS 2025 in San Diego! Proud to present two posters with my students: • SUMO — “Subspace-Aware Moment-Orthogonalization” — a new optimizer that speeds up convergence, cuts memory demands, and boosts performance. Catch us on Wednesday, Exhibit Hall C,D,E #910, 11:00–14:00 PST. 📈 arXiv: arxiv.org/abs/2505.24749 • TANDEM — “Hybrid Autoencoders for Tabular Data” — a model-based augmentation scheme that combines decision-tree and neural encoders to push the boundaries of NNs in low-label settings. See us on Friday, Exhibit Hall C,D,E #3416, 11:00–14:00 PST. 🚀 NeurIPS page: neurips.cc/virtual/2025/post… If you’re into tabular data, efficient fine-tuning, or anything close to my line of research, come say hi and chat! 🙌 #machinelearning #tabulardata #LLMs #selfsupervised #NeurIPS25

SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating...

Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs), enabling operations within constrained hardware...

arxiv.org
