Maximilian Beck

Tweets

Pinned Tweet

Maximilian Beck @maxmbeck

19 Mar 2025

Yesterday, we shared the details on our xLSTM 7B architecture. Now, let's go one level deeper 🧑‍🔧 We introduce ⚡️Tiled Flash Linear Attention (TFLA)⚡️, a new kernel algorithm for the mLSTM and other linear attention variants with gating. We find TFLA is really fast! 🧵(1/11)

Maximilian Beck retweeted

Anamaria-Roberta Hartl ✈️ ICML @anamariarp17

Jun 11

New paper "On Subquadratic Architectures: From Applications to Principles" 🙌 On commonsense & reasoning benchmarks, xLSTM, Mamba-2 & Gated DeltaNet perform nearly indistinguishably. Therefore, we analyse them where structure genuinely matters: on code & time series.👇

Maximilian Beck retweeted

Akarsh Kumar @akarshkumar0101

Jun 7

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize). What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels. We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients. Website: akarshkumar.com/smt/ arXiv: arxiv.org/abs/2606.06479
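
The one-step training setup described in the post above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `make_smt_pairs` and `one_step_loss`, the scalar memory, and the running-sum "oracle" are all assumptions made for the example, and the post does not say how the oracle memory states are obtained.

```python
def make_smt_pairs(memories, inputs):
    """Build one-step training pairs ((m_t, x_{t+1}), m_{t+1}).

    memories: [m_0, ..., m_T] given by a (hypothetical) oracle.
    inputs:   [x_1, ..., x_T], so inputs[t] is the input that follows m_t.
    """
    if len(memories) != len(inputs) + 1:
        raise ValueError("need exactly one more memory state than inputs")
    return [((memories[t], inputs[t]), memories[t + 1])
            for t in range(len(inputs))]


def one_step_loss(cell, pairs):
    """Mean squared error of a cell on independent one-step targets.

    Each pair is scored on its own: the cell is never unrolled through
    time, so the terms of the sum could be evaluated in parallel.
    """
    return sum((cell(m, x) - target) ** 2 for (m, x), target in pairs) / len(pairs)


# Toy data: the assumed oracle memory is the running sum of the inputs.
inputs = [1.0, 2.0, 3.0]
memories = [0.0, 1.0, 3.0, 6.0]
pairs = make_smt_pairs(memories, inputs)

perfect_cell = lambda m, x: m + x      # reproduces the oracle exactly
leaky_cell = lambda m, x: 0.5 * m + x  # forgets half of its memory each step

print(one_step_loss(perfect_cell, pairs))
print(one_step_loss(leaky_cell, pairs))
```

Because every target depends only on the previous oracle state and the next input, no gradient has to flow backwards through earlier time steps in this setup.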

Maximilian Beck retweeted

Tilde @tilderesearch

Jun 2

x.com/i/article/206175093417…

Maximilian Beck retweeted

Lukas Aichberger @aichberger

Jun 1

We unlocked the working memory of LLMs 💥 Reasoning in Memory (RiM) replaces autoregressive "thinking out loud" with fixed memory blocks that form a task-specific workspace for latent reasoning. The key idea is simple: reasoning should happen inside the LLM, not in its output!

Maximilian Beck @maxmbeck

May 14

Life update: A few weeks ago, I moved to Paris 🇫🇷 to start a new position as AI Scientist at Meta FAIR. I am excited about this new chapter and look forward to the opportunities ahead.✨

Maximilian Beck retweeted

Ai2 @allen_ai

Apr 30

Recipes for teaching language models to handle long inputs don't work equally well across model families. We wanted to know why—is it the architecture, the training data, or both? 🧵

Maximilian Beck retweeted

Günter Klambauer @gklambauer

May 1

GREAT news!!! 4 papers from our group got accepted at ICML 2026!!!
- 🧬 Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design
- 🔁 xLSTM Distillation: Achieving Teacher-Student Parity Through Efficient Hybrid Architectures

Maximilian Beck retweeted

Sepp Hochreiter @HochreiterSepp

Apr 21

RNNs like xLSTM with a vertically chunked inference strategy for efficient memory use: arxiv.org/abs/2604.18199 Chunking enables linear-time and constant-memory computation, like TFLA for xLSTM (arxiv.org/abs/2503.14376): blocks are chunked via recurrent updates, which speeds up computation considerably.

Linear-Time and Constant-Memory Text Embeddings Based on Recurrent...

Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient...
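
To make the idea of chunking via recurrent updates concrete, here is a minimal pure-Python sketch of chunkwise causal linear attention. It is not TFLA and not the method of either linked paper: it omits gating and normalization, and the function names and toy data are assumptions chosen for illustration. It only shows that carrying a fixed-size state between chunks gives the same outputs as the quadratic formulation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention_quadratic(q, k, v):
    """Reference: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    out = []
    for t in range(len(q)):
        o = [0.0] * len(v[0])
        for s in range(t + 1):
            w = dot(q[t], k[s])
            for j in range(len(o)):
                o[j] += w * v[s][j]
        out.append(o)
    return out


def linear_attention_chunked(q, k, v, chunk_size):
    """Same outputs, computed chunk by chunk with a fixed-size state."""
    dk, dv = len(k[0]), len(v[0])
    # state[i][j] accumulates k_s[i] * v_s[j] over all previous chunks.
    state = [[0.0] * dv for _ in range(dk)]
    out = []
    for start in range(0, len(q), chunk_size):
        end = min(start + chunk_size, len(q))
        for t in range(start, end):
            # Contribution of all earlier chunks, read from the state.
            o = [sum(q[t][i] * state[i][j] for i in range(dk)) for j in range(dv)]
            # Causal contribution from inside the current chunk.
            for s in range(start, t + 1):
                w = dot(q[t], k[s])
                for j in range(dv):
                    o[j] += w * v[s][j]
            out.append(o)
        # Recurrent update: fold the finished chunk into the state.
        for s in range(start, end):
            for i in range(dk):
                for j in range(dv):
                    state[i][j] += k[s][i] * v[s][j]
    return out


# Toy sequence of length 5 with 2-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
v = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 1.0], [0.0, 1.0]]

reference = linear_attention_quadratic(q, k, v)
for size in (1, 2, 5):
    print(size, linear_attention_chunked(q, k, v, size) == reference)
```

For a fixed chunk size, the work per chunk is bounded and the carried state has a constant size, so total time grows linearly with sequence length in this simplified setting.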

Maximilian Beck @maxmbeck

Apr 12

We’ve released 35 xLSTM checkpoints from our scaling law study, spanning 160M to 7B parameters and trained on 3B - 1.5T tokens from the DCLM dataset. huggingface.co/NX-AI/xlstm_s…

NX-AI/xlstm_scaling_laws · Hugging Face

Maximilian Beck @maxmbeck

3 Oct 2025

🚀 Excited to share our new paper on scaling laws for xLSTMs vs. Transformers. Key result: xLSTM models Pareto-dominate Transformers in cross-entropy loss.
- At fixed FLOP budgets → xLSTMs perform better
- At fixed validation loss → xLSTMs need fewer FLOPs
🧵 Details in thread

Maximilian Beck @maxmbeck

Apr 12

These checkpoints come from our token-per-parameter training setup and are fully compatible with the xLSTM-7B Hugging Face implementation: huggingface.co/NX-AI/xLSTM-7…

NX-AI/xLSTM-7b · Hugging Face

Maximilian Beck @maxmbeck

Apr 12

If you want to know more, visit our poster at ICLR: iclr.cc/virtual/2026/poster/…

Maximilian Beck @maxmbeck

Mar 27

👨‍🎓Last week, I successfully defended my PhD thesis - an incredibly exciting and rewarding milestone after 3.5 years of work on xLSTM: Recurrent Neural Network Architectures for Scalable and Efficient Large Language Models

Maximilian Beck @maxmbeck

Mar 27

And of course many thanks to @KorbiPoeppel for being an amazing co-author on nearly all xLSTM papers. I also want to thank all collaborators, friends, and family for their support.🤗

Maximilian Beck @maxmbeck

Mar 27

Now, I’m looking forward to a relaxing Easter break and I’m excited for what comes next 🚀 📄 Thesis: maxbeck.ai/resources/phd_the… 🎤 Defense slides: maxbeck.ai/resources/talks/2…

Maximilian Beck @maxmbeck

Mar 21

Looks great! 🔥 Thanks for adding, @rasbt

Sebastian Raschka @rasbt

Mar 20

Replying to @maxmbeck

Added ✅ sebastianraschka.com/llm-arc… Thanks again!

Maximilian Beck retweeted

Niklas Schmidinger @smdrnks

Mar 17

Excited to share our new paper: Effective Distillation to Hybrid xLSTM Architectures. TL;DR: we retrofit / graft / distill / linearize Transformers into xLSTM-SWA hybrids with fixed-size states. This gives a practical path to studying linear and hybrid architectures starting from already strong pretrained models.

Sepp Hochreiter @HochreiterSepp

Mar 17

xLSTM Distillation: arxiv.org/abs/2603.15590 Near-lossless distillation of quadratic Transformer LLMs into linear xLSTM architectures enables cost- and energy-efficient alternatives without sacrificing performance. xLSTM variants of instruction-tuned Llama, Qwen, & Olmo models.