🔥 Introducing LongCat-Next: A Discrete Native Autoregressive Multimodal Model
LongCat-Next integrates language, vision, and audio into a unified discrete autoregressive model, extending Next-Token Prediction to native multimodality and delivering industrial-strength performance across diverse multimodal domains.
🔑 Key Features:
⚙️ 68.5B total parameters with 3B active, built on the LongCat-Flash-Lite MoE backbone; excels at seeing, painting, and speaking within a unified discrete autoregressive framework.
🧩 Discrete Native Autoregression Paradigm (DiNA): We introduce DiNA, a unified paradigm that extends next-token prediction from language to native multimodality, internalizing diverse modalities into a shared discrete token space.
🌐 Discrete Native Any‑Resolution Vision Transformer (dNaViT): A unified visual tokenizer and de-tokenizer that encodes images into discrete IDs with semantic completeness, enabling both understanding and generation at any resolution. This approach overcomes the performance ceiling of discrete vision modeling in understanding tasks and reconciles the conflict between understanding and generation.
👀 Visual Understanding: Fine-grained visual perception for complex tasks such as OCR, chart and GUI interpretation, and document analysis, plus advanced STEM reasoning.
🎨 Visual Generation: Generation under a 28x compression ratio at arbitrary resolution, with competitive performance, especially in text rendering.
🎧 Speech: Strong audio comprehension capabilities, low-latency and intelligent audio-to-audio interaction, as well as speech synthesis featuring customizable voice cloning.
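To make the "shared discrete token space" idea behind DiNA concrete, here is a minimal Python sketch of one common way to do it: give each modality's codebook its own non-overlapping ID range inside a single vocabulary, so text, image, and audio tokens can sit in one sequence for next-token prediction. This is an illustration under stated assumptions, not LongCat-Next's actual implementation: the codebook sizes and the helper names (`build_offsets`, `to_shared`, `from_shared`) are hypothetical, and the announcement does not say how the real vocabulary is laid out.

```python
# Hypothetical codebook sizes; the announcement does not state the real ones.
CODEBOOK_SIZES = {"text": 1000, "image": 500, "audio": 200}


def build_offsets(sizes):
    """Assign each modality a contiguous, non-overlapping ID range."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start is now the total shared vocabulary size


OFFSETS, VOCAB_SIZE = build_offsets(CODEBOOK_SIZES)


def to_shared(modality, local_ids):
    """Map modality-local token IDs into the shared vocabulary."""
    size = CODEBOOK_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{modality} id {i} is outside [0, {size})")
    return [OFFSETS[modality] + i for i in local_ids]


def from_shared(shared_id):
    """Recover (modality, local_id) from a shared-vocabulary ID."""
    for name, start in OFFSETS.items():
        if start <= shared_id < start + CODEBOOK_SIZES[name]:
            return name, shared_id - start
    raise ValueError(f"id {shared_id} is outside the shared vocabulary")


# One interleaved sequence that a single autoregressive model could predict.
sequence = (
    to_shared("text", [5, 7])
    + to_shared("image", [0, 499])
    + to_shared("audio", [3])
)
print(VOCAB_SIZE, sequence)
```

Because every ID maps back to exactly one modality, a predicted token can be routed to the matching de-tokenizer (text, image, or audio) without any extra signal.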
📄 Paper:
github.com/meituan-longcat/L…
🔗 GitHub:
github.com/meituan-longcat/L…
😊 HuggingFace:
huggingface.co/meituan-longc…
💻 Demo:
longcat.chat/longcat-next
📖 Blog:
longcat.chat/longcat-next/in…