14 Jul 2025
🚀 Summer Fest Day 3: Cost-Effective MoE Inference on CPU, from the Intel PyTorch team
Deploying 671B DeepSeek R1 with zero GPUs? SGLang now supports high-performance CPU-only inference on Intel Xeon 6, enabling billion-scale MoE models like DeepSeek to run on commodity CPU servers.
Key highlights:
1. Full CPU backend for SGLang with Intel AMX
2. Native BF16 / INT8 / FP8 support for both Dense and Sparse FFNs
3. 6–14× TTFT and 2–4× TPOT speedup vs. llama.cpp
4. 85% memory bandwidth efficiency with optimized MoE kernels
5. Flash Attention V2, MLA, and MoE all optimized for CPU
6. Multi-NUMA parallelism mapped from GPU-style Tensor Parallelism
This work is now fully upstreamed to SGLang main. Read how we achieved it, and how far you can go without a GPU 👇
#LLMInfra #ModelServing #MoE #Xeon6 #SGLang #FP8 #INT8 #CPUInference
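For readers unfamiliar with the two latency metrics quoted in the post above: TTFT is time to first token and TPOT is time per output token. A minimal sketch of how they are commonly computed from per-token arrival timestamps follows. The function name `latency_metrics` and the sample timestamps are hypothetical and for illustration only; they are not taken from SGLang or llama.cpp.

```python
def latency_metrics(request_start, token_times):
    """Return (ttft, tpot) in seconds.

    request_start: time the request was sent (seconds).
    token_times:   arrival time of each generated token (seconds, ascending).
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT: average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps: first token after 2.0 s, then one token every 0.1 s.
ttft, tpot = latency_metrics(0.0, [2.0, 2.1, 2.2, 2.3])
```

Under this definition, a "6–14× TTFT speedup" means the first token arrives 6 to 14 times sooner, while a "2–4× TPOT speedup" means each later token is produced 2 to 4 times faster.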
22 Apr 2025
Great find on Microsoft’s BitNet release — major validation for CPU-native AI! At #Cortensor, we’ve believed from day 1: inclusive AI means CPUs, not just high-end GPUs. BitNet shows we're not alone. #Apple, #Meta, #Microsoft — all aligning. 📄 Paper: arxiv.org/abs/2504.12285 💻 GitHub: github.com/microsoft/BitNet 🧵 Source: x.com/Sumanth_077/status/191… Thanks for sharing this, @VOLKERTS_TRADES It’s $COR or nothing. #Cortensor #CPUInference #InclusiveAI #BitNet #DecentralizedAI

21 Apr 2025
Microsoft released bitnet.cpp: A blazing-fast open-source 1-bit LLM inference framework that runs directly on CPUs. You can now run 100B parameter models on local x86 CPU devices with up to 6x speed improvements and 82% less energy consumption. 100% Open Source
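A rough back-of-the-envelope sketch of why a 100B-parameter model becomes plausible on a CPU machine at very low bit widths. It counts weight storage only and assumes an idealized 1.58 bits per weight; activations, KV cache, embeddings, and the actual packing format used by bitnet.cpp are ignored, so real memory use will differ. The helper `weight_gigabytes` is hypothetical.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Idealized weight storage in GB (10^9 bytes), ignoring all overheads."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gigabytes(100e9, 16)       # about 200 GB at 16 bits per weight
ternary_gb = weight_gigabytes(100e9, 1.58)  # about 19.75 GB at 1.58 bits per weight
```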
19 Oct 2024
Now you can run a 100B LLM on a single CPU
Microsoft has released bitnet.cpp as open source. It is the official inference framework for 1-bit LLMs on CPUs, and it enables running a 100B BitNet b1.58 model on a single CPU.
Key features:
bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models seeing greater performance gains. It reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency.
On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%.
It achieves speeds comparable to human reading (5-7 tokens per second).
bitnet.cpp supports a list of 1-bit models available on Hugging Face.
#llms #cpuinference #bitnetllm #nlproc
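Some background on the "b1.58" in the model name: each weight takes one of three values (-1, 0, +1), and log2(3) is roughly 1.58 bits. Below is a minimal sketch of absmean ternary quantization as described for BitNet b1.58, written in plain Python for a flat list of weights. The function name `absmean_ternary` and the sample weights are hypothetical; this is an illustration of the idea, not code from bitnet.cpp.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using the mean absolute value as scale."""
    # Scale: average magnitude of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


# Made-up weights for illustration.
q, scale = absmean_ternary([0.9, -0.05, 0.4, -1.2])
# An approximate reconstruction of each weight is q[i] * scale.
```

Because every quantized weight is -1, 0, or +1, a matrix multiply reduces to additions and subtractions of activations, which is part of why this scheme suits CPUs.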