Alican Kiraz


@AlicanKiraz0

Jun 15

Part 3: GPU Library Support - Fine-tuning

This is the part we all care about, and where reality parts ways with the spec-sheet numbers... I'll briefly cover what you can and cannot do here, because the metric below already sums it all up. When you fine-tune a model with LoRA, training about 0.5-1% of its parameters at a 4096 context length with batch size and similar settings at their minimum, BF16 takes up roughly 2.5-3.5 times the model's parameter count (in billions) as GB of VRAM. For example:
- Qwen3.5-14B takes around 50-60 GB of VRAM in BF16 fine-tuning; if you add checkpoints, it varies between 55 and 70 GB depending on their frequency. This is where fine-tuning with quantization comes in; bitsandbytes... So we can say that effectively training a model of around 35B on AMD and MLX is very, very hard.
- Also, if you want to train with transformers, there is no PyTorch, PEFT, TRL, etc. Even if you port them, it is very hard.
- Once you start fine-tuning, accelerators like FlashAttention are again not reliable on AMD and MLX...

Multi-node, the most important point! If you didn't care about VRAM, you wouldn't buy these devices in the first place. And if you did buy one, you will definitely need more. At that point you will want not just chaining over Ethernet or RDMA, but distributed memory use such as tensor parallelism and DDP. This is where it all falls apart... Because M chips with 600 GB/s of bandwidth are capped by 80 Gb/s Thunderbolt, and AMD is capped by 10 Gb/s Ethernet... Nvidia uses 200 Gb/s over QSFP56 here, which was a big factor in my own preference, and that is very valuable. Because it is very close to the bandwidth, it lets you do inference and healthy fine-tuning with DDP without a bottleneck.
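The rule of thumb in the post above (BF16 LoRA, about 0.5-1% trainable parameters, 4096 context, minimal batch size) can be written as simple arithmetic. This is a minimal sketch: the function name is hypothetical and the default multipliers are just the 2.5-3.5x range quoted in the post, not values from any library.

```python
def lora_bf16_vram_range_gb(params_billion, low_factor=2.5, high_factor=3.5):
    """Rough VRAM range in GB for BF16 LoRA fine-tuning.

    Assumption (from the post): VRAM in GB is about 2.5-3.5 times the
    parameter count in billions, at 4096 context and minimal batch size.
    """
    return (params_billion * low_factor, params_billion * high_factor)


# Example: a 14B-parameter model.
low, high = lora_bf16_vram_range_gb(14)
print(f"14B model: roughly {low:.0f}-{high:.0f} GB of VRAM")
```

For a 14B model the multiplier gives about 35-49 GB, so the 50-60 GB the post reports for Qwen3.5-14B sits at or above the top of that range; the estimate is best read as a rough order of magnitude.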

s3nh retweeted

Zhi Rui Tam

@zraytam

Jun 14

Day 2 of continuing work on the Muon that Fable 5 coded for bitsandbytes: - It finished Muon-32bit and 8bit using Tri Dao's Gram Newton-Schulz in just 2 shots. However, after it got yanked in the middle of writing, I also wanted to add FSDP-2 support, which is a bit harder than expected. Unlike AdamW, which is elementwise (so DP is easy), Muon needs the full 2D weight matrix. Took several prompts with Opus 4.8 co-working with GPT-5.5 xhigh to make it work (I think). Bonus: 8bit momentum tracks 32bit basically perfectly over my sweep, and nf4/fp4/nvfp4 are indistinguishable

DV8FromTheCode @DV8FromTheCode

Jun 13

🟢 LIVE NOW 🟢 PRAGMATA — Protocol: Leveling Up continues. Wrapping up the Lunum Mines and pushing into the next level. ⚡ twitch.tv/DV8FromTheCode #Pragmata #Capcom #TwitchStreamer #LiveNow #BitsAndBytes

dv8fromthecode - Twitch

I'm DV8FromTheCode, I liked to play games like Minecraft, Satisfactory and more. My goal is to build up a community and just have fun. If you like my stream, make sure to follow.

twitch.tv

Anagha Agile Systems

@aasaitech

Jun 13

x.com/i/article/206428232519…

Anagha Agile Systems

@aasaitech

Jun 13

⚡ Quantization, Distillation & Model Compression — the final practical layer that makes powerful LLMs viable for real-world industrial deployment. Just read this excellent technical white paper from @aasaitech on turning massive models into efficient, edge-ready systems without sacrificing too much intelligence. Key highlights: • 4-bit quantization (GPTQ, AWQ, GGUF, bitsandbytes) as the sweet spot for production • Knowledge distillation: teacher → student for smaller, faster specialized models • Complementary techniques: pruning, LoRA/QLoRA, sparsity • Industrial wins: lower latency/memory on factory hardware, cost reduction, energy-efficient edge orchestration, faster inference on industrial PCs Essential for scaling AI across manufacturing floors, maintenance copilots, robotics, and resource-constrained environments. Full white paper infographic: x.com/aasaitech/status/20653… How are you approaching model compression in your deployments — 4-bit quantization, distillation pipelines, or full QLoRA workflows? #Quantization #ModelCompression #KnowledgeDistillation #LLMOptimization #IndustrialAI #EdgeAI #AgenticAI

Anagha Agile Systems

@aasaitech

Jun 12

x.com/i/article/206394052536…

Víctor Cavero

@vcaverog

Jun 12

x.com/i/article/206555257918…

anonimo

@anonimo1is

Jun 12

Open models: installation, optimization, and customization. This is where you really have control: Efficient local inference with Ollama, vLLM, LM Studio, and Open WebUI. Quantization with GGUF, AWQ, GPTQ, and bitsandbytes. Learn their differences and when to use each option. Efficient fine-tuning: QLoRA with Unsloth, one of the best current combinations for its speed and low VRAM consumption. LoRA adapters. Preparing quality datasets. More advanced techniques: SFT → DPO/GRPO, depending on the case. Optimized deployment and serving. With this you can have customized models that run locally, are private, and are adapted exactly to your domain.

precis0x

@precisox

Jun 12

4. Open models: installation, optimization, and customization. This is where you really have control: - Efficient local inference: Ollama, vLLM, LM Studio, Open WebUI. - Quantization: GGUF, AWQ, GPTQ, bitsandbytes. Learn the differences and when to use each one. - Efficient fine-tuning: - QLoRA with Unsloth (currently one of the best combinations for speed and low VRAM consumption). - LoRA adapters. - Preparing quality datasets. - More advanced techniques: SFT → DPO/GRPO depending on the case. - Optimized deployment and serving. With this you can have customized models that run locally, are private, and are adapted exactly to your domain.

Anagha Agile Systems

@aasaitech

Jun 12

x.com/i/article/206394052536…

Hypnagogia enjoyer @sapientwilight

Jun 11

i just found out bitsandbytes works on rocm now. No more nvidia monopoly on easy quantized finetuning!!

Rahul S retweeted

h100envy

@h100envy

Jun 8

Tim Dettmers wrote bitsandbytes and QLoRA, the reason you can run and fine-tune a serious AI model on one consumer GPU instead of a server farm. 2.2 million installs a month. Almost every open-source model you've run locally passed through his code. His trick: squeeze a model to 4-bit and lose almost nothing. He fine-tuned a 65B model on a single GPU and matched full precision. Everyone rents compute from a lab. He's the reason you don't have to.
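The "squeeze a model to 4-bit" point above comes down to bytes per weight. The sketch below is a rough weights-only estimate under stated assumptions: the function name is hypothetical, 1 GB is taken as 10^9 bytes, and it ignores quantization constants, activations, LoRA adapter weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    params_billion * 1e9 weights * bits_per_weight / 8 bits per byte,
    divided by 1e9 bytes per GB, simplifies to the expression below.
    """
    return params_billion * bits_per_weight / 8


# Compare 16-bit and 4-bit storage for a 65B-parameter model.
for bits in (16, 4):
    print(f"65B at {bits}-bit: {weight_memory_gb(65, bits):.1f} GB")
```

Under these assumptions a 65B model drops from about 130 GB of weights at 16-bit to about 32.5 GB at 4-bit, which is the scale of reduction that makes single-GPU fine-tuning plausible.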

h100envy

@h100envy

Jun 7

x.com/i/article/206334778284…

DV8FromTheCode @DV8FromTheCode

Jun 9

🟢 LIVE NOW 🟢 For Karl! ⛏️💀 Deep Rock Galactic — diving in and not coming up until every dwarf makes it home. ⚡ twitch.tv/DV8FromTheCode #DeepRockGalactic #DRG #TwitchStreamer #LiveNow #BitsAndBytes

尼古拉斯定投

@Nicolas_DCA

Jun 9

Replying to @h100envy

Without bitsandbytes, I'd probably still be running 7B models on CPU, dozens of times less efficiently. This is the person who truly changed the game.

witcheer

@witcheer

Jun 7

x.com/i/article/206316831092…

Python Developer

@PythonDvz

Jun 7

𝐓𝐡𝐞 𝐀𝐈 𝐣𝐨𝐛 𝐦𝐚𝐫𝐤𝐞𝐭 𝐞𝐱𝐩𝐥𝐨𝐝𝐞𝐝 300% 𝐥𝐚𝐬𝐭 𝐲𝐞𝐚𝐫. 𝐁𝐮𝐭 90% 𝐨𝐟 "𝐀𝐈 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐬" 𝐰𝐚𝐬𝐡 𝐨𝐮𝐭. 𝐖𝐡𝐲? 𝐍𝐨 𝐫𝐨𝐚𝐝𝐦𝐚𝐩. 𝐈 𝐛𝐮𝐢𝐥𝐭 𝐦𝐲 𝐜𝐚𝐫𝐞𝐞𝐫 𝐟𝐫𝐨𝐦 𝐳𝐞𝐫𝐨. 𝐇𝐢𝐫𝐞𝐝 𝐚𝐭 𝐅𝐀𝐀𝐍𝐆 𝐢𝐧 18 𝐦𝐨𝐧𝐭𝐡𝐬. 𝐇𝐞𝐫𝐞'𝐬 𝐭𝐡𝐞 𝐞𝐱𝐚𝐜𝐭 10-𝐬𝐭𝐞𝐩 𝐩𝐚𝐭𝐡. 𝐅𝐨𝐥𝐥𝐨𝐰 𝐢𝐭. 𝐎𝐰𝐧 𝐢𝐭. → Step 1: Python Foundations Master Python, Jupyter Notebook, VS Code or PyCharm, Git. Code daily. → Step 2: Maths & Statistics for AI Use NumPy, SciPy, SymPy. Learn via Khan Academy, 3Blue1Brown videos. → Step 3: Machine Learning Algorithms Dive into scikit-learn, pandas, matplotlib/seaborn, XGBoost/LightGBM. Build predictors. → Step 4: Deep Learning Foundations Grasp PyTorch, TensorFlow, Keras. Track with Weights & Biases. → Step 5: Natural Language Processing Work with spaCy, NLTK, Hugging Face, gensim. Process text like a pro. → Step 6: Transformers & LLM Architectures Leverage Hugging Face Transformers, PyTorch Lightning, ONNX Runtime, OpenAI API. → Step 7: Fine-Tuning & Custom Model Training Fine-tune via Hugging Face, DeepSpeed, BitsAndBytes. Log with Weights & Biases, MLflow. → Step 8: LangChain Framework Build chains using LangChain, OpenAI API, Google Gemini, Pinecone, ChromaDB. → Step 9: LangGraph & RAG Systems Create graphs with LangGraph, LlamaIndex, Redis, Weaviate, FAISS. → Step 10: MCP & Agentic AI Systems Deploy agents: OpenAI MCP, CrewAI, AutoGen, Anthropic MCP.

driss guessous

driss guessous @drisspg

Jun 4

I am trying to make ideogram usable on my spark. Problem 1: github.com/ideogram-oss/ideo… Problem 2: bitsandbytes is unbelievably slow

Speed up quantized transformer loading by drisspg · Pull Request #9 · ideogram-oss/ideogram4

Summary I was going through the example and on my spark it takes ~2 minutes to get to first image. Using the run_inference.py entry point. Profiling showed 0.0s start 0.5s state dict fi...

github.com
