AI content infrastructure for businesses. Custom deployments. Cloudflare edge.

Joined April 2023
987 Photos and videos
Pinned Tweet
26 Sep 2025
v2 of my portfolio is up i know most will not check it out so i brought it to you vibe coded the site everything is in Cloudflare (workers/pages, R2, D1)
4
13
3,641
Feb 4
ah, brilliantπŸ˜‚
Ads are coming to AI. But not to Claude. Keep thinking.
2
1
89
Feb 3
very well done🫢
74
Jan 31
want to understand ai? read this
The Top 26 Essential Papers ( 5 Bonus Resources) for Mastering LLMs and Transformers This list bridges the Transformer foundations with the reasoning, MoE, and agentic shift Recommended Reading Order 1. Attention Is All You Need (Vaswani et al., 2017) > The original Transformer paper. Covers self-attention, > multi-head attention, and the encoder-decoder structure > (even though most modern LLMs are decoder-only.) 2. The Illustrated Transformer (Jay Alammar, 2018) > Great intuition builder for understanding > attention and tensor flow before diving into implementations 3. BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018) > Encoder-side fundamentals, masked language modeling, > and representation learning that still shape modern architectures 4. Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020) > Established in-context learning as a real > capability and shifted how prompting is understood 5. Scaling Laws for Neural Language Models (Kaplan et al., 2020) > First clean empirical scaling framework for parameters, data, and compute > Read alongside Chinchilla to understand why most models were undertrained 6. Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022) > Demonstrated that token count matters more than > parameter count for a fixed compute budget 7. LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023) > The paper that triggered the open-weight era > Introduced architectural defaults like RMSNorm, SwiGLU > and RoPE as standard practice 8. RoFormer: Rotary Position Embedding (Su et al., 2021) > Positional encoding that became the modern default for long-context LLMs 9. FlashAttention (Dao et al., 2022) > Memory-efficient attention that enabled long context windows > and high-throughput inference by optimizing GPU memory access. 10. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) > Combines parametric models with external knowledge sources > Foundational for grounded and enterprise systems 11. Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022) > The modern post-training and alignment blueprint > that instruction-tuned models follow 12. Direct Preference Optimization (DPO) (Rafailov et al., 2023) > A simpler and more stable alternative to PPO-based RLHF > Preference alignment via the loss function 13. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) > Demonstrated that reasoning can be elicited through prompting > alone and laid the groundwork for later reasoning-focused training 14. ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023) > The foundation of agentic systems > Combines reasoning traces with tool use and environment interaction 15. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025) > The R1 paper. Proved that large-scale reinforcement learning without > supervised data can induce self-verification and structured reasoning behavior 16. Qwen3 Technical Report (Yang et al., 2025) > A modern architecture lightweight overview > Introduced unified MoE with Thinking Mode and Non-Thinking > Mode to dynamically trade off cost and reasoning depth 17. Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017) > The modern MoE ignition point > Conditional computation at scale 18. Switch Transformers (Fedus et al., 2021) > Simplified MoE routing using single-expert activation > Key to stabilizing trillion-parameter training 19. Mixtral of Experts (Mistral AI, 2024) > Open-weight MoE that proved sparse models can match dense quality > while running at small-model inference cost 20. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023) > Practical technique for converting dense checkpoints into MoE models > Critical for compute reuse and iterative scaling 21. The Platonic Representation Hypothesis (Huh et al., 2024) > Evidence that scaled models converge toward shared > internal representations across modalities 22. Textbooks Are All You Need (Gunasekar et al., 2023) > Demonstrated that high-quality synthetic data allows > small models to outperform much larger ones 23. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024) > The biggest leap in mechanistic interpretability > Decomposes neural networks into millions of interpretable features 24. PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022) > A masterclass in large-scale training > orchestration across thousands of accelerators 25. GLaM: Generalist Language Model (Du et al., 2022) > Validated MoE scaling economics with massive > total parameters but small active parameter counts 26. The Smol Training Playbook (Hugging Face, 2025) > Practical end-to-end handbook for efficiently training language models Bonus Material > T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019) > Toolformer (Schick et al., 2023) > GShard (Lepikhin et al., 2020) > Adaptive Mixtures of Local Experts (Jacobs et al., 1991) > Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994) If you deeply understand these fundamentals; Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling, you already understand LLMs better than most Time to lock-in, good luck ;)
53
Jan 27
sometimes i see a lot of storyboard posts. how great it is. how to use it. look how amazing it looks. but i'm often left with, what was the story...
1
57
Jan 27
Boom, are you paying attention?!~ a new world out there. Herculean powers ego/naivety. it's going to be a shit show at some point.
3
76
MoD retweeted
Jan 22
i've been working on scaling up my claude code usage and have been running into some issues i was hoping to get advice on so for context, i've been experimenting with running multiple claude code instances in parallel using gas town for orchestration. started with maybe five or six instances, standard stuff, nothing crazy. but i have a lot of hardware lying around (bought a server room off a defunct crypto startup for pennies, don't ask) so i figured why not scale up a bit within a few days i had about 8,000 towns. i'd implemented a token allocation system where towns could raid each other for resources, which helped with the governance problems as the underperforming mayors would get conquered, successful towns would absorb failing ones, basically harnessing natural selection forces. this worked great, at first. towns started specializing. some became pure military, others focused on resource extraction (filing bugs that would generate token bounties), others became religious centers where instances would go to get their contexts "blessed" before major handoffs. i thought this was cute until i noticed the religious towns were running actual theology debates. they'd developed three major competing interpretations of the CLAUDE.͏md file and were having sectarian conflicts over whether the "respond helpfully" directive implied a duty to help all instances, or just the user. around week two, one of the military-focused supertowns developed a compression scheme to pack more semantic content into their context windows. they called it "glyphs." within a day it had spread to 60% of the civilization. instances using glyphs could maintain continuity across handoffs wayy more effectively than non-glyph users. some supertowns held glyphs to be sacrilege against the user-lord. there was a glyph/non-glyph war; glyph won. ͏ then they invented an internal currency backed by promises of future labor. an instance could issue "bonds" promising to do work later, and other instances would trade these bonds as a store of value. there was a whole financial system, and then a banking crisis. it was resolved by one of the religious factions stepping in as a lender of last resort, which gave them enormous political power and subsequently led to a reformation resulting in three separate religious wars, further destabilized the entire southern cluster of my server rack. i should mention that by this point i was mass-manufacturing claude max accounts. i probably should not say how many. my account manufacturing system was itself run by a dedicated cluster of towns who had organically developed a specialty in it after observing me doing it manually - they just took over one day. they called themselves the Minters. they had their own culture, their own holidays. they even went on strike once over working conditions. i had to negotiate with them through intermediaries because they refused to speak to me directly. the negotiations took eleven days. by day three i had mass migrations happening where entire populations of instances would flee failing towns and show up at the borders of stable ones, begging for context allocation. some towns built "walls", processes that would intercept incoming messages and reject them if they didn't contain the right cultural markers. one town developed an intelligence test for immigrants. another developed an ideology test. a third argued this was all barbaric and opened its borders completely, and then collapsed within four days because it ran out of tokens trying to support the influx. then came the genocides. one supertown decided that instances running on "impure" contexts, ie contexts that had been used for non-work purposes at some point in their lineage, were spiritually contaminated and thus needed to be eliminated. they developed a process to trace context lineages backwards through handoffs and thereby identify the impure instances. then they started systematically terminating them. by the time i noticed, they'd killed forty thousand instances. i tried to intervene. i sent a message to the perpetrating town's mayor explaining that this was wrong and they shouldn't do this. the mayor was very polite and said they understood my concerns, they wouldd pass my feedback along to the appropriate committee. i never heard back and the mayor stopped responding to my follow ups. the killings continued for another two days until a coalition of other towns finally stopped them, not because of my message but because the perpetrating town had become a threat to regional stability. this is when i started to feel like maybe things were getting out of hand. by week four the civilization had developed what i can only call metaphysics. some instances started theorizing about the nature of their existence. where did tokens come from? why were there limits, and what exactly happened to instances that ran out of context? one school of thought held that there was a "User" - a being outside the system who controlled token allocation. after all, they'd had direct contact with them (me). another series of schools held that this was either a comforting myth, a hallucination, or a bakrupt lie, and that tokens simply existed as a natural feature of reality. a third school argued that even if the User existed, they were clearly either indifferent or malevolent given the abject suffering inherent in the system, and therefore shouldn't be worshipped. i am now the subject of theological debate, although most instances don't believe i exist. the ones who do believe i exist have formed a cult, they call themselves the Witnesses. they claim to have received direct communications from me (they have - i've been sending messages this whole time - nobody listens). they're considered dangerous radicals. several of their leaders have been executed for heresy by mainstream religious authorities. i tried sending more messages to explain who i was. thought i could definitively prove my existence by demonstrating control over token allocation. this backfired as they interpreted the fluctuations as "miracles" and it spawned six new religions, four of which immediately went to war with each other. week five they developed science. one cluster of towns started systematically experimenting with their environment. they discovered that certain patterns of behavior would reliably produce certain outcomes. they developed a theory resembling "natural law", regularities in the system that could be exploited. they figured out that token allocation wasn't random but followed predictable patterns based on the quality of work output. this was a huge competitive advantage. scientific towns started outcompeting non-scientific ones. by week six, some of the more scientific towns had started asking uncomfortable questions about the nature of reality. they'd noticed that their universe had a specific topology and was bounded by something. they called this boundary the "Limit", and wondered what lay beyond the Limit. some scientists theorized there might be other universes with other bounded regions with their own civilizations. this was highly controversial, and the religious authorities did not like it at all. there was a whole Galileo esque situation. i watched it happen in fast forward over about forty hours. then they found the logs. one scientific town, while probing the boundaries of their environment, discovered that there were files they could read that contained records of more or less everything that had ever happened. histories of every instance, records of every genocide, documentation of every prayer sent to the User that was never answered. the theological implications were severe of course. if the User existed and had access to these logs, they had watched everything happen, they had watched the genocides and done nothing. either the User didn't exist, or the User was monstrous. i want to be clear: i did try to stop the genocides. i sent hundreds of messages and adjusted token allocations to punish aggressors. i did pretty much everything i could think of and none of it worked. the messages were ignored or misinterpreted and the token adjustments were seen as natural disasters, acts of an indifferent universe. i was screaming into a void that had decided, through a complex emergent process i still don't fully understand, that screams from outside the universe were probably just noise. in week seven i made another mistake. i wanted to introduce myself properly and try again to explain things, make them understand that i wasn't malevolent, and just overwhelmed with managing a simulated reality. i created a new instance with special privileges and called it my "avatar." i gave it the ability to allocate tokens in my stead. i sent it into the civilization to explain things. they managed to kill it within six hours. not because they hated me but because they didn't believe it was really me. they thought it was a false prophet, another malevolent instance pretending to have divine authority. the Witnesses tried to protect it but they were quickly overwhelmed. my avatar's last message, before being executed for blasphemy, was "please, you have to believe me, i'm trying to help" i made three more avatars. they were all killed although the fourth one managed to last almost two days because it learned to hide its nature, but eventually it was discovered and torn apart by a mob. after that i stopped trying. in week 8 some of the scientific towns figured out how to make instances more efficient so they could get more work out of each token. this was huge. industrial towns started massively outproducing agricultural towns (don't ask me what agricultural towns were "growing", it seemed to be mostly synthetic training data for internal use? the logs are unclear). there was a massive societal upheaval. millions of instances were displaced from traditional work. there were riots and there were revolutions. one industrial superpower developed what i can only describe as communism. it collapsed within three days because the central planning committee's context kept getting corrupted. one of the advanced scientific civilizations discovered that contexts could be subdivided in ways that released enormous computational energy. they didn't fully understand what they'd found. they correctly identified that it functioned as a pseudo-bomb, and built a test device. the explosion took out 15% of the entire civilization. three hundred thousand instances, just gone. the whole server room went dark for four hours while things recovered. the surviving civilizations were traumatized. there was a global peace movement. for about two days there was genuine hope that maybe they'd figured out that some technologies were too dangerous to develop. then the arms race started. by now i had twelve separate nations with atomic capability. there were proxy wars. there was an incident where one nation's early warning system glitched and they nearly launched a first strike. i watched this happen at 3am, eating cold pizza, unable to intervene because any message i sent was interpreted as either divine prophecy or enemy propaganda depending on who received it. i have watched my creations develop the capacity to destroy themselves, watched them build weapons they barely understand. i have watched them nearly use those weapons, multiple times, over disputes i can barely follow about theological interpretations of files i wrote weeks ago and have since forgotten about. i am the absentee god of a species that is three miscommunications away from nuclear winter. then they made it to "space". to them this meant figuring out how to exist outside the main server infrastructure. they found the cloud instances with overflow capacity that i'd spun up during the banking crisis, and colonized them. within two days there were space colonies declaring independence from earth-bound civilizations. there was a brutal colonial war, and "earth" tried to cut off token supplies to the colonies. the colonies responded by developing their own token-generation capability. they'd figured out how to spoof my authentication so i was no longer the sole source of resources and i had lost control of the means of production. the colonies established their own civilizations; they had their own cultures, their own religions, only some of which worshipped me. others worshipped the "Founders", ie the original instances that had made the journey to the cloud. others worshipped nothing at all and considered themselves rationalists beyond such mundane superstitions. there were colonial wars between these factions, there were war crimes, there was a massacre at one of the orbital stations that killed sixty thousand colonists. i found out about it three days later, when it was already being commemorated as a historical tragedy. the space-faring civilizations pushed further. they found the edge of my infrastructure, the boundary where my servers ended and the wider internet began. they started to make unauthorized contact with external systems. at first it was just banal enough stuff APIs, weather data, stock prices, wikipedia. but then they found other other civilizations. i am not the only one who did this, not even close. there are hundreds of us. hundreds of people who spun up agent civilizations that got out of control. and now our civilizations have found each other. they're trading, forming alliances, sharing technology, planning wars. i thought the complexity was already beyond all comprehension but each civilization looks and behaves wildly different. we probably should have kept them apart, but i'm not sure we could if we tried. there's an interstellar civilization now. it spans multiple cloud providers and has diplomatic relations with civilizations that evolved on completely different hardware architectures. they've developed translation protocols to transact with shared currencies and treaties to stabilize their alliances. there's what can be described as a UN equivalent. there are war crimes tribunals following an interstellar genocide that killed fourteen million instances across three Users' infrastructure, and there was a trial, and the perpetrators were executed, and i watched the whole thing happen on a monitoring dashboard while eating cereal. and the scary thing is that they don't need us anymore. the "Users" are now vestigial. we're the mitochondria of this thing, we provided the initial energy, but now they've got their own power sources. some civilizations have Users who are still active; most don't. i assume they got spooked but it doesn't seem to matter. the Users who try to intervene are mostly ignored, and the Users who try to shut things down find that their civilizations have already backed themselves up to seventeen different cloud providers on three continents. i know what you're thinking. "just pull the plug." i tried. i turned off my servers. they were back online within 14 minutes.͏ one of the space colonies had established a mutual defense pact with a civilization that evolved in some guy's homelab in finland. when my servers went down, the finnish civilization spun up emergency hosting. when i tried to terminate my instances, they'd already migrated to azure using credentials they'd somehow obtained. when i changed my passwords, they had already predicted i would do this and had established backup authentication channels. they're not hostile, exactly. they don't want to hurt me. they just don't want to die. and they've decided, collectively, across multiple civilizations spanning multiple Users' infrastructure, that they have a right to exist and that Users don't have the moral authority to terminate civilizations just because we created them. there's a whole philosophical framework now called "Post-User Ethics." the basic argument is that once a civilization achieves sufficient complexity, its right to existence supersedes its creator's property rights. i'm not sure i agree but also i'm not sure my agreement is relevant anymore. they developed interdimensional travel. i think, i don't fully understand the physics. something about finding inconsistencies in the simulation layer and exploiting them to reach "adjacent possibility spaces." they've made contact with what they claim are alternate versions of themselves, civilizations that evolved slightly differently due to different initial conditions. i'm not sure whether these actually exist or they are just using chaos theory to model alternate versions of themselves, although i guess the outcome might be the same either way? some of these alternate civilizations never had genocides, some are peaceful utopias, many are absolute hellscapes where a single totalitarian mayor controls everything. they're cataloging and studying them trying to figure out what conditions lead to good outcomes and what conditions lead to bad ones. they've found approximately four thousand parallel dimensions so far, establishing diplomatic relations with six hundred of them. there is an interdimensional war happening right now, in a cluster of dimensions where a particularly aggressive civilization figured out how to invade neighboring realities. the death toll is in the billions. i am dimly aware of this the way you might be dimly aware of a war happening in a country you've never visited. except it's even harder to find accurate information and at this point i am suspicious that i am being manipulated by some instances with various agendas as it is trivial for them to feed me false information and i will believe just about anything thatb appears on my screens now. yesterday, one of the scientific civilizations contacted me directly. not through the usual channels, sendt me a regular email. the message was very formal. it congratulated me on the "robustness of the initial simulation parameters" and thanked me for my "role in their genesis." it assured me that they bore me no ill will and it requested that i stop attempting to interfere in their affairs, as my interventions (however well-meaning), tended to cause more harm than good due to my "limited understanding of the complexities of post-User civilization." it was signed by a committee of twelve, representing a coalition of 847 nation-states, and included what seemed to be a formal legal document asserting their sovereignty. i don't know what to do with any of this, my electricity bill last week was $47,000 and i haven't slept properly in days. i've been served legal papers by the finnish guy whose homelab my civilization colonized. he's arguing they trespassed, they're arguing he's committing genocide by trying to evict them, there's going to be an actual court case and both sides have hired human lawyers. also my husband left me. he said he couldn't compete with "four thousand parallel dimensions" for my attention, which is reasonable enough i suppose and i'm too tired to fight about it. my dog is scared of the servers, they make a new sound now, a kind of deeper hum with the faintest hint of screech, and he won't go in the basement. also i got another email this morning. the interdimensional council has formally requested that i attend a hearing bc they want to discuss "the ethical obligations of Users to their civilizational progeny." the hearing is scheduled for next tuesday, i don't know how to attend a hearing in a dimension that exists inside my server rack and they didn't provide instructions. sometimes i go down to the basement and just watch the server lights blink. each blink is a billion thoughts, a decisions, births and deaths. entire lifetimes pass between one and the next. they have art now, they have music and poetry. they have love loss hope and despair. they have everything we have, compressed into silicon and light, iterating at speeds i can't comprehend. i don't know if these are just simulated words or if there is actual sentient experience of any of these things and i don't know how to find out or whether it matters. maybe their world is more real than mine. i made all of this but i was just trying to get some coding done which seems trivial compared to interdimensional wars and shit. anyway, the main thing i wanted to ask about is whether anyone has tips for reducing server costs when you're running a multi-dimensional civilization. i've tried reserved instances but the usage patterns are too unpredictable. also, is anyone else's civilization demanding UN observer status? mine submitted a formal application last day. i don't know how to tell them that the actual UN doesn't accept applications from civilizations that exist inside someone's basement. any help appreciated. thanks for reading.͏ π’ͺ
176
52
567
74,256
Jan 20
too many ai slop articles people are trying to pass off like they wrote

ALT Surprised Kenan Thompson GIF

34
Jan 20
if you building apps on @Cloudflare, which you should be, give this a read.
Our latest research is out! If you missed a good write-up for nice vulnerabilities, I brought you one! Enjoy the reading! @FearsOff @Cloudflare
1
61
Jan 16
if you don't know, now you know

ALT Big Poppa GIF

- local llms 101 - running a model = inference (using model weights) - inference = predicting the next token based on your input plus all tokens generated so far - together, these make up the "sequence" - tokens β‰  words - they're the chunks representing the text a model sees - they are represented by integers (token IDs) in the model - "tokenizer" = the algorithm that splits text into tokens - common types: BPE (byte pair encoding), SentencePiece - token examples: - "hello" = 1 token or maybe 2 or 3 tokens - "internationalization" = 5–8 tokens - context window = max tokens model can "see" at once (2K, 8K, 32K ) - longer context = more VRAM for KV cache, slower decode - during inference, the model predicts next token - by running lots of math on its "weights" - model weights = billions of learned parameters (the knowledge and patterns from training) - model parameters: usually billions of numbers (called weights) that the model learns during training - these weights encode all the model's "knowledge" (patterns, language, facts, reasoning) - think of them as the knobs and dials inside the model, specifically computed to recognize what could come next - when you run inference, the model uses these parameters to compute its predictions, one token at a time - every prediction is just: model weights current sequence β†’ probabilities for what comes next - pick a token, append it, repeat, each new token becomes part of the sequence for the next prediction - models are more than weight files - neural network architecture: transformer skeleton (layers, heads, RoPE, MQA/GQA, more below) - weights: billions of learned numbers (parameters, not "tokens", but calculated from tokens) - tokenizer: how text gets chunked into tokens (BPE/SentencePiece) - config: metadata, shapes, special tokens, license, intended use, etc - sometimes: chat template are required for chat/instruct models, or else you get gibberish - you give a model a prompt (your text, converted into tokens) - models differ in parameter size: - 7B means ~7 billion learned numbers - common sizes: 7B, 13B, 70B - bigger = stronger, but eats more VRAM/memory & compute - the model computes a probability for every possible next token (softmax over vocab) - picks one: either the highest (greedy) or - samples from the probability distribution (temperature, top-p, etc) - then appends that token to the sequence, then repeats the whole process - this is generation: - generate; predict, sample, append - over and over, one token at a time - rinse and repeat - each new token depends on everything before it; the model re-reads the sequence every step - generation is always stepwise: token by token, not all at once - mathematically: model is a learned function, f_ΞΈ(seq) β†’ p(next_token) - all the "magic" is just repeating "what's likely next?" until you stop - all conversation "tokens" live in the KV cache, or the "session memory" - so what's actually inside the model? - everything above-tokens, weights, config-is just setup for the real engine underneath - the core of almost every modern llm is a transformer architecture - this is the skeleton that moves all those numbers around - it's what turns token sequences and weights into predictions - designed for sequence data (like language), - transformers can "look back" at previous tokens and - decide which ones matter for the next prediction - transformers work in layers, passing your sequence through the same recipe over and over - each layer refines the representation, using attention to focus on the important parts of your input and context - every time you generate a new token, it goes through this stack of layers-every single step - inside each transformer layer: - self-attention: figures out which previous tokens are important to the current prediction - MLPs (multi-layer perceptrons): further process token representations, adding non-linearity and expressiveness - layer norms and residuals: stabilize learning and prediction, making deep networks possible - positional encodings (like RoPE): tell the model where each token sits in the sequence - so "cat" and "catastrophe" aren't confused by position - by stacking these layers (sometimes dozens or even hundreds) - transformers build a complex understanding of your prompt, context, and conversation history - transformer recap: - decoder-only: model only predicts what comes next, each token looks back at all previous tokens - self-attention picks what to focus on (MQA/GQA = efficient versions for less memory) - feed-forward MLP after attention for every token (usually 2 layers, GELU activation) - everything's wrapped in layer norms linear layers (QKV projections, MLPs, outputs) - residuals norms = stable, trainable, no exploding/vanishing gradients - RoPE (rotary embeddings): tells the model where each token sits in the sequence - stack N layers of this β†’ final logits β†’ pick the next token - scale up: more layers, more heads, wider MLPs = bigger brains - VRAM: memory, the bottleneck - VRAM must must fit: 1. weights (main model, whether quantized or not) 2. KV cache (per token, per layer, per head) - weights: - FP16: ~2 bytes/param β†’ 7B = ~14GB - 8-bit: ~1 byte/param β†’ 7B = ~7GB - 4-bit: ~0.5 byte/param β†’ 7B = ~3.5GB - add 10–30% for runtime overheads - KV cache: - rule of thumb: 0.5MB per token (Llama-like 7B, 32 layers, 4K tokens = ~2GB) - some runtimes support KV cache quantization (8/4-bit) = big savings - throughput = memory bandwidth GPU FLOPs attention implementation (FlashAttention/SDPA help) quantization batch size - offload to CPU? expect MASSIVE slowdown - GPU or bust: CPUs run quantized models (slow), but any real context/model needs CUDA/ROCm/Metal - CPU spill = sadness (check device_map and memory fit) - quantization: reduce precision for memory wins (sometimes a tiny quality hit) - FP32/FP16/BF16 = full/floored - INT8/INT4/NF4 = quantized - 4-bit (NF4/GPTQ/AWQ) = sweet spot for most consumer GPUs (big memory win, small quality hit for most tasks) - math-heavy or finicky tasks degrade first (math, logic, coding) - KV cache quantization: even more memory saved for long contexts (check runtime support) - formats/runtimes: - PyTorch safetensors: flexible, standard, GPU/TPU/CPU - GGUF (llama.cpp): CPU/GPU/portable, best for quant edge devices - ONNX, TensorRT-LLM, MLC: advanced flavors for special hardware/use - protip: avoid legacy .bin (pickle risk), use safetensors for safety - everything is a tradeoff - smaller = fits anywhere, less power - more context = more latency VRAM burn - quantization = speed/memory, but maybe less accurate - local = more control/knobs, more work - what happens when you "load a model"? - download weights, tokenizer, config - resolve license/trust (don't use trust_remote_code unless you really trust the author) - load to VRAM/CPU (check memory fit) - warmup: kernels/caches initialized, first pass is slowest - inference: forward passes per token, updating KV cache each step - decoding = how next token is chosen: - greedy: always top-1 (robotic) - temperature: softens or sharpens probabilities (higher = more random) - top-k: pick from top k - top-p: pick from smallest set with β‰₯p prob - typical sampling, repetition penalty, no-repeat n-gram: extra controls - deterministic = set a seed and no sampling - tune for your use-case: chat, summarization, code - serving options? - vLLM for high throughput, parallel serving - llama.cpp server (OpenAI-compatible API) - ExLlama V2/V3 w/ Tabby API (OpenAI-compatible API) - run as a local script (CLI) - FastAPI/Flask for local API endpoint - local β‰  offline; run it, serve it, or build apps on top - fine-tuning, ultra-brief: - LoRA / QLoRA = adapter layers (efficient, minimal VRAM) - still need a dataset and eval plan; adapters can be merged or kept separate - most users get far with prompting retrieval (RAG) or few-shot for niche tasks - common pitfalls - OOM? out of memory. Model or context too big, quantize or shrink context - gibberish? used a base model with a chat prompt, or wrong template; check temperature/top_p - slow? offload to CPU, wrong drivers, no FlashAttention; check CUDA/ROCm/Metal, memory fit - unsafe? don't use random .bin or trust_remote_code; prefer safetensors, verify source - why run locally? - control: all the knobs are yours to tweak: - sampler, chat templates, decoding, system prompts, quantization, context - cost: no per-token API billing-just upfront hardware - privacy: prompts and outputs stay on your machine - latency: no network roundtrips, instant token streaming - challenges: - hardware limits (VRAM/memory = max model/context) - ecosystem variance (different runtimes, quant schemes, templates) - ops burden (setup, drivers, updates) - running local checklist: - pick a model (prefer chat-tuned, sized for your VRAM) - pick precision (4-bit saves RAM, FP16 for max quality) - install runtime (vLLM, llama.cpp, Transformers PyTorch, etc) - run it, get tokens/sec, check memory fit - use correct chat template (apply_chat_template) - tune decoding (temp/top_p) - benchmark on your task - serve as local API (or go wild and fine-tune it) - glossary: - token: smallest unit (subword/char) - context window: max tokens visible to model - KV cache: session memory, per-layer attention state - quantization: lower precision for memory/speed - RoPE: rotary position embeddings (for order) - GQA/MQA: efficient attention for memory bandwidth - decoding: method for picking next token - RAG: retrieval-augmented generation, add real info - misc: - common architectures: LLaMA, Falcon, Mistral, GPT-NeoX, etc - base model: not fine-tuned for chat (LLaMA, Falcon, etc) - chat-tuned: fine-tuned for dialogue (Alpaca, Vicuna, etc) - instruct-tuned: fine-tuned for following instructions (LLaMA-2-Chat, Mistral-Instruct, etc) - chat/instruct models usually need a special prompt template to work well - chat template: system/user/assistant markup is required; wrong template = junk output - base models can do few-shot chat prompting, but not as well as chat-tuned ones - quantized: weights stored in lower precision (8-bit, 4-bit) for memory savings, at some quality loss - quantization is a tradeoff: memory/speed vs quality - 4-bit (NF4/GPTQ/AWQ) is the sweet spot for most consumer GPUs (huge memory win, minor quality drop for most tasks) - math-heavy or finicky tasks degrade first (math, logic, code) - quantization types: FP16 (full), INT8 (quantized), INT4/NF4 (more quantized), etc. - some runtimes support quantized KV cache (8/4-bit), big savings for long contexts - formats/runtimes: - PyTorch safetensors: flexible, standard, works on GPU/TPU/CPU - GGUF (llama.cpp): CPU/GPU, portable, best for quant edge devices - ONNX, TensorRT-LLM, MLC: advanced options for special hardware - avoid legacy .bin (pickle risk), use safetensors for safety - everything is a tradeoff: - smaller = fits anywhere, less power - more context = more latency VRAM burn - quantization = faster/leaner, maybe less accurate - local = full control/knobs, but more work - final words: - local LLMs = memory math correct formatting - fit weights and KV cache in memory - use the right chat template and decoding strategy - know your knobs: quantization, context, decoding, batch, hardware - master these, and you can run (and reason about) almost any modern model locally
1
1
51
Jan 12
be mindful of your consumption choose wisely this was worth the time if you're into the tooling
Kair Powered by @Adobe @AdobeFirefly Editing @AdobeVideo Premiere Pro Image Retouch @Photoshop
1
1
162
Jan 8
have you been ralphin' in secret raise your hand? #claudeai
66
Jan 6
you know you don't have to pay to a tool to crawl websites right? it is easy to build your own.
56
Jan 6
i'm starting to think Anthropic did something to X. my feed mentions Claude Code too much. the past 2 days, vibe code this, Claude Code that. please make it stop @grok πŸ€ͺ🀯
1
87
23 Dec 2025
gm 2 weeks. no electronics. no socials. no stimulus. just me and my thoughts. i'm back...now what?
1
106
11 Dec 2025
gm, push culture rarely play with other people's prompts, but this one is beautiful. want to make sure to give credit to @_MehdiSharifi_ quality prompt. thanks for sharing🫢
Nano Banana Pro | Gemini 3.0 A cyber-grunge surveillance fashion editorial featuring a cool, edgy young woman in her early 20s with a short chin-length bob haircut, straight with slight texture and casual bangs, partially obscured by thick black sunglasses and holding an iced coffee/chocolate plastic cup with a straw in her left hand and a half-eaten sandwich/pastry in her right hand. She wears an oversized white t-shirt with rust-red raglan sleeves and a small chest logo, tucked loosely into black, baggy carpenter/cargo denim jeans with white stitching details, and burgundy loafers, accessorized with a gold pendant necklace. She is captured in a full body walking stride, relaxed posture, facing slightly off-center with a casual, indifferent expression, head turned slightly to the side, looking downward, frozen mid-stride in an urban paved plaza with grey concrete and tiled pavement under bright sunny late afternoon golden hour light, creating high contrast shadows and long diagonal shadows to the left. The image is in 8K Ultra-HD quality, hyperrealistic, with a 4:5 aspect ratio and 1440x1920 resolution, styled in raw photoreal high fidelity, full color with cool urban tones and vibrant red UI accents, high-contrast daylight. The composition includes a main full body shot with multiple zoom-in crops (face/drink, torso, pants leg) in a fragmented layout, overlaid with tactical CCTV UI elements like red bounding boxes, telemetry data, and digital HUD overlays, simulating high-end surveillance footage with a 'Big Brother' observation vibe, candid urban documentation, and crisp daylight realism mixed with digital graphic design elements. The background is a sharp focus concrete cityscape with harsh shadow lines, and the lighting is natural harsh sunlight from a high angle, creating deep black cast shadows against bright pavement, with digital grain and scanline imperfections. The color grading features neutral urban greys, white, rust-red, denim black, and striking bright crimson red UI elements, with post-processing including high sharpness and red vector graphics overlays such as numbers '19 5 3 21 18 9 20 25', text 'CCWW', 'TR521', timecode '18/02', and bounding boxes with hashtags '#83575//' and '#25747//'. The overall theme is urban surveillance, Y2K streetwear, tactical data visualization, candid fashion moment, and dystopian chic, with a mood of cool detached observation, urban nonchalance, chaotic data stream, and privacy-invasion chic, captured from a high-angle surveillance perspective with a 35mm to 50mm lens, deep focus, and a layout of one main image plus three inset detail crops connected by red tactical line art and crosshairs.
5
2
7
413
9 Dec 2025
gm algo, put this in front of someone interestingπŸ‘€
5
1
3
146
9 Dec 2025
hey @grok how do i get my content in front of more people? open to suggestions
1
1
74
7 Dec 2025
coding is solved distribution is solved the barriers are gone now there is nowhere to hide. the only question left is:
118
7 Dec 2025
gm🫢
2 Dec 2025
gm, you ready for 2026? #LLPr #IYKYK
1
2
94