waterloo intern

Tweets

Pinned Tweet

waterloo intern

@waterloo_intern

Mar 7

- 230 training runs
- 1,623 GPU hours (67 B200 days)
- 76 TB of training data
- a 2x faster model
Every paper said it can't be done. Quantization Aware Distillation made it possible.

waterloo intern

@waterloo_intern

Mar 7

x.com/i/article/202980121700…

104

1,199

154,022

waterloo intern

@waterloo_intern

Jun 5

anyways if you work for an inference lab you get 17x8 B200 nodes to try stuff out ... ~1.2 EFLOPS FP8 for an intern

roon

@tszzl

Jun 4

there is no such thing as running out of compute. for the right price someone will sell you compute. it’s an elastic resource like all other markets. when RSI arrives running that program will be so valuable that all clouds will mostly shut down and sell compute to the singularity

404

50,840
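The ~1.2 EFLOPS figure in the Jun 5 tweet above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not the author's own calculation: the only inputs taken from the tweet are the 17 nodes of 8 B200 GPUs each. The per-GPU FP8 peaks (about 9 PFLOPS with structured sparsity, about 4.5 PFLOPS dense) are assumptions, and all names are illustrative.

```python
# Only the 17 x 8 node layout comes from the tweet; everything else is assumed.
NODES = 17
GPUS_PER_NODE = 8
PFLOPS_FP8_SPARSE_PER_GPU = 9.0  # assumption: peak FP8 with structured sparsity
PFLOPS_FP8_DENSE_PER_GPU = 4.5   # assumption: half of the sparse peak


def cluster_eflops(nodes: int, gpus_per_node: int, pflops_per_gpu: float) -> float:
    """Aggregate peak throughput in EFLOPS (1 EFLOPS = 1000 PFLOPS)."""
    return nodes * gpus_per_node * pflops_per_gpu / 1000.0


total_gpus = NODES * GPUS_PER_NODE
sparse_eflops = cluster_eflops(NODES, GPUS_PER_NODE, PFLOPS_FP8_SPARSE_PER_GPU)
dense_eflops = cluster_eflops(NODES, GPUS_PER_NODE, PFLOPS_FP8_DENSE_PER_GPU)

print(f"{total_gpus} GPUs")
print(f"sparse peak: {sparse_eflops:.3f} EFLOPS")
print(f"dense peak:  {dense_eflops:.3f} EFLOPS")
```

Under these assumptions the quoted ~1.2 EFLOPS lines up with the sparse peak; the dense peak would be about half of that.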

waterloo intern

@waterloo_intern

Jun 2

ok we did (grudgingly) go outside while the agents worked

waterloo intern

@waterloo_intern

May 31

> drive 4 hours to yosemite
> take one picture to say we did it
> spend the rest of day in the airbnb
> everyone needed to check in on the agents

2,738

waterloo intern

@waterloo_intern

May 31

> drive 4 hours to yosemite
> take one picture to say we did it
> spend the rest of day in the airbnb
> everyone needed to check in on the agents

15,320

waterloo intern

@waterloo_intern

May 29

> don't just optimize intra-step of a generative model
remove the step completely

Yikai Zhu

@YikaiZhu98

May 29

x.com/i/article/206040169234…

4,272

waterloo intern

@waterloo_intern

May 29

> go to waterloo
> become senior intern
> retire on graduation

4,216

waterloo intern

@waterloo_intern

May 27

if you can't guess the kernel, you're not locked in enough

310

34,182

waterloo intern

@waterloo_intern

May 23

tech bros deriving islam from first principles

gabriel

@gabriel1

May 22

i challenge anyone who listens to music for 6 hours a day to quit for a week to:
1) realize it's an addiction
2) realize how much better your thoughts become

183

3,118

242,839

waterloo intern

@waterloo_intern

May 23

x.com/i/article/205255091335…

220

18,837

waterloo intern

@waterloo_intern

May 17

tokenmaxing

2,507

waterloo intern

@waterloo_intern

May 8

infra work is interesting

814

waterloo intern

@waterloo_intern

May 2

x.com/i/article/205045393160…

679

waterloo intern

@waterloo_intern

Apr 28

this was so fun to work on, i hope you find it useful
tried @baseten for GPU access?

Jino Rohit

@jino_rohit

Apr 28

im making a decision to switch to blackwell than hopper since the 5090s are more affordable. i was learning WGMMA and renting h100 was getting too expensive :( what are some affordable options to rent among @vast_ai @modal etc

1,015

waterloo intern

@waterloo_intern

Apr 3

we dug into 1-bit bonsai with @part_harry_
the grand canyon of a gap they showed... is just THREE (3) points away from normal PTQ
but they already knew that
here's the graph (fixed)

PrismML

@PrismML

Mar 31

Replying to @PrismML

This scatter plot shows the Pareto frontier of intelligence vs. size, defined by models like Qwen3 0.6B, 1.7B, 4B, 8B, and Ministral3 3B. The 1-bit Bonsai family shifts that frontier dramatically to the left. This changes the tradeoff itself: models no longer have to be large to be capable.

100

17,562

waterloo intern

@waterloo_intern

Mar 30

on-policy for the student
off-policy for the teacher
monkey input, monkey output

Harry Partridge

@part_harry_

Mar 30

x.com/i/article/203841175985…

3,955

waterloo intern

@waterloo_intern

Mar 26

x.com/i/article/203709315350…

629

73,578

waterloo intern

@waterloo_intern

Mar 11

was skeptical but gave it a shot because @karpathy
anyways
2x kernel perf (fp4 matmul)
3 minutes of work (1 prompt)
triton beat cutlass (?!)

Jaber

@Akashi203

Mar 11

i open-sourced autokernel -- autoresearch for GPU kernels
you give it any pytorch model. it profiles the model, finds the bottleneck kernels, writes triton replacements, and runs experiments overnight. edit one file, benchmark, keep or revert, repeat forever. same loop as @karpathy autoresearch, applied to kernel optimization
95 experiments. 18 TFLOPS → 187 TFLOPS. 1.31x vs cuBLAS. all autonomous
9 kernel types (matmul, flash attention, fused mlp, layernorm, rmsnorm, softmax, rope, cross entropy, reduce). amdahl's law decides what to optimize next. 5-stage correctness checks before any speedup counts
the agent reads program.md (the "research org code"), edits kernel.py, runs bench.py, and either keeps or reverts. ~40 experiments/hour. ~320 overnight
ships with self-contained GPT-2, LLaMA, and BERT definitions so you don't need the transformers library to get started
github.com/RightNow-AI/autok…

243

39,939
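The quoted tweet above describes two ideas: Amdahl's law deciding which kernel to optimize next, and an "edit, benchmark, keep or revert" loop gated on correctness. The toy below is a minimal sketch of those two ideas only. It is not autokernel's code or API: every function name, the runtime shares, and the fake benchmark are hypothetical stand-ins chosen so the example runs without a GPU.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime becomes `kernel_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def pick_next_kernel(time_share: dict, assumed_speedup: float = 2.0) -> str:
    """Pick the kernel whose optimization would help total runtime the most."""
    return max(time_share, key=lambda k: amdahl_speedup(time_share[k], assumed_speedup))


def keep_or_revert(baseline, baseline_ms, candidates, benchmark, is_correct):
    """Greedy loop: keep a candidate only if it is correct and strictly faster."""
    best, best_ms = baseline, baseline_ms
    for candidate in candidates:
        if not is_correct(candidate):
            continue  # correctness gate: no speedup counts without it
        ms = benchmark(candidate)
        if ms < best_ms:
            best, best_ms = candidate, ms  # keep
        # otherwise revert: the candidate is simply dropped
    return best, best_ms


# Hypothetical profile: share of total runtime per kernel.
shares = {"matmul": 0.6, "layernorm": 0.3, "softmax": 0.1}
target = pick_next_kernel(shares)

# Toy "kernel" = a tile size; toy benchmark is fastest at tile size 64.
fake_benchmark = lambda tile: abs(tile - 64) + 10.0
power_of_two = lambda tile: tile & (tile - 1) == 0
best_tile, best_ms = keep_or_revert(
    baseline=8,
    baseline_ms=fake_benchmark(8),
    candidates=[16, 32, 64, 128, 48],
    benchmark=fake_benchmark,
    is_correct=power_of_two,
)
print(target, best_tile, best_ms)
```

A real loop would replace `fake_benchmark` with timed GPU runs and `power_of_two` with numerical checks against a reference implementation; the greedy accept rule is the part this sketch is meant to show.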

waterloo intern

@waterloo_intern

Mar 12

with cudagraph enabled, the gains are not as dramatic but triton still technically outperformed the cutlass kernel in production

1,204

waterloo intern

@waterloo_intern

Mar 9

every year i go through a phase where i re-learn eigenvectors

622

waterloo intern

@waterloo_intern

Mar 7

x.com/i/article/202980121700…

394

189,359