Praveen Vaddadi

Tweets

Praveen Vaddadi @Densebit

ReLU: _mm256_setzero_ps is called outside the loop, providing an instantaneous, memory-free floor at zero via _mm256_max_ps. GELU: SIMD-unroll the 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) algebraic curve func. interleave re/im with _mm256_shuffle_ps.
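A minimal sketch of the ReLU note, assuming GCC or Clang on x86-64. The names `relu_inplace`, `relu_avx`, and `relu_scalar` are hypothetical, and the runtime dispatch with a scalar fallback is an addition for portability, not something the notes describe.

```c
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define RELU_HAVE_AVX 1
#endif

/* Portable fallback: clamp negatives to zero one element at a time. */
static void relu_scalar(float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] < 0.0f) x[i] = 0.0f;
    }
}

#ifdef RELU_HAVE_AVX
/* The target attribute lets this compile without -mavx on the command line. */
__attribute__((target("avx")))
static void relu_avx(float *x, size_t n) {
    /* Zero vector is built once, outside the loop, and lives in a register. */
    const __m256 zero = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);               /* unaligned load of 8 floats */
        _mm256_storeu_ps(x + i, _mm256_max_ps(v, zero)); /* max(v, 0) per lane */
    }
    for (; i < n; i++) {                                 /* scalar tail, fewer than 8 left */
        if (x[i] < 0.0f) x[i] = 0.0f;
    }
}
#endif

/* Hypothetical entry point: picks the vector path when the CPU supports it. */
void relu_inplace(float *x, size_t n) {
#ifdef RELU_HAVE_AVX
    if (__builtin_cpu_supports("avx")) {
        relu_avx(x, n);
        return;
    }
#endif
    relu_scalar(x, n);
}
```

The array length does not need to be a multiple of 8; the tail loop handles the remainder.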

Praveen Vaddadi @Densebit

rnn state transitions encoded as bitflags; unsigned integer wraparound (0 - 1 = UINT_MAX) gets us a free wrap index; branchless, inlining, etc. std. simult. mean/var (Welford's algorithm) for layernorm; reuse buffers for backward passes; _mm256_loadu_ps/_mm256_storeu_ps > memcpy.
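A sketch of the single-pass mean/variance step, using Welford's update. The names `mean_var` and `welford_mean_var` are hypothetical. Dividing by n (population variance) is an assumption, chosen because layer normalization conventionally uses the biased estimate.

```c
#include <stddef.h>

typedef struct {
    float mean;
    float var;   /* population variance: sum of squared deviations / n */
} mean_var;

/* One pass over x; no second sweep to compute deviations from the mean. */
mean_var welford_mean_var(const float *x, size_t n) {
    double mean = 0.0;
    double m2 = 0.0;                          /* running sum of squared deviations */
    for (size_t i = 0; i < n; i++) {
        double delta = (double)x[i] - mean;   /* deviation from the old mean */
        mean += delta / (double)(i + 1);
        m2 += delta * ((double)x[i] - mean);  /* old deviation times new deviation */
    }
    mean_var r;
    r.mean = (float)mean;
    r.var = n ? (float)(m2 / (double)n) : 0.0f;
    return r;
}
```

Layernorm would then scale each element by (x - mean) / sqrt(var + eps) in a second, fused pass.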

Praveen Vaddadi @Densebit

for autograd, flag nodes in the DAG so backward traversal returns early (sever branches in backprop). hardcode identity ops to save cycles. bitshift 16-bit sparse tensor indices to pack them in a uint64. fast sgemm/dgemm. _mm256_xor_ps breaks dep chains.
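The index-packing idea can be sketched as follows. The helper names `pack_idx4` and `unpack_idx` are hypothetical, and the layout (dimension 0 in the lowest 16 bits) is an assumption, since the notes do not specify an order.

```c
#include <stdint.h>

/* Pack four 16-bit coordinates of a sparse entry into one 64-bit word. */
static inline uint64_t pack_idx4(uint16_t i0, uint16_t i1, uint16_t i2, uint16_t i3) {
    return (uint64_t)i0
         | ((uint64_t)i1 << 16)
         | ((uint64_t)i2 << 32)
         | ((uint64_t)i3 << 48);
}

/* Extract coordinate `dim` (0..3) by shifting it down and truncating to 16 bits. */
static inline uint16_t unpack_idx(uint64_t packed, unsigned dim) {
    return (uint16_t)(packed >> (16u * (dim & 3u)));
}
```

One side effect of this layout: comparing two packed words as integers orders entries by the highest dimension first, which may or may not match the traversal order a kernel wants.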

Praveen Vaddadi @Densebit

_mm256_cvtph_ps upcasts fp16 to fp32 in register (_mm256_cvtps_ph is the down-cast; bf16 widens with a 16-bit left shift instead). avoid libm (exp, tanh...) - use poly approx. from computer approximations book. Before exp(), do a _mm256_max_ps sweep to find the max value and subtract it from all elements, so the maximum input to exp() is exactly 0.0.
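For bf16 specifically, widening to fp32 is a plain bit shift, because bf16 is the upper half of an IEEE-754 binary32 value. A scalar sketch follows; the function names are hypothetical and the truncating down-cast is a simplification (production code usually rounds to nearest even).

```c
#include <stdint.h>
#include <string.h>

/* bf16 -> fp32: place the 16 stored bits in the high half, zero the low half. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* type-pun without violating aliasing rules */
    return f;
}

/* fp32 -> bf16 by truncation: drop the low 16 mantissa bits (no rounding). */
uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}
```

A vector version would apply the same shift per lane, for example with `_mm256_cvtepu16_epi32` followed by `_mm256_slli_epi32` and `_mm256_castsi256_ps`.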

Praveen Vaddadi @Densebit

allocate the (calculated) abs peak memory required by the model graph at startup. put an 8/16-byte header before the tensor payload carrying meta info like refcounts, dimensions etc. zero-copy mmap (serde.h) OR blast from fopen/fseeko... into arena buffers.
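A sketch of the header-before-payload layout on top of a bump arena. Everything named here (`tensor_hdr`, `arena`, `arena_tensor_alloc`, the 16-byte header fields) is hypothetical, and `aligned_alloc` stands in for whatever backing store the engine actually uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TENSOR_ALIGN 32u   /* payload alignment, suitable for 256-bit loads */

typedef struct {           /* 16-byte header stored directly before the payload */
    uint32_t refcount;
    uint32_t ndim;
    uint64_t nbytes;
} tensor_hdr;

typedef struct {
    unsigned char *base;
    size_t cap;
    size_t used;
} arena;

static size_t align_up(size_t v, size_t a) { return (v + a - 1) & ~(a - 1); }

/* Reserve the whole (precomputed) peak once at startup. Returns 1 on success. */
int arena_init(arena *a, size_t peak_bytes) {
    a->cap = align_up(peak_bytes ? peak_bytes : 1, TENSOR_ALIGN);
    a->used = 0;
    a->base = aligned_alloc(TENSOR_ALIGN, a->cap);
    return a->base != NULL;
}

/* Bump-allocate header + payload; returns the 32-byte-aligned payload or NULL. */
void *arena_tensor_alloc(arena *a, uint32_t ndim, size_t nbytes) {
    size_t payload = align_up(a->used + sizeof(tensor_hdr), TENSOR_ALIGN);
    if (payload > a->cap || nbytes > a->cap - payload) return NULL;
    tensor_hdr *h = (tensor_hdr *)(a->base + payload - sizeof(tensor_hdr));
    h->refcount = 1;
    h->ndim = ndim;
    h->nbytes = nbytes;
    a->used = payload + nbytes;
    return a->base + payload;
}

/* Recover the header from a payload pointer by stepping back one header. */
tensor_hdr *tensor_header(void *payload) {
    return (tensor_hdr *)((unsigned char *)payload - sizeof(tensor_hdr));
}

void arena_destroy(arena *a) {
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}
```

Because the header sits at a fixed negative offset, kernels can take a bare payload pointer and still reach the metadata without a separate lookup table.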

Praveen Vaddadi @Densebit

int8 quant; inflate to fp32 at the avx2 exec unit. _mm256_cvttps_epi32 crushes frac data into ints. _mm_packus_epi32 to push 32-bit product ints back down to 16 bits. lazy fuse chained ops into a single loop pass.
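A scalar sketch of the dequantize, compute, requantize round trip in one loop pass. The function and parameter names are hypothetical. The float-to-int cast truncates toward zero, which is the same rounding `_mm256_cvttps_epi32` applies; unlike `_mm_packus_epi32` (unsigned 16-bit saturation), this sketch saturates straight to signed int8 for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate into the int8 range. */
static int8_t saturate_i8(int32_t v) {
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/*
 * out[i] = quantize(dequantize(in[i]) * gain), all in a single pass.
 * Assumes the scaled value fits in int32_t before saturation, which holds
 * for int8 inputs with moderate scales.
 */
void q8_fused_scale(const int8_t *in, int8_t *out, size_t n,
                    float in_scale, float gain, float inv_out_scale) {
    for (size_t i = 0; i < n; i++) {
        float x = (float)in[i] * in_scale;        /* inflate int8 -> fp32 */
        float y = x * gain;                       /* the fp32 compute step */
        int32_t q = (int32_t)(y * inv_out_scale); /* truncate toward zero */
        out[i] = saturate_i8(q);                  /* push back down to int8 */
    }
}
```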

Praveen Vaddadi @Densebit

mmap-based allocator with tensor payloads (2d, 3d, 4d, ...) as a flat 1d block (32-byte aligned). buddy alloc is faster. some dead padding between thread-local vars to avoid false sharing.
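Storing every tensor as a flat 1d block means an N-dimensional index has to be folded into a single offset. A sketch assuming row-major (last dimension contiguous) layout, which the notes do not state; `flat_offset` is a hypothetical name.

```c
#include <stddef.h>

/*
 * Fold an ndim-dimensional index into an element offset in a flat buffer.
 * Row-major: offset = ((i0 * s1 + i1) * s2 + i2) * ... for shape (s0, s1, s2, ...).
 */
size_t flat_offset(const size_t *shape, const size_t *index, size_t ndim) {
    size_t off = 0;
    for (size_t d = 0; d < ndim; d++) {
        off = off * shape[d] + index[d];
    }
    return off;
}
```

For a 2 x 3 x 4 tensor, index (1, 2, 3) maps to (1 * 3 + 2) * 4 + 3 = 23, the last element of the block.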

Praveen Vaddadi @Densebit

a unified tensor abstraction layer (for CPU and optional GPU) with void* buffers and dynamically mapped func pointers so same C DAG routes both execution paths.
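One way to picture this: the tensor holds an opaque buffer plus a pointer to a table of kernels, and graph code only ever calls through the table. The types and names below (`backend`, `tensor`, `CPU_BACKEND`, `tensor_add`) are hypothetical; a GPU backend would supply its own table with device pointers behind the same `void *`.

```c
#include <stddef.h>

/* Table of kernels; each backend fills in its own implementations. */
typedef struct backend {
    const char *name;
    void (*add_f32)(const void *a, const void *b, void *out, size_t n);
} backend;

/* The tensor never exposes what its buffer really is. */
typedef struct tensor {
    void *data;          /* host pointer for CPU; could be a device handle */
    size_t n;            /* element count */
    const backend *be;   /* which execution path owns this buffer */
} tensor;

static void cpu_add_f32(const void *a, const void *b, void *out, size_t n) {
    const float *fa = a;
    const float *fb = b;
    float *fo = out;
    for (size_t i = 0; i < n; i++) fo[i] = fa[i] + fb[i];
}

const backend CPU_BACKEND = { "cpu", cpu_add_f32 };

/* Graph-level op: checks operands agree, then dispatches via the pointer table. */
int tensor_add(const tensor *a, const tensor *b, tensor *out) {
    if (a->be != b->be || a->be != out->be) return -1;   /* mixed backends */
    if (a->n != b->n || a->n != out->n) return -1;       /* size mismatch */
    a->be->add_f32(a->data, b->data, out->data, a->n);
    return 0;
}
```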

Praveen Vaddadi @Densebit

thread count = core count, persistent thread pool; lock the CPU's rounding/denormal behavior first (DAZ, FTZ, etc. to system state?)
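On x86-64 the FTZ and DAZ switches are bits in the MXCSR register (bit 15 and bit 6), which is per-thread register state rather than a system-wide setting, so each pool worker would set it for itself. A sketch with hypothetical function names; on other architectures it compiles to a no-op that reports 0.

```c
#if defined(__x86_64__)
#include <xmmintrin.h>
#define FP_HAVE_MXCSR 1
#endif

#define MXCSR_FTZ 0x8000u   /* flush-to-zero: denormal results become 0 */
#define MXCSR_DAZ 0x0040u   /* denormals-are-zero: denormal inputs read as 0 */

/* Enable FTZ and DAZ for the calling thread. Returns 1 if applied, else 0. */
int fp_flush_denormals_enable(void) {
#ifdef FP_HAVE_MXCSR
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
    return 1;
#else
    return 0;
#endif
}

/* Report whether both bits are currently set for the calling thread. */
int fp_flush_denormals_enabled(void) {
#ifdef FP_HAVE_MXCSR
    unsigned int want = MXCSR_FTZ | MXCSR_DAZ;
    return (_mm_getcsr() & want) == want;
#else
    return 0;
#endif
}
```

A worker thread would call `fp_flush_denormals_enable()` once at the top of its run loop, before touching any tensor data.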

Praveen Vaddadi @Densebit

#notes The anatomy of a tiny neural network compute engine (in C) for CPU-bound inference & training. why: pytorch and tf are too heavy for executing sequence models like gpt/rwkv etc. how: a lazy DAG that compiles to fused avx2 with int8 quantization, work-stealing concurrency.
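To make "lazy, then fused" concrete, here is a deliberately tiny sketch: elementwise ops are only recorded when requested, and the whole chain runs later in a single pass over the data. The op set, `op_chain`, `chain_push`, and `chain_run` are all hypothetical; a real engine would record a DAG and emit vectorized kernels rather than interpret a switch per element.

```c
#include <stddef.h>

typedef enum { OP_ADD_CONST, OP_MUL_CONST, OP_RELU } op_kind;

typedef struct {
    op_kind kind;
    float arg;           /* unused for OP_RELU */
} lazy_op;

#define MAX_CHAIN 8

typedef struct {
    lazy_op ops[MAX_CHAIN];
    size_t count;
} op_chain;

/* Record an op without executing anything. Returns 0 on success, -1 if full. */
int chain_push(op_chain *c, op_kind kind, float arg) {
    if (c->count >= MAX_CHAIN) return -1;
    c->ops[c->count].kind = kind;
    c->ops[c->count].arg = arg;
    c->count++;
    return 0;
}

/* Execute the recorded chain: one read and one write per element, no temporaries. */
void chain_run(const op_chain *c, const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float v = in[i];
        for (size_t k = 0; k < c->count; k++) {
            switch (c->ops[k].kind) {
            case OP_ADD_CONST: v += c->ops[k].arg; break;
            case OP_MUL_CONST: v *= c->ops[k].arg; break;
            case OP_RELU:      if (v < 0.0f) v = 0.0f; break;
            }
        }
        out[i] = v;
    }
}
```

Running three ops unfused would stream the array through memory three times; the fused pass touches each element once.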

Praveen Vaddadi @Densebit

Jun 12

Polyglot wrappers always make good DSLs.

Praveen Vaddadi @Densebit

May 30

The problems in tech aren't really about the tech.

Praveen Vaddadi @Densebit

May 23

smbc-comics.com/comic/2012-0…

Saturday Morning Breakfast Cereal

SMBC is a daily comic strip about life, philosophy, science, mathematics, and dirty jokes.

smbc-comics.com

Praveen Vaddadi @Densebit

May 18

You can write a surprisingly large class of software under 1.44 MB (a floppy): editors, databases, games, even operating systems. Beyond that, it is usually more abstraction paying for excess.

Praveen Vaddadi @Densebit

May 8

Natural language is little-endian: lower-significance bits come first as shared context, and higher-significance bits come later as the informative payload.

Praveen Vaddadi retweeted

Gynvael Coldwind @gynvael

Mar 24

This 1-pager from Xusheng Li on GDB internals of how watchpoints are implemented is a delight to read! (especially that double-write behaviour false positive - I did not know about that)

Praveen Vaddadi @Densebit

Mar 23

github.com/xtellect/spaces is a C allocator with explicit heap regions. It can work as a drop-in allocator, or give each subsystem its own heap, cap it, inspect it, and destroy it in one step.

GitHub - xtellect/spaces: A high-performance C allocator with explicit heap regions, fragmentation...

A high-performance C allocator with explicit heap regions, fragmentation control, and runtime tuning. - xtellect/spaces

github.com

Praveen Vaddadi @Densebit

Mar 11

out of ignorance, comes certainty - Anonymous

Praveen Vaddadi @Densebit

Mar 8

codegen: ai no longer makes things, it makes things up.

Praveen Vaddadi @Densebit

Mar 3

In AI, a fab/hardware bet may actually be cheaper than backing yet another software startup.

Praveen Vaddadi @Densebit

Mar 3

Fabs have massive fixed costs but near-zero marginal costs and strong long-term pricing power. AI software is hyper-competitive, supply-constrained today but structurally deflationary. Capital may compound better at the chip layer than in crowded model apps.