Wrote an annotated Triton implementation of Flash Attention 2. (Links in reply)
This is based on the flash attention implementation by the Triton team. Changed it to support GQA (grouped-query attention) and cleaned it up a little.
Check it out to read the code for forward and backward passes along with the math and derivations. Hope this helps understand transformer attention and flash attention better.
There are about 60 more annotated deep learning paper implementations on this website.
[Image: The annotated code. Click on math symbols or identifiers to highlight them.]
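For anyone who wants the gist before opening the annotated code, here is a rough NumPy sketch of the two ideas mentioned above: GQA (several query heads share one key/value head) and the blockwise online softmax that flash attention relies on. This is not the Triton kernel. The function names, tensor layout, block size, and the lack of a causal mask are all assumptions made for illustration.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Naive reference. q: [n_q_heads, seq, d]; k, v: [n_kv_heads, seq, d]."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]  # query heads that share one K/V head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # index of the shared K/V head
        s = q[h] @ k[kv].T / np.sqrt(d)
        s = s - s.max(axis=-1, keepdims=True)  # stabilize the softmax
        p = np.exp(s)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ v[kv]
    return out

def gqa_attention_tiled(q, k, v, block=4):
    """Same result, but K/V are consumed block by block with an online softmax,
    so the full [seq, seq] score matrix is never materialized."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group
        m = np.full((seq, 1), -np.inf)  # running row-wise max of the scores
        l = np.zeros((seq, 1))          # running softmax denominator
        acc = np.zeros((seq, d))        # running unnormalized output
        for start in range(0, k.shape[1], block):
            kb = k[kv, start:start + block]
            vb = v[kv, start:start + block]
            s = q[h] @ kb.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            scale = np.exp(m - m_new)   # rescale old partial sums to the new max
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=-1, keepdims=True)
            acc = acc * scale + p @ vb
            m = m_new
        out[h] = acc / l
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8, 16))  # 4 query heads
k = rng.standard_normal((2, 8, 16))  # 2 K/V heads, so each serves 2 query heads
v = rng.standard_normal((2, 8, 16))
print(np.allclose(gqa_attention(q, k, v), gqa_attention_tiled(q, k, v)))
```

The tiled version only keeps a running max, a running denominator, and a running output per query row, which is what lets a GPU kernel keep the working set in fast on-chip memory.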
Letting Copilot comment on my pull request and then replying to those comments and resolving them makes me feel like a schizophrenic. But honestly, some of the suggestions are legit useful, so I’m just gonna keep doing that.
docker is supposed to solve the "works on my machine" problem
but often I find that it just adds another layer to the "works on my machine" problem, especially if you use a Mac
I need to let an LLM "talk" to Swift Core Data. Need a language both the DB and the LLM speak, so the obvious solution is SQL. SQL won't work on a key-value store though. I wonder how hard it would be to write a SQL-like driver for Core Data.
#DoingdumbShitTillImNotDumbAnymore
OpenAI o3 is rated 2727 on Codeforces, which is equivalent to the 175th-best human competitive coder on the planet.
This is an absolutely superhuman result for AI and technology at large.
Our visualization library Inspectus can now visualize values related to tokens in LLM outputs. This demo shows some outputs from using entropyx (by @_xjdr) on Llama 3. Had fun making this. (jk I didn’t)
🔗👇
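To give a sense of what a "value related to a token" can be, here is a small NumPy sketch that computes the entropy of the next-token distribution at each position from raw logits. It does not use the Inspectus or entropyx APIs. The function name and the [seq, vocab] logits layout are assumptions for illustration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape [seq, vocab] (assumed layout).
    Returns an array of shape [seq].
    """
    z = logits - logits.max(axis=-1, keepdims=True)             # stabilize
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log-softmax
    p = np.exp(log_p)
    return -(p * log_p).sum(axis=-1)

# One confident position and one maximally uncertain position.
logits = np.array([
    [10.0, 0.0, 0.0, 0.0],  # mass concentrated on one token: low entropy
    [0.0, 0.0, 0.0, 0.0],   # uniform over 4 tokens: entropy is log(4)
])
print(token_entropy(logits))
```

One such number per generated token is the kind of per-token value that can be shown as a color or bar next to the text.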
We’ve open-sourced our LLM attention visualization library. It generates interactive visualizations of attention matrices with just a few lines of Python code in notebooks.
@luck_not_shit cleaned up and polished the existing code to make it open source.