A big issue we had when serving ColQwen was non-deterministic output embeddings. More specifically, the embeddings produced for the same images would differ when the batch size changed at inference, leading to non-zero variations in performance. This was surprising to us...
I triple-checked the padding, looked for LoRA shenanigans, spent hours running tests, and eventually determined that the differences stemmed from the backbone model, Qwen, and notably its attention kernel. I have to admit I was a bit reassured that the "bug" was not my fault, but the error still persisted, and "floating point precision errors" was quite an unsatisfying explanation to me...
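To make the "floating point precision" explanation a bit more concrete, here is a minimal sketch in plain Python. It is not the Qwen attention kernel, and the helper names `sequential_sum` and `chunked_sum` are made up for illustration. The assumption is that a different batch size can lead a kernel to split its reductions differently, which changes the order of the additions; since floating-point addition is not associative, the same inputs can then give slightly different outputs.

```python
def sequential_sum(values):
    """Add values strictly left to right (hypothetical helper)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then add the partial sums.

    This loosely mimics a kernel that splits a reduction into tiles;
    the chunk size stands in for a strategy that depends on batch size.
    """
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Same numbers, same math on paper, two different reduction orders.
values = [1e16, 1.0, -1e16, 1.0]

one_pass = sequential_sum(values)    # ((1e16 + 1) - 1e16) + 1
two_tiles = chunked_sum(values, 2)   # (1e16 + 1) + (-1e16 + 1)

print(one_pass, two_tiles, one_pass == two_tiles)
```

Neither ordering recovers the exact answer here, and the two orderings disagree with each other, even though nothing in the code is random. The values are deliberately extreme to make the effect visible; in a real model the differences would be far smaller, but they are of the same nature.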
In the end, I dropped it and just built a somewhat efficient API server that processed images with batch size 1 (which turned out to be useful for other reasons too: custom resolutions, less of a CPU bottleneck), but the frustration stayed with me...
Reading this truly made me realize I still have a ton to learn, and gave me some much-appreciated answers!
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”.
We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly.
The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains.
thinkingmachines.ai/blog/def…