~90% of AI’s energy is devoted to one core, expensive operation: matrix multiplication.
This partly explains why so much research and engineering on LLM inference targets reducing the IO and complexity of MatMuls: O(n³) FLOPs to process only O(n²) data. When n = 30K, the gap is enormous...
A question I find fascinating is whether there exist alternative bilinear operators f(W, X) for “mixing” LLM tokens, that are as “expressive” as MatMul for human-related tasks, but require ≪ n³ FLOPs (and ideally O(n²) IOs) to evaluate.
There have been significant attempts in this direction, starting with Google’s FNet, and continuing with a long line of work on other bilinear operators: tensor algebras, Hadamard products, fast polynomial multiplication / convolutions, Butterfly matrices and other structured operators.
These approaches demonstrated promising results on some benchmarks, but when scaled to SoTA LLMs, they all seem to suffer from substantial accuracy degradation.
Is there something “holy” about MatMul in deep learning after all?
What seems to distinguish MatMul from many other bilinear forms is that it encodes arbitrary change of basis.
This is a qualitative property: A fundamental reason why ML algorithms work is the premise that real-world datasets have underlying structure – a basis/representation in which the data becomes sparse, invariant, clustered, low-dimensional, etc.
For example, changing to Fourier bases such as the DFT reveals frequencies. The DCT compresses images. Wavelets localize scale. Learned projections encode invariant subspaces, task-relevant metrics, and directions of variation. Attention scores depend on the learned embedding of LLM tokens, which are a choice of basis.
Oversimplifying, learning can be viewed as the optimization problem of finding the “right” basis to represent the data. MatMuls are the computational primitive that switches between these bases, i.e. linear projections.
By contrast, other bilinear operators such as elementwise multiplication, or convolution / polynomial multiplication p·q, do not obviously have this same functional interpretation. They impose structure, but they do not represent an arbitrary change of coordinates.
This observation may help clarify what properties would need to be preserved if we are ever to find cheaper bilinear alternatives to MatMuls in deep learning.
I still find the above question intriguing, both in theory and practice…