Llama 3.1 405B could be the catalyst for much greater
#AMD adoption for AI inference 📈
@AMD's MI300X may be uniquely suited to cost-effective Llama 3.1 405B inference. Its 192GB of memory allows a single 8xMI300X node to serve Llama 3.1 405B in its native FP16 precision - whereas two 8xH100 nodes are required on
@nvidia.
As we have previously covered, a single NVIDIA 8xH100 node only has 640GB of memory - not enough to hold Llama 3.1 405B’s full 810GB of FP16 weights in memory at once. This means that providers are forced to deploy two 8xH100 nodes with interconnect to serve 405B in FP16 precision, forcing them to accept a significant cost and complexity penalty.
Nvidia’s future H200 and B100 come with 141GB and 192GB of high bandwidth memory respectively - but unlike those, AMD MI300X is available now.
@LisaSu noted on AMD’s Q2 earnings call that AMD was demand-limited on MI300X for the remainder of 2024. Will Llama 3.1 405B alone flip that narrative?
We are starting to see adoption and support increase. Both
@FireworksAI_HQ and
@LeptonAI are hosting Llama 3.1 405B on AMD MI300X chips. They stand out as the lowest cost providers of Llama 3.1 405B. However, it is important to note they are serving the model at FP8 and INT8 precision respectively.
Furthermore, projects like GPU.cpp from
@answerdotai (
@jeremyphoward,
@austinvhuang) are making it easier than ever to write and run portable code across different chip (hardware & software) architectures - decreasing the CUDA lock-in.
What is your view? Long
#AMD?