🧠 Run Llama 3 70B on a Single 4GB GPU - with AirLLM and layered inference 🔥
Layer-wise inference is essentially a "divide and conquer" approach.
📌 And this works without quantization, distillation, pruning, or other model compression techniques.
📌 Large language models are large and occupy a lot of memory mainly because their structure contains many “layers.”
An LLM starts with an embedding projection layer, followed by numerous transformer layers that all share the same structure (each with its own weights).
A 70B model has as many as 80 layers. But during inference, each layer runs on its own, relying only on the output of the previous layer.
Therefore, after running a layer, its memory can be released, keeping only the layer’s output. Based on this concept, AirLLM has implemented layered inference.
How ❓
During inference in a Transformer-based LLM, layers are executed sequentially. The output of the previous layer is the input to the next. Only one layer executes at a time.
Therefore, it is completely unnecessary to keep all layers in GPU memory. We can load whichever layer is needed from disk when executing that layer, do all the calculations, and then completely free the memory after.
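To make the load, compute, free cycle concrete, here is a minimal toy sketch. It is not AirLLM's code: each "layer" is just a scale and a bias stored in its own JSON file, and the names `save_layer_shards`, `apply_layer`, and `layered_forward` are made up for illustration. The point is only the shape of the loop, where a single layer's parameters are in memory at any time.

```python
import json
import tempfile
from pathlib import Path


def save_layer_shards(layers, shard_dir):
    """Write each toy layer's parameters to its own file (one shard per layer)."""
    paths = []
    for i, params in enumerate(layers):
        path = Path(shard_dir) / f"layer_{i:02d}.json"
        path.write_text(json.dumps(params))
        paths.append(path)
    return paths


def apply_layer(params, hidden):
    """Toy stand-in for a transformer layer: an elementwise affine map."""
    return [params["scale"] * h + params["bias"] for h in hidden]


def layered_forward(shard_paths, hidden):
    """Run the layers in order, holding only one layer's parameters at a time."""
    for path in shard_paths:
        params = json.loads(path.read_text())  # load just this layer from disk
        hidden = apply_layer(params, hidden)   # compute this layer's output
        del params                             # drop the weights, keep only the output
    return hidden


# Hypothetical three-layer "model" used purely for demonstration.
toy_layers = [
    {"scale": 2.0, "bias": 1.0},
    {"scale": 0.5, "bias": -1.0},
    {"scale": 3.0, "bias": 0.0},
]

with tempfile.TemporaryDirectory() as shard_dir:
    shards = save_layer_shards(toy_layers, shard_dir)
    output = layered_forward(shards, [1.0, 2.0])

print(output)
```

In a real setup the per-layer files would hold tensors and the free step would also release GPU memory, but the control flow stays the same.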
This way, the GPU memory needed for weights at any moment is only about the parameter size of one transformer layer: roughly 1/80 of the full model, or around 1.6GB (assuming 16-bit weights).
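As a rough sanity check on that per-layer figure, the arithmetic can be sketched as below. The dimensions are assumptions taken from the publicly released Llama 3 70B configuration (hidden size 8192, MLP size 28672, 64 attention heads, 8 key/value heads, 80 layers); they are not stated in this post, and 2 bytes per parameter assumes fp16/bf16 weights.

```python
# Assumed dimensions from the public Llama 3 70B config (not given in this post).
HIDDEN = 8192
INTERMEDIATE = 28672
NUM_HEADS = 64
NUM_KV_HEADS = 8
NUM_LAYERS = 80
BYTES_PER_PARAM = 2  # assumption: fp16 / bf16 weights

head_dim = HIDDEN // NUM_HEADS
kv_dim = NUM_KV_HEADS * head_dim  # grouped-query attention shrinks the K and V projections

# Q and O projections are HIDDEN x HIDDEN; K and V are HIDDEN x kv_dim.
attention_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
# Gate, up, and down projections of the MLP.
mlp_params = 3 * HIDDEN * INTERMEDIATE
# Two normalization weight vectors per layer.
norm_params = 2 * HIDDEN

layer_params = attention_params + mlp_params + norm_params
layer_gib = layer_params * BYTES_PER_PARAM / 2**30
all_layers_gib = layer_gib * NUM_LAYERS

print(f"{layer_params:,} parameters per layer")
print(f"{layer_gib:.2f} GiB per layer, {all_layers_gib:.1f} GiB for all {NUM_LAYERS} layers")
```

Under these assumptions one layer comes out just under 1.6 GiB, in line with the figure above. Activations and any KV cache come on top of that.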
📌 Flash attention is then used to deeply optimize CUDA memory access, which can give multi-fold speedups.
📌 Model files are sharded by layer, so each layer can be loaded from disk on its own.
📌 It uses the meta device feature provided by HuggingFace Accelerate. When you load a model via the meta device, the weight data is not actually read in; only the model structure (the code) is created, so memory usage is essentially 0.
📌 AirLLM also provides an option for quantization through the `compression` param.
`compression`: supported options are `4bit` and `8bit`, for 4-bit or 8-bit block-wise quantization.
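To give a feel for what "block-wise" means here, the sketch below quantizes a flat list of weights in small blocks, each with its own absmax scale. This is a generic illustration and not AirLLM's actual implementation; the function names, the block size of 4, and the uniform rounding scheme are all assumptions chosen to keep the example short.

```python
def quantize_blockwise(values, block_size=4, bits=8):
    """Quantize floats block by block, using one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scale = absmax / qmax
        codes = [round(v / scale) for v in block]   # small signed integers
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Recover approximate floats from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


# Hypothetical weights: a block of small values followed by a block with an outlier.
weights = [0.1, -0.2, 0.05, 0.4, 10.0, -5.0, 2.5, 0.0]

blocks8 = quantize_blockwise(weights, block_size=4, bits=8)
restored8 = dequantize_blockwise(blocks8)
max_error8 = max(abs(a - b) for a, b in zip(weights, restored8))

blocks4 = quantize_blockwise(weights, block_size=4, bits=4)
restored4 = dequantize_blockwise(blocks4)
max_error4 = max(abs(a - b) for a, b in zip(weights, restored4))

print("8-bit max error:", max_error8)
print("4-bit max error:", max_error4)
```

Because each block carries its own scale, the outlier 10.0 in the second block does not wash out the small values in the first block. That locality is the main reason to quantize in blocks instead of using one scale for the whole tensor.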