When frameworks such as vLLM, SGLang, and TensorRT-LLM run distributed inference with tensor parallelism (TP), data parallelism (DP), pipeline parallelism (PP), or expert parallelism (EP), GPU communication works as follows.
At the software layer, frameworks typically rely on NCCL collectives such as AllReduce, AllGather, ReduceScatter, and Broadcast, plus point-to-point Send/Recv. Under the hood, NCCL uses the available communication fabric:
• NVLink (or PCIe) for GPU-to-GPU communication within a node
• InfiniBand (or Ethernet, e.g. RoCE) for communication across nodes
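To make the collectives named above concrete, here is a minimal pure-Python sketch of what each one does to the data. It is a toy model, not NCCL: each "rank" is just a list of numbers, nothing moves over NVLink or InfiniBand, and the function names (all_reduce, reduce_scatter, all_gather, broadcast) are hypothetical helpers written for this illustration. It assumes sum as the reduction and a buffer length divisible by the number of ranks.

```python
from typing import List

Buffer = List[float]  # one rank's data; a stand-in for a GPU tensor


def all_reduce(ranks: List[Buffer]) -> List[Buffer]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [list(total) for _ in ranks]


def reduce_scatter(ranks: List[Buffer]) -> List[Buffer]:
    """Rank i ends up with only the i-th chunk of the element-wise sum."""
    world_size = len(ranks)
    total = [sum(vals) for vals in zip(*ranks)]
    chunk = len(total) // world_size  # assumes an even split
    return [total[i * chunk:(i + 1) * chunk] for i in range(world_size)]


def all_gather(ranks: List[Buffer]) -> List[Buffer]:
    """Every rank ends up with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in ranks for x in buf]
    return [list(gathered) for _ in ranks]


def broadcast(ranks: List[Buffer], src: int = 0) -> List[Buffer]:
    """Every rank ends up with a copy of the source rank's buffer."""
    return [list(ranks[src]) for _ in ranks]


if __name__ == "__main__":
    # Two "GPUs", each holding a partial result of four values.
    ranks = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

    print("AllReduce:    ", all_reduce(ranks))
    print("ReduceScatter:", reduce_scatter(ranks))
    print("AllGather:    ", all_gather(ranks))
    print("Broadcast:    ", broadcast(ranks, src=1))

    # AllReduce gives the same result as ReduceScatter followed by AllGather.
    assert all_gather(reduce_scatter(ranks)) == all_reduce(ranks)
```

The final assertion shows a useful identity: an AllReduce produces the same result as a ReduceScatter followed by an AllGather, which is why the three are often discussed together.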
Having this mental model makes it much easier to understand distributed inference.