I gave TensorRT (1) a try today for fp8/int8 quantization, and I think OneDiff (2) still has a decent edge: it easily gives me speed similar to fal.ai out of the box.
Results look promising in TensorRT's blog posts, but it's a typical large-company workflow vs. the open source solutions that people vote for with their feet. You have to install a number of Nvidia deps on Linux, calibrate with a few thousand prompts, export an ONNX file, build a TensorRT engine, do graph surgery on the ONNX file, generate a TRT plan (which took ~10 mins), and swap that into another TRT-specific inference script ...
Meanwhile, my earlier OneDiff prototype just does dynamic tracing, produces a reusable graph in ~1 min with matching perf, can do quick dynamic LoRA swapping, and even supports ComfyUI.
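To make "dynamic tracing, reusable graph" concrete, here is a toy pure-Python sketch of the general idea: run the function once on a recording proxy, cache the recorded op list, then replay it on real inputs. This is not OneDiff's implementation or API; `Proxy`, `compile_by_tracing`, and the cache key are hypothetical names made up for illustration.

```python
import operator

# Hypothetical toy tracer: illustrates "trace once, reuse the graph".
OPS = {"add": operator.add, "mul": operator.mul}


class Proxy:
    """Stands in for an input and records every op applied to it."""

    def __init__(self, name, tape):
        self.name = name
        self.tape = tape

    def _record(self, op, other):
        out = Proxy(f"t{len(self.tape)}", self.tape)
        rhs = other.name if isinstance(other, Proxy) else other
        self.tape.append((out.name, op, self.name, rhs))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def compile_by_tracing(fn):
    """Wrap fn so it is traced once per input type and replayed afterwards."""
    cache = {}

    def run(x):
        key = type(x).__name__  # assumption: input type stands in for shape/dtype
        if key not in cache:
            tape = []
            out = fn(Proxy("x", tape))  # the one-time trace
            cache[key] = (tape, out.name)
            run.traces += 1
        tape, out_name = cache[key]
        env = {"x": x}
        for name, op, lhs, rhs in tape:  # replay the recorded graph
            rhs_val = env[rhs] if isinstance(rhs, str) else rhs
            env[name] = OPS[op](env[lhs], rhs_val)
        return env[out_name]

    run.traces = 0
    return run
```

Calling `f = compile_by_tracing(lambda x: (x + 1) * 2)` and then `f(3)` and `f(10)` traces only on the first call; the second call replays the cached graph.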
At a high level, these frameworks aim to do the same thing: produce a compiled, fused, layout-optimized computation graph for diffusion models. But large companies and open source projects just have different tastes in simplicity and UX when facing tradeoffs against best benchmarked performance.
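For a rough feel of what "fused" means here, a toy pure-Python sketch: consecutive elementwise ops get merged into one node, so the data is walked once instead of once per op. This is only an illustration of the concept, not how TensorRT or OneDiff implement fusion; `fuse_elementwise`, `execute`, and the node format are hypothetical.

```python
def compose(fns):
    """Chain several scalar functions into a single scalar function."""
    fns = tuple(fns)

    def fused(v):
        for f in fns:
            v = f(v)
        return v

    return fused


def fuse_elementwise(graph):
    """Merge runs of consecutive elementwise nodes into single fused nodes."""
    fused, pending = [], []
    for kind, fn in graph:
        if kind == "elementwise":
            pending.append(fn)
            continue
        if pending:
            fused.append(("fused", compose(pending)))
            pending = []
        fused.append((kind, fn))
    if pending:
        fused.append(("fused", compose(pending)))
    return fused


def execute(graph, data):
    """Run the graph node by node; each node counts as one pass over the data."""
    passes = 0
    for kind, fn in graph:
        if kind in ("elementwise", "fused"):
            data = [fn(v) for v in data]
        else:  # whole-tensor op, e.g. a reduction
            data = fn(data)
        passes += 1
    return data, passes


# Hypothetical mini-graph: scale, bias, ReLU, then a sum reduction.
graph = [
    ("elementwise", lambda v: v * 2.0),
    ("elementwise", lambda v: v + 1.0),
    ("elementwise", lambda v: max(v, 0.0)),
    ("reduce", sum),
]
```

Running `execute(graph, data)` and `execute(fuse_elementwise(graph), data)` gives the same result, but the fused graph has two nodes instead of four.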
1)
github.com/NVIDIA/TensorRT-M…
2)
github.com/siliconflow/onedi…