Actually, comparing 1-bit with 16-bit makes no sense. Everyone uses 4-bit weights with MLX, and the speed will be around 150-180 tok/s on an M4 Pro. Moreover, 4-bit quantization in MLX can be done as block quantization, which preserves quality in most cases.
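To make the block quantization point concrete, here is a rough pure-Python sketch of group-wise 4-bit affine quantization. It is not MLX's actual kernel: the function names are made up for illustration, and the group size of 64 and the two 16-bit values of metadata per group (scale and offset) are assumptions.

```python
import random

def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def block_quantize(weights, group_size=64, bits=4):
    """Quantize each consecutive group with its own scale and offset."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]

def block_dequantize(groups):
    """Rebuild approximate float weights from (codes, scale, offset) groups."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(c * scale + lo for c in codes)
    return restored

def bits_per_weight(bits=4, group_size=64, meta_bits=32):
    """Effective storage cost, assuming a 16-bit scale and 16-bit offset per group."""
    return bits + meta_bits / group_size

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy stand-in for a weight row
groups = block_quantize(weights)
restored = block_dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_err:.6f}, bits/weight: {bits_per_weight()}")
```

Because each group has its own scale, the rounding error is bounded by half of that group's step size, so an outlier only hurts the 64 weights around it. Under the assumptions above the cost works out to 4.5 bits per weight, i.e. roughly 4.5 GB for 8 billion parameters versus 16 GB at 16 bits.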
1-bit Bonsai 8B running locally on an M4 Pro (MLX) alongside a standard 16-bit 8B model.
Same class of model, very different deployment profile: far lower memory use and substantially higher throughput.