Here are Android AI
@Cellhasher Qwen3.5 DeepSeek 33B benchmarks.
TLDR: its actually worth it if you have old androids for the cheap less than 5w most of these phones operate at. Especially as agents become more deployable and 24/7 hands off or sit and forget. AI model companies will eventually start fine tuning even further to get models into more of a MOE style but tuned directly to the use with Routers, routing each request. Cellhasher is working on this as well. As demand for compute and energy continues, eventually companies will not keep making bigger models and not Company can keep pace with the big ones, they will focus on small local models for retail.
We will start with the 5 year old Chipset and Bring you up to speed with some wild results on a newest-chipset.
Device: Snapdragon 888 (5 years old chip)
CPU only
Non-root (28 GB/s memory cap)
Rooted devices (56 GB/s) scale ~1.8β1.9x.
(using the fastest 4 cores actually outperforms using all 8 cores)
Qwen3.5 - 0.8B
CPU (4 big cores):
12.54β13.01 tok/s
Rooted (1.8β1.9x): 22.6β24.7 tok/s
Vulkan GPU (NGL=24):
1.60β1.78 tok/s (7β8x slower)
Qwen3.5 - 2B
CPU (4 big cores):
9.15β9.36 tok/s
Rooted: 16.5β17.8 tok/s
Vulkan GPU (NGL=24):
0.78β0.86 tok/s (11β12x slower)
Large Models (Non-Root)
#p stands for Number of Phones in a parallel pipeline ring made by Cellhasher Swarm AI
DeepSeek 33B β 5.89 tok/s (Best was 7.8 tok/s) average is still 5.89 tok/s
(12p, d=16, 81% accept)
Qwen3.5-35B-A3B β 3.75 tok/s (best was 5.1 tok/s)
(3p, d=8, 71.5% accept)
Qwen3.5-32B (Coder) β 2.85 tok/s (best 4.8 tok/s)
(7p, same-family draft)
Rooted Estimate (56 GB/s)
DeepSeek 33B β ~10β11 tok/s
Qwen3.5-35B β ~6.5β7 tok/s
Qwen3.5-32B β ~5β5.5 tok/s
Snapdragon 8 Elite Gen 5 plus 24gb RAM (Android Phone)
~75β85 GB/s memory bandwidth
INT4/INT8 NPU usable
Well-tuned pipeline
Qwen3.5 - 0.8B Model
CPU only: 30β45 tok/s
CPU NPU: 70β100 tok/s
Qwen3.5 - 2B Model
CPU only: 18β25 tok/s
CPU NPU: 40β60 tok/s
DeepSeek 33B
CPU only: 15β20 tok/s
CPU NPU (blended): 20β28 tok/s
(30 tok/s possible with ideal tuning)
Qwen3.5-35B-A3B
CPU only: 11β15 tok/s
CPU NPU: 16β22 tok/s
Qwen3.5-32B (Coder)
CPU only: 10β14 tok/s
CPU NPU: 15β20 tok/s
(CPU NPU is slightly tricky and prefill along with ring pipeline can determine alot, KV cache catching is something i haven't played around with yet)
33B class ~2.5β3.5x over Snapdragon 888
Small models see major NPU uplift
Memory bandwidth remains the limiter on large models
@Cellhasher has come along way driving inspo from
@exolabs over the last few months although things needed to change in order to be correctly managed for android really none of the EXO features are now used, we have our own modification to llama.cpp that overs this ring pipelined parallelism for running LARGE models across however many phones it takes.
What's hard is sometimes less phones doesn't always compute to high tokes as some would think less hops will do the trick, in some cases it does other cases like Spec drafting sometimes it doesn't. Regardless Automously running agents 24/7 if i Can run a 80b model at 1-5tok/s on 5 year old Android hardware that runs at 5-15w depending on the amount of phones used, ill take it.
If you make the upgrade to the latest generations of phones you get to experience amazing breakthrough of NPU sync and bandwidth optimization that can get you that amazing 20-70 tok/s feel depending per model for every day use.
And now im thinking about taking the plunge into 20 of these new phones 20k for 160cores, and 480gb RAM i could possible run the latest and greats at maybe speeds of 10 tok/s.... who knows let me know if you want me to try!
(All Phones benchmarked on Ethernet as it offers the best latency over Wifi) (Wifi Still works just slower)
π Introducing the Qwen 3.5 Small Model Series
Qwen3.5-0.8B Β· Qwen3.5-2B Β· Qwen3.5-4B Β· Qwen3.5-9B
β¨ More intelligence, less compute.
These small models are built on the same Qwen3.5 foundation β native multimodal, improved architecture, scaled RL:
β’ 0.8B / 2B β tiny, fast, great for edge device
β’ 4B β a surprisingly strong multimodal base for lightweight agents
β’ 9B β compact, but already closing the gap with much larger models
And yes β weβre also releasing the Base models as well.
We hope this better supports research, experimentation, and real-world industrial innovation.
Hugging Face:
huggingface.co/collections/Qβ¦
ModelScope:
modelscope.cn/collections/Qwβ¦