weekend @evo__hq activities: given the recent sovereign-AI narrative (amidst anthropic's fable pullback), i wanted to see if evo could help make some of the indian models we have better in whatever way.
so i kicked off an autoresearch run on evo to see if @SarvamAI's 30B decode throughput could be improved, at bf16, on a single H100.
currently 10 hours in. so far, evo seems to have found a ~3% improvement.
the metric is the geometric mean of tok/s across batch sizes 64 / 128 / 256, measuring steady-state decode only. prefill is excluded from the timing, so this is purely per-token decode rate on a fixed workload.
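a rough sketch of how a metric like this can be computed, to make the definition concrete. this is not evo's actual harness code: the function names and the example rates are made up for illustration, and only the batch sizes and the "geometric mean of decode tok/s" definition come from the run described above.

```python
import math

def decode_tok_s(tokens_generated: int, decode_seconds: float) -> float:
    """Decode-only rate: tokens produced divided by decode wall time.
    Prefill time is assumed to be excluded by the caller."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return tokens_generated / decode_seconds

def geometric_mean_tok_s(tok_s_by_batch: dict) -> float:
    """Geometric mean of per-batch-size decode rates."""
    rates = list(tok_s_by_batch.values())
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("need at least one positive rate")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))

def relative_gain(baseline: dict, candidate: dict) -> float:
    """Fractional improvement of the candidate's metric over the baseline's."""
    return geometric_mean_tok_s(candidate) / geometric_mean_tok_s(baseline) - 1.0

# Hypothetical rates keyed by batch size (illustrative numbers, not measured).
baseline = {64: 100.0, 128: 200.0, 256: 400.0}
candidate = {bs: r * 1.03 for bs, r in baseline.items()}
gain = relative_gain(baseline, candidate)
```

the geometric mean is a reasonable choice here because the three batch sizes produce rates on very different scales, and it weights a 3% gain at batch 64 the same as a 3% gain at batch 256.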
evo also ensures that anything that gets faster by changing outputs, lowering precision, or messing with MoE routing is rejected by the accuracy gate.
the gate compares each candidate against a frozen baseline on both next-token distributions and actual decoded tokens. if argmax agreement or logprob drift moves meaningfully, the change is rejected, even if it is faster.
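to illustrate the idea, here is a minimal sketch of a gate of this shape. it is an assumption-heavy toy, not evo's implementation: the function names, the thresholds (`min_argmax_agreement`, `max_logprob_drift`, `min_token_match`), and the choice to measure drift as the largest change in logprob on the baseline's top token are all hypothetical.

```python
def _argmax(xs):
    """Index of the largest value in a list."""
    return max(range(len(xs)), key=xs.__getitem__)

def accuracy_gate(base_logprobs, cand_logprobs, base_tokens, cand_tokens,
                  min_argmax_agreement=0.999,   # hypothetical threshold
                  max_logprob_drift=0.01,       # hypothetical threshold
                  min_token_match=1.0):         # hypothetical threshold
    """Return True if the candidate stays close enough to the frozen baseline.

    base_logprobs / cand_logprobs: one list of per-vocab logprobs per position.
    base_tokens / cand_tokens: the decoded token ids for the same prompts.
    """
    if not base_logprobs or len(base_logprobs) != len(cand_logprobs):
        return False
    if not base_tokens or len(base_tokens) != len(cand_tokens):
        return False

    agree = 0
    drift = 0.0
    for b, c in zip(base_logprobs, cand_logprobs):
        top = _argmax(b)
        agree += int(_argmax(c) == top)
        # Drift: how far the candidate moved on the baseline's top token.
        drift = max(drift, abs(b[top] - c[top]))

    argmax_agreement = agree / len(base_logprobs)
    token_match = sum(x == y for x, y in zip(base_tokens, cand_tokens)) / len(base_tokens)

    return (argmax_agreement >= min_argmax_agreement
            and drift <= max_logprob_drift
            and token_match >= min_token_match)

# Toy two-position, two-token-vocab example.
base = [[-0.1, -2.4], [-1.6, -0.2]]
close = [[-0.101, -2.39], [-1.61, -0.201]]    # tiny numerical noise
flipped = [[-2.4, -0.1], [-1.6, -0.2]]        # argmax changed at position 0
ok = accuracy_gate(base, close, [0, 1], [0, 1])
bad = accuracy_gate(base, flipped, [0, 1], [1, 1])
```

the point of the design is that speed never enters the gate: a candidate is checked against the baseline first, and only candidates that pass are compared on tok/s.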
very imp caveat: these are experiment-harness numbers, not production serving numbers.
the gain still needs to be validated in a real serving setup before anyone treats it as real capacity.
i also haven't done an external audit for any benchmark hacks the agent may have done.
a ~3% bump is potentially a pretty significant improvement for someone like sarvam at that scale. decode is a major part of inference cost. a ~3% decode-side throughput gain at identical accuracy means more capacity, or fewer GPUs, without changing the model.
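the back-of-envelope arithmetic behind "more capacity, or fewer GPUs", assuming the gain carries over to serving unchanged and that load is purely decode-bound (both are assumptions, and the function names are made up for illustration):

```python
def extra_capacity_fraction(throughput_gain: float) -> float:
    """Same GPUs, faster decode: capacity grows by the gain itself."""
    return throughput_gain

def gpu_savings_fraction(throughput_gain: float) -> float:
    """Same load, faster decode: fraction of GPUs that could be dropped.
    Each GPU does (1 + gain) times the work, so you need 1 / (1 + gain) as many."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)

gain = 0.03  # the ~3% from this run
capacity = extra_capacity_fraction(gain)
savings = gpu_savings_fraction(gain)   # slightly under 3%
```

so a 3% throughput gain works out to a bit under 3% fewer GPUs for the same decode load, which only becomes a whole-GPU saving at fleet sizes in the tens of GPUs and up.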
i also want to give a s/o to @vishnuvig of jarvislabs.ai / @e2enetworks for compute support. i am trying to use as much compute from indian providers as i can, and to give feedback to improve the experience as well.