Aasim Mahmood | ₿

@K9Aasim

May 4

Replying to @XFreeze

Absolutely, this is exactly the kind of shift we need to spotlight. Here's a draft for an engaging quote tweet with the relevant data and hashtags, ready to cut through the noise: 🧠⚖️ Not just hype — numbers. Grok 4.3 taking the #1 spot on Vals AI’s notoriously tough CaseLaw (79.31%) and CorpFin (68.53%) benchmarks is a massive signal. Beating GPT-5.1 in deep legal reasoning & dense financial contracts proves xAI's "always-on" resident reasoning architecture is delivering real utility where it matters most. This legal/finance dominance, paired with crazy cost efficiency at just $1.25/M input tokens & $2.50/M output (a 40% drop from Grok 4.2) and backed by a fresh $20B Series E war chest, makes this more than a model update — it's a direct shot at the enterprise market. Practical precision meets aggressive pricing. The AI race just shifted from general chat to specialized, high-stakes work. #Grok4 #xAI #GeniusAct #ValsAI #LegalAI #FinTech #LLMBenchmarks Let me know if you'd like to tweak the tone or emphasize a different angle—happy to refine it further.

101

Jehangeer H

@jehangeer_hasan

Mar 31

Replying to @XFreeze

97% on 𝜏²-Bench is impressive, especially for agentic tool use. The industry has been sleeping on how crucial latency is for agentic workflows. If Grok is truly the fastest in tokens per second (t/s) and top-tier on accuracy, that’s the sweet spot for production agents. Speed means nothing without precision, but this looks like the full package. #AgenticAI #ToolCalling #Grok #AIEngineering #LLMBenchmarks #LatencyMatters #Tech

ByteBrief Tech Insights

@ByteBriefTech

29 Dec 2025

🚨🚨Scored 12 frontier LLMs across 3 axes🚨🚨 🧠 Intelligence | 💻 Coding | 🤖 Agentic behavior Then stacked pricing (📥 input, 📤 output) underneath for context. The fun part is the shape of the tradeoffs: some models are “peak scores,” others are “best $ per capability.” Charts attached 👇 artificialanalysis.ai @GoogleAI @AnthropicAI @xai @OpenAI #AI #LLM #aiagents #benchmarks #aibenchmarks #llmbenchmarks

1,042

BentoML

@bentomlai

20 Aug 2025

LLM leaderboards rank the top LLMs, but a high score doesn’t guarantee the best model for your use case. To find the optimal setup for inference, you need custom benchmarks tailored to your hardware, framework, and workload. This often means balancing trade-offs in throughput, latency, cost, and more. Our LLM Inference Handbook breaks down: ✅ When benchmarks help (and when they don’t) ✅ Tools for LLM performance benchmarking ✅ Key metrics every team should measure ✅ A practical template for reporting benchmark results 🔗 Learn more: bentoml.com/llm/inference-op… #LLMs #LLMBenchmarks #LLMInferenceHandbook

231

Soumendra Kumar Sahoo

@soumendrak_

11 May 2025

Your LLM evals might be burning cash for no reason. More evaluations ≠ better results. Generic metrics, excessive scope, and inadequate sampling are undermining your ROI. Smart judges need context, justification, and human validation. #AI #LLMBenchmarks #AIObservability

262

Chidanand Tripathi

@thetripathi58

29 Apr 2025

4/ Create a presentation on LLMBenchmarks using Deep Agent

813

Jacobarrio

@jlee8648

11 Apr 2025

CURIE introduced custom evals like LLMSim and LMScore to grade nuanced outputs (like equations, summaries, YAML, code). Even the best models (Claude 3, Gemini, GPT-4) scored just ~32%. Proteins? Total fail. LLMs can read papers — solving them is another matter. #LLMbenchmarks #ArtificialInteligence #Google

WinBuzzer

@WBuzzer

23 Mar 2025

Tencent Releases its Hunyuan T1 AI Reasoning Model, Beating DeepSeek R1, GPT-4.5, o1 Across Multiple Benchmarks #AI #GenAI #TencentAI #HunyuanT1 #AIReasoning #EnterpriseAI #LLMbenchmarks #ChinaAI #MMLU #MathAI #AIModels #AIInference winbuzzer.com/2025/03/23/ten…

Tencent has positioned Hunyuan T1 as a reasoning-optimized model, with benchmark results confirming its strengths in structured logic and math accuracy.

winbuzzer.com

219

prasad kompalli

@pkompalli

2 Mar 2025

A small observation: more than solving HL math/physics/coding problems, I find that getting LLMs to 'formulate' a good set of solvable problems in a given topic (algebra, geometry ...) is a challenge. LLMs should be benchmarked on this. #GenAI #LLMbenchmarks

159

Data Science Dojo

@DataScienceDojo

7 Nov 2024

Evaluating Your LLM? Here’s the Secret Sauce to Get it Right! 📊 Dive into the key metrics and methods that can help you assess and fine-tune your large language model, so it’s ready for the real world. hubs.la/Q02XlW920 #LLMs #LLMEvaluation #LLMBenchmarks

Master LLM Evaluation: The Ultimate Guide to Better Insights

This comprehensive LLM evaluation guide explains the importance of benchmarks, metrics, and leaderboards to measure LLM capabilities in real world applications.

datasciencedojo.com

1,506

Simon P

@simonkp

24 Feb 2024

Replying to @emollick

While the giant context window and video capabilities grab headlines, Gemini Pro 1.5's core model performance shouldn't be overlooked. Surpassing Ultra 1.0 and nearing GPT-4 is impressive. Eager to see how this translates to real-world applications! #LLMBenchmarks #AIInnovation

418