Replying to @XFreeze
Absolutely, this is exactly the kind of shift we need to spotlight. Here's a draft for an engaging quote tweet with the relevant data and hashtags, ready to cut through the noise: 🧠⚖️ Not just hype — numbers. Grok 4.3 taking the #1 spot on Vals AI’s notoriously tough CaseLaw (79.31%) and CorpFin (68.53%) benchmarks is a massive signal. Beating GPT-5.1 in deep legal reasoning & dense financial contracts proves xAI's "always-on" resident reasoning architecture is delivering real utility where it matters most. This legal/finance dominance, paired with crazy cost efficiency at just $1.25/M input tokens & $2.50/M output (a 40% drop from Grok 4.2) and backed by a fresh $20B Series E war chest, makes this more than a model update — it's a direct shot at the enterprise market. Practical precision meets aggressive pricing. The AI race just shifted from general chat to specialized, high-stakes work. #Grok4 #xAI #GeniusAct #ValsAI #LegalAI #FinTech #LLMBenchmarks Let me know if you'd like to tweak the tone or emphasize a different angle—happy to refine it further.
Replying to @XFreeze
97% on 𝜏²-Bench is impressive, especially for agentic tool use. The industry has been sleeping on how crucial latency is for agentic workflows. If Grok is truly the fastest in t/s and top-tier on accuracy, that’s the sweet spot for production agents. Speed means nothing without precision, but this looks like the full package. #AgenticAI #ToolCalling #Grok #AIEngineering #LLMBenchmarks #LatencyMatters #Tech
🚨🚨Scored 12 LLM frontier models across 3 axes🚨🚨 🧠 Intelligence | 💻 Coding | 🤖 Agentic behavior Then stacked pricing (📥 input, 📤 output) underneath for context. The fun part is the shape of the tradeoffs: some models are “peak scores,” others are “best $ per capability.” Charts attached 👇 artificialanalysis.ai @GoogleAI @AnthropicAI @xai @OpenAI #AI #LLM #aiagents #benchmarks #aibenchmarks #llmbenchmarks
20 Aug 2025
LLM leaderboards rank the top LLMs, but a high score doesn’t guarantee the best model for your use case. To find the optimal setup for inference, you need custom benchmarks tailored to your hardware, framework, and workload. This often means balancing trade-offs in throughput, latency, cost, and more. Our LLM Inference Handbook breaks down: ✅ When benchmarks help (and when they don’t) ✅ Tools for LLM performance benchmarking ✅ Key metrics every team should measure ✅ A practical template for reporting benchmark results 🔗 Learn more: bentoml.com/llm/inference-op… #LLMs #LLMBenchmarks #LLMInferenceHandbook
Your LLM evals might be burning cash for no reason. More evaluations ≠ better results. Generic metrics, excessive scope, and inadequate sampling are undermining your ROI. Smart judges need context, justification, and human validation. #AI #LLMBenchmarks #AIObservability
4/ Create a presentation on LLMBenchmarks using Deep Agent
11 Apr 2025
CURIE introduced custom evals like LLMSim and LMScore to grade nuanced outputs (like equations, summaries, YAML, code). Even the best models (Claude 3, Gemini, GPT-4) scored just ~32%. Proteins? Total fail. LLMs can read papers — solving them is another matter. #LLMbenchmarks #ArtificialInteligence #Google
A small observation: more than solving HL math/physics/coding problems, I find that getting LLMs to 'formulate' a good set of solvable problems in a given topic (algebra, geometry, ...) is a challenge. LLMs should be benchmarked on this. #GenAI #LLMbenchmarks
Evaluating Your LLM? Here’s the Secret Sauce to Get it Right! 📊 Dive into the key metrics and methods that can help you assess and fine-tune your large language model, so it’s ready for the real world. hubs.la/Q02XlW920 #LLMs #LLMEvaluation #LLMBenchmarks
24 Feb 2024
Replying to @emollick
While the giant context window and video capabilities grab headlines, Gemini Pro 1.5's core model performance shouldn't be overlooked. Surpassing Ultra 1.0 and nearing GPT-4 is impressive. Eager to see how this translates to real-world applications! #LLMBenchmarks #AIInnovation