stevibe


stevibe

@stevibe

13h

Tested DiffusionGemma 26B A4B vs Gemma4 26B A4B (BenchLocal, 4 bench packs):

Diffusion | Gemma4
> ToolCall-15: 83 | 97
> BugFind-15: 92 | 82
> DataExtract-15: 76 | 83
> ReasonMath-15: 50 | 89

Gemma4 leads overall, the ReasonMath gap is steep. But DiffusionGemma edged it out on BugFind-15, which surprised me.

Diffusion text quality looks rougher right now, but it's still an experiment. Curious to see where it lands long-term.
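The "leads overall" reading of the numbers in this post can be checked with a few lines of arithmetic. This is a minimal sketch that assumes an unweighted mean across the four packs; the post does not say how (or whether) BenchLocal aggregates scores, so the averaging and the variable names here are illustrative only.

```python
# Scores copied from the post: (DiffusionGemma 26B A4B, Gemma4 26B A4B).
scores = {
    "ToolCall-15": (83, 97),
    "BugFind-15": (92, 82),
    "DataExtract-15": (76, 83),
    "ReasonMath-15": (50, 89),
}


def unweighted_mean(values):
    """Plain average; an assumption, not BenchLocal's documented aggregate."""
    values = list(values)
    return sum(values) / len(values)


diffusion_mean = unweighted_mean(d for d, _ in scores.values())
gemma4_mean = unweighted_mean(g for _, g in scores.values())

# Positive delta means DiffusionGemma scored higher on that pack.
deltas = {pack: d - g for pack, (d, g) in scores.items()}

for pack, delta in deltas.items():
    print(f"{pack}: {delta:+d}")
print(f"mean: {diffusion_mean:.2f} vs {gemma4_mean:.2f}")
```

Under that assumption the gap is 75.25 vs 87.75, with BugFind-15 (+10) the only pack where DiffusionGemma is ahead and ReasonMath-15 (-39) the largest deficit.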

3,801

stevibe

@stevibe

May 29

Step 3.7 Flash: 4 BenchLocal Results
> ToolCall-15: 83
> BugFind-15: 95
> HermesAgent-20: 78
> DataExtract-15: 83

14,725

Diego Carlino

@_carlid_dev

May 19

BenchLocal results for DeepSeek v4 flash q2-imatrix served by ds4 DGX spark specific CUDA kernel, 140k ctx size:
> BugFind-15: 86
> CLI-40: 53
> DataExtract-15: 88
> HermesAgent-20: 84
> InstructFollow-15: 93
> ReasonMath-15: 63
> StructOutput-15: 80
> ToolCall-15: 93

antirez @antirez

May 19

Interesting DS4 2bit testing going on.

265

Anant 🔱@Oh_anant

Apr 19

x.com/i/article/204579277403…

stevibe

@stevibe

Apr 17

Introducing HermesAgent-20, a new Bench Pack for BenchLocal. 20 scenarios extracted straight from the Hermes Agent source code, run against a REAL Hermes instance. The actual workload you'd put your model through.

Why I built BenchLocal in the first place: most benchmarks are too abstract. We use local LLMs for practical work, and finding the right model for YOUR task efficiently is the single most important thing, especially when you're constrained to what fits on your machine.

BenchLocal is a framework: providers, models, side-by-side comparison, all in one UI. Bench Packs are the unit of testing: ToolCall-15 and BugFind-15 shipped first, and when I launched BenchLocal 0.1.0, I added StructOutput, ReasonMath, InstructFollow, and DataExtract. Now, HermesAgent-20 is the newest.

Bench Packs install like VS Code extensions. The SDK is open: write your own, share it, grow the ecosystem.

Here's the goal: a community-built, practical evaluation layer for the local LLM space.

Early numbers on HermesAgent-20:
> GLM 5.1: 85
> Gemma4 31B: 83
> Qwen3.5 27B: 79
> MiniMax M2.7: 76

Upgrade to the latest BenchLocal to install HermesAgent-20 (SDK update required).

315

38,631

stevibe

@stevibe

Apr 16

Qwen3.6 35B-A3B: smarter, but forgot how to use tools?

Running 6 Bench Packs on BenchLocal across 3 open-source Qwen models.

✅ ReasonMath: 92 vs 85 vs 86 (3.6 wins)
✅ InstructFollow: 97 / 97 / 97 (tied)
❌ ToolCall: 83 vs 97 vs 100 (3.6 tanks)

Qwen3.5 27B is still the tool-calling champ. 3.6 clearly leveled up reasoning, but tool use took a hit.

DataExtract is live now. BugFind and StructOutput are next.

390

36,739

Berryxia.AI

@berryxia

Apr 14

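Choosing a local LLM finally has a hard-core, practical benchmark! @stevibe has open-sourced BenchLocal, a macOS app and one-stop testing platform. No more guessing at models from abstract leaderboards:

✅ 6 real-world Bench Packs (ToolCall-15 for tool calling, BugFind-15 for debugging, DataExtract-15 for structured extraction, InstructFollow-15, and more)
✅ 15 fixed scenarios per pack, with fully deterministic, verifiable results
✅ Supports Ollama, llama.cpp, OpenRouter, and any OpenAI-compatible endpoint
✅ Open SDK: the community can contribute custom test packs the way they would VS Code extensions

A must-have tool for local AI and agent developers picking a model! MIT-licensed open source; macOS v0.1 is out now, with Windows/Linux coming soon. BenchLocal GitHub download 👇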

stevibe

@stevibe

Apr 13

I built a macOS app for benchmarking local LLMs. 6 test suites. Multiple providers. One workspace. Open source.

There are hundreds of local models now. New ones every week. How do you actually pick one?

Leaderboards test for general ability. But if you're building an agent that chains tool calls, or a pipeline that extracts structured data, or a code assistant that needs to debug Rust, you need to know if the model handles that specific thing. Not in theory. On your hardware. With your prompts.

The benchmarks that exist are either locked behind papers, too abstract to map to real failures, or impossible to extend. You can't add your own test cases. You can't test what matters to your use case.

That's what BenchLocal is for. It's a benchmark platform where every test is practical, deterministic, and built around real-world tasks. And you can build your own tests.

It ships with 6 Bench Packs TODAY:
→ ToolCall-15: tool-use accuracy
→ BugFind-15: debugging capabilities
→ DataExtract-15: structured data extraction
→ InstructFollow-15: constraint-heavy instruction following
→ ReasonMath-15: practical reasoning and math
→ StructOutput-15: validator-backed structured output

Every pack has 15 fixed scenarios. Every score is deterministic and verifiable.

Some of you saw ToolCall-15 and BugFind-15, the individual test packs I open-sourced over the past few weeks. People ran them, filed issues, sent PRs. But managing separate repos, separate scripts, separate results doesn't scale. BenchLocal puts everything in one place.

What the app does:
> Workspace with tabs: run BugFind-15 in one tab, ToolCall-15 in another.
> Any provider: Ollama, llama.cpp, OpenRouter, any OpenAI-compatible endpoint. Local and cloud, same interface.
> Run modes: serial, batch per model, batch per test case, or fully parallel.
> Test histories: every run saved. Compare any previous session.

But the part I'm most excited about isn't the app. It's the ecosystem. BenchLocal is a platform. Each Bench Pack is a plugin. I'm shipping an SDK so anyone can build their own: test what matters to you, package it, share it. Install and uninstall packs right inside the app, same way you'd manage extensions in VS Code. The registry is GitHub-based, fully public. I built 6 packs. I want the community to build the next 60.

Theme system built in too, because if I'm staring at benchmark results for hours, it should at least look good.

v0.1.0 is macOS only. Windows and Linux are coming. MIT licensed. Everything (the app, the bench packs, the SDK) is open. PRs welcome. Bench Packs even more welcome.
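To make the "fixed scenarios, deterministic and verifiable score" idea concrete, here is a rough sketch of how a validator-backed pack could be scored. It is not the BenchLocal SDK: `Scenario`, `is_json_with_keys`, and `run_pack` are hypothetical names, and scoring as the percentage of scenarios passed is an assumption, since the post does not describe how scores are computed.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Scenario:
    """One fixed test case: a prompt plus a deterministic pass/fail check."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def is_json_with_keys(*keys: str) -> Callable[[str], bool]:
    """Build a check that passes only if the output is a JSON object with all keys."""
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(key in data for key in keys)
    return check


def run_pack(scenarios: Sequence[Scenario], model: Callable[[str], str]) -> int:
    """Run every scenario once and return the share passed as a 0-100 score."""
    passed = sum(1 for s in scenarios if s.check(model(s.prompt)))
    return round(100 * passed / len(scenarios))
```

Because each check is a pure function of the model's output, rerunning the pack on the same outputs always yields the same score; `model` can be any callable that wraps a provider such as a local OpenAI-compatible endpoint.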

2,541

stevibe

@stevibe

Apr 13

0:56

302

50,538

Gujarat Samachar

@gujratsamachar

26 Jun 2025

Ahmedabad plane crash: Data downloaded from black box of AI171 flight, memory module accessed #ahmedabad #ahmedabadnews #blackbox #update #dataextract #ministryofaviation #memorymodule english.gujaratsamachar.com/…

The Ministry of Civil Aviation shared an update on the data extraction process on Thursday.

english.gujaratsamachar.com

590

AlgoDocs - Real Time Document Data Extraction @AlgoDocs

16 Dec 2024

🗒✍🔀 Capture and extract data from handwritten PDF files with 100% accuracy and 10X speed using AlgoDocs. Easily save data in CSV, Excel, JSON, or XML formats.⤵️ youtube.com/watch?v=LQ2MhJKZ… #algodocs #dataextract #Aitools #AIdataextraction #OCR #ocrapi #IDP #Documentprocessing

Extract Handwritten Text Using AlgoDocs

Effortlessly Extract Handwritten Data from PDFs with AlgoDocsTire...

youtube.com

BedRock Data Solutions @BedrockData

2 Sep 2024

Let's Go to the Airport... #CleaningToilets #SeeDataClearly #airports #travel #architecture #data #dataarchitecture #metadata #datasecurity #datagovernance #GDPR #dataextract #datatransportlayers #datastorage #dataAPIs #masterdatamangement ow.ly/1acw50Tc54t

Shishir Sutradhar @Shishiranik4

11 Nov 2023

Check out my Data Specialist Services fiverr.com/s/76L611 #Dataentry #Webscrapping #Datascrapping #Dataextract

docWays GmbH - wir bringen Dokumente auf den Weg!

Eugene Hurynovich @ehurynovich

8 Aug 2019

We have the best technology for extracting data from detailed pages websites based on machine learning #Parsers #webscraper #dataextract #DataMining #data #webdata #importio

Parsers VC @Parsers_vc

8 Aug 2019

We tested where it is faster to configure data extraction from a site with @importio or @Parsers_me Tested on the site that is used in the importio video. 7 times faster! What's better? youtube.com/watch?v=JgaTcjFh… #Webscraper #bestscraper #webdata #dataextraction #ML #data #scraper

GeoscienceAus Data @GeoAusData

9 Jan 2017

New OZTemp Well Temperature Data Extract web service now available: bit.ly/2itxd4T #Temperature #TemperatureData #DataExtract

Shailesh @sjamlokiAtIBM

6 Dec 2016

2.#Slingshot connects to the src #database 2 start the #DataExtract, #transport, #TargetLoad #process.(3/5) #BluemixLift @IBMIIG @IBMBluemix

Shailesh @sjamlokiAtIBM

1 Dec 2016

#IBM #BluemixLift #service automatically #recovers from problems encountered during #dataextract, #transport and load.#cloudcomputing #data