Joined June 2024
33 Photos and videos
Pinned Tweet
Apr 22
We're open-sourcing PulseBench-Tab, a frontier benchmark for table extraction. Table parsing remains one of the hardest and most poorly measured problems in document intelligence. TEDS operates on DOM trees and conflates HTML formatting conventions with structural errors. Needleman-Wunsch linearizes a two-dimensional structure into a one-dimensional sequence, so column transpositions can still score well because values align with nearby cells. GriTS uses greedy grid matching rather than optimal assignment and does not distinguish edge directions. The upshot: existing metrics cannot reliably separate content errors from structural errors, which makes provider comparisons noisy and downstream reliability unknowable. Alongside the dataset, our research team developed T-LAG. It parses each table into a cell-position grid, emits directed RIGHT and BELOW adjacency edges (suppressed within spanning cells, deduplicated by source, target, and direction), weights each candidate edge pair by the product of Levenshtein-derived similarities on source and target text, and uses the Hungarian algorithm for globally optimal one-to-one assignment. The F1 over matched edge weight is the T-LAG score. Structure and content are evaluated in one unified pass. HTML formatting choices do not affect the result. Rankings are invariant to the similarity exponent across k ∈ {7, 8, 9, 11}. The dataset contains 1,820 human-annotated tables across 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells; 48.1% contain merged or spanning cells. Ground truth was produced through 8 annotation rounds with native speakers per language, independent cross-lingual review, and adversarial cell-by-cell audits against source images. We evaluated 9 commercial and open-source systems independently across the full dataset under exclude-missing scoring. Selected findings: @Pulse__AI Ultra 2 scores 0.9347 T-LAG; the next closest system scores 0.8155. Pulse Ultra 2 is the only provider with a median of 1.0, corresponding to perfect extraction on 57.9% of samples. Non-Latin scripts produce the widest cross-provider variance. On Arabic, the spread between top and bottom systems exceeds 75 percentage points. Structural hallucinations are pervasive. The second-ranked system achieves a perfect-extraction rate of 28.6%, meaning structural or content errors on 71.4% of tables (fabricated rows, invented content, incorrect span attributes, shifted data). Coverage failure is underreported. Multiple evaluated systems return no output on 19% to 21% of samples. Raw accuracy numbers without coverage disclosure favor selection bias. Thank you to Dushyanth Sekhar and Mohammed Hadi of S&P Global's Enterprise Data Organization for their academic contributions to the benchmark methodology. Dataset: huggingface.co/datasets/puls… Evaluation: github.com/Pulse-Software-Co… Blog: runpulse.com/blog/pulsebench… Research methodology: benchmark.runpulse.com/resea… Viewer: benchmark.runpulse.com
7
7
33
7,090
May 13
We co-authored a new post on the @awscloud ML Blog with Greg Fina, Jim Fratantoni, and ND Ngoka, walking through how @Pulse__AI extracts clean structured data from financial documents that break traditional OCR and turns it into a fine-tuning set for Amazon Nova Micro on Bedrock.
3
4
13
6,934
May 6
Every enterprise that processes documents eventually hits the same wall. Extracting data from PDFs is one thing, but putting data back into a PDF is where the workflow breaks down. Most enterprise PDFs don't have defined form fields at all. Government forms arrive as scanned images, insurance claims are flat PDFs, and healthcare intake forms have been printed and photocopied beyond recognition. Traditional tools have nothing to work with. @Pulse__AI Form Fill uses our vision-language models to analyze the full document layout, detect where fillable regions exist based on visual context, and place values in the correct locations. You provide structured data or natural language instructions, and Pulse returns a filled PDF. Healthcare teams are already using this to automate pre-auth forms, CMS-1500 claims, and provider enrollment packets through a single API call. Full blog post: runpulse.com/blog/introducin…
2
2
10
374
Apr 24
The @Pulse__AI research team just published a new blog: Why LLMs Cannot Own Enterprise Document Parsing. The core argument: reading a document and ingesting one are different problems. An LLM can read a PDF and give you something clean-looking. Ingestion means preserving exact text, numbers, table structure, and source evidence across millions of documents, deterministically and auditably. That's what enterprise workflows in finance, healthcare, legal, and insurance actually need. The dangerous failure mode of LLM-only parsing is silent: often markdown may look polished, tables look coherent, but underneath, rows are dropped and decimals are incorrectly parsed. One wrong field can corrupt a downstream model, claim, or contract review. The right architecture is hybrid: computer vision for layout, OCR for text, table-specific extraction for structure, and language models for schema mapping, with deterministic validation for business rules.
4
2
12
346
Apr 24
3
110
sid retweeted
Regardless, we ran our newest model, @Pulse__AI Ultra 2, on RD-TableBench. Pulse Ultra 2 is state-of-the-art on RD-TableBench across both mean and median similarity on the full 1,000-sample set. Pulse Ultra 2 (SOTA): 0.9172 mean / 0.9684 median Reducto (2024): 0.9024 mean / 0.9278 median Reducto Agentic (2026): 0.8842 mean / 0.9308 median Reducto (2026): 0.8665 mean / 0.8957 median Mean reflects aggregate performance across the dataset. Median reflects how the model handles a typical document, without being skewed by a handful of unusually easy or hard samples. Read our in-depth analysis here on other benchmarks in the space: runpulse.com/blog/table-extr…
3
15
1,346
sid retweeted
🧵Our team at @Pulse__AI has been analyzing the table extraction benchmarks the industry is converging on. Most don't reflect how these systems get used in enterprise workflows, and in some cases the ground truth itself has issues that change how you'd interpret the leaderboard. The full analysis with examples is in the comments. One specific benchmark worth diving into deeper: We audited all 1,000 RD-TableBench ground truth files against their source images. 43 (4.3%) contain verifiable errors across three categories: scrambled text or wrong structure (28), garbled OCR including watermark bleed and hallucinated characters (13), and buffer artifacts with random digit sequences attached to real numeric values (2). On top of that, 89 of the 1,000 ground truth files are byte-for-byte identical to a single provider output shipped in the same dataset. In 15 of the 43 error cases, the ground truth and this provider share the exact same specific error while independent providers do not. Column headers in the same wrong word order ('tropical forest Arid scrub' instead of 'Arid tropical scrub forest'). Watermark text like pulled into cell content in both, and identical garbled character sequences on CJK samples. Benchmarks are becoming the basis for how document intelligence is being measured, and ground truth quality is the foundation they rest on. We think it's worth the community being able to evaluate that alongside the numbers.
17
15
52
8,250
sid retweeted
Apr 23
Table extraction is one of the hardest unsolved problems in enterprise document intelligence, because tables encode relationships where position is meaning and a number landing in the wrong cell doesn't just look wrong, it breaks downstream analytics, compliance workflows, and financial models. As the space matures, benchmarks are how the industry measures progress, so we took a close look at several of the public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench. What we found is that every one of them has a structural issue that lets providers look better than they actually are. TEDS conflates formatting with structure so that HTML convention differences register as errors. Spatial tolerance parameters can hide missing headers while still reporting high content accuracy. Linearizing 2D tables into 1D sequences means column swaps can align well despite being structurally wrong. And when we audited the 1,000 ground truth files in RD-TableBench against their source images, we identified 43 with verifiable errors, with 89 of them byte-identical to one provider's output and 15 containing the same specific error as that provider while independent providers got it right. This is what motivated us to build PulseBench-Tab: multilingual by design with 2D-aware scoring, structure cleanly separated from formatting, and everything open from the dataset to the scoring code to the per-sample results. Our goal is to help move the industry forward, and we'd love the community to pitch in on how we can continue to improve. Full breakdown, with examples, scoring impact, and the corrected ground truth analysis, linked below. Read the full analysis: runpulse.com/blog/table-extr…
1
7
123
Apr 23
We took a close look at several public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench, and found that every one of them has a structural/methodology issue that doesn’t properly evaluate document intelligence at scale. 1/ OmniDocBench's TEDS metric conflates formatting with structure. A 3x3 table with identical content scores differently based on whether it uses <thead> wrappers or plain <tr>. 2/ SCORE-Bench's spatial tolerance parameter can hide serious failures. Drop a financial table's header row, shift data up by one, and the benchmark reports high content accuracy while the headers are gone. 3/ ParseBench relies on frontier VLMs for ground truth, introducing model bias into the benchmark itself. Its table metric treats records as unordered bags, so column transposition and row reordering go unpenalized. The table set is also 503 pages, 54.5% from a single SERFF source, and English-only. 4/ RD-TableBench linearizes 2D tables into 1D sequences, so column swaps can align well despite being structurally wrong. We also audited all 1,000 ground truth files and found 43 with verifiable errors and 89 byte-identical to one provider's output. That's why we built PulseBench-Tab, a benchmark that is multilingual by design, uses 2D-aware scoring that preserves horizontal and vertical adjacency, cleanly separates structure from formatting, and is fully open from the dataset to the scoring code. Full breakdown by the @Pulse__AI team in comments.
4
2
12
317
sid retweeted
congrats @Pulse__AI @sid_mnk @ritvikpandey21 on the open source release! people obsess over inference throughput but if your document parsing layer is hallucinating rows and spans, none of that matters. what goes into the model is just as important as how fast it comes out coverage failure at 19-21% with nobody reporting it is the stat that should make every production team uncomfortable!
Apr 22
We're open-sourcing PulseBench-Tab, a frontier benchmark for table extraction. Table parsing remains one of the hardest and most poorly measured problems in document intelligence. TEDS operates on DOM trees and conflates HTML formatting conventions with structural errors. Needleman-Wunsch linearizes a two-dimensional structure into a one-dimensional sequence, so column transpositions can still score well because values align with nearby cells. GriTS uses greedy grid matching rather than optimal assignment and does not distinguish edge directions. The upshot: existing metrics cannot reliably separate content errors from structural errors, which makes provider comparisons noisy and downstream reliability unknowable. Alongside the dataset, our research team developed T-LAG. It parses each table into a cell-position grid, emits directed RIGHT and BELOW adjacency edges (suppressed within spanning cells, deduplicated by source, target, and direction), weights each candidate edge pair by the product of Levenshtein-derived similarities on source and target text, and uses the Hungarian algorithm for globally optimal one-to-one assignment. The F1 over matched edge weight is the T-LAG score. Structure and content are evaluated in one unified pass. HTML formatting choices do not affect the result. Rankings are invariant to the similarity exponent across k ∈ {7, 8, 9, 11}. The dataset contains 1,820 human-annotated tables across 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells; 48.1% contain merged or spanning cells. Ground truth was produced through 8 annotation rounds with native speakers per language, independent cross-lingual review, and adversarial cell-by-cell audits against source images. We evaluated 9 commercial and open-source systems independently across the full dataset under exclude-missing scoring. Selected findings: @Pulse__AI Ultra 2 scores 0.9347 T-LAG; the next closest system scores 0.8155. Pulse Ultra 2 is the only provider with a median of 1.0, corresponding to perfect extraction on 57.9% of samples. Non-Latin scripts produce the widest cross-provider variance. On Arabic, the spread between top and bottom systems exceeds 75 percentage points. Structural hallucinations are pervasive. The second-ranked system achieves a perfect-extraction rate of 28.6%, meaning structural or content errors on 71.4% of tables (fabricated rows, invented content, incorrect span attributes, shifted data). Coverage failure is underreported. Multiple evaluated systems return no output on 19% to 21% of samples. Raw accuracy numbers without coverage disclosure favor selection bias. Thank you to Dushyanth Sekhar and Mohammed Hadi of S&P Global's Enterprise Data Organization for their academic contributions to the benchmark methodology. Dataset: huggingface.co/datasets/puls… Evaluation: github.com/Pulse-Software-Co… Blog: runpulse.com/blog/pulsebench… Research methodology: benchmark.runpulse.com/resea… Viewer: benchmark.runpulse.com
1
1
4
299
sid retweeted
Good benchmarks (not saturated, non-trivial tasks, aligned with the distribution of real work) are still rare. Great work from the @pulse__ai team in partnership with @SPGlobal. If you work on or utilize document AI in your workflows, it's worth a read. More from the team soon! @sid_mnk @ritvikpandey21
Apr 22
We're open-sourcing PulseBench-Tab, a frontier benchmark for table extraction. Table parsing remains one of the hardest and most poorly measured problems in document intelligence. TEDS operates on DOM trees and conflates HTML formatting conventions with structural errors. Needleman-Wunsch linearizes a two-dimensional structure into a one-dimensional sequence, so column transpositions can still score well because values align with nearby cells. GriTS uses greedy grid matching rather than optimal assignment and does not distinguish edge directions. The upshot: existing metrics cannot reliably separate content errors from structural errors, which makes provider comparisons noisy and downstream reliability unknowable. Alongside the dataset, our research team developed T-LAG. It parses each table into a cell-position grid, emits directed RIGHT and BELOW adjacency edges (suppressed within spanning cells, deduplicated by source, target, and direction), weights each candidate edge pair by the product of Levenshtein-derived similarities on source and target text, and uses the Hungarian algorithm for globally optimal one-to-one assignment. The F1 over matched edge weight is the T-LAG score. Structure and content are evaluated in one unified pass. HTML formatting choices do not affect the result. Rankings are invariant to the similarity exponent across k ∈ {7, 8, 9, 11}. The dataset contains 1,820 human-annotated tables across 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells; 48.1% contain merged or spanning cells. Ground truth was produced through 8 annotation rounds with native speakers per language, independent cross-lingual review, and adversarial cell-by-cell audits against source images. We evaluated 9 commercial and open-source systems independently across the full dataset under exclude-missing scoring. Selected findings: @Pulse__AI Ultra 2 scores 0.9347 T-LAG; the next closest system scores 0.8155. Pulse Ultra 2 is the only provider with a median of 1.0, corresponding to perfect extraction on 57.9% of samples. Non-Latin scripts produce the widest cross-provider variance. On Arabic, the spread between top and bottom systems exceeds 75 percentage points. Structural hallucinations are pervasive. The second-ranked system achieves a perfect-extraction rate of 28.6%, meaning structural or content errors on 71.4% of tables (fabricated rows, invented content, incorrect span attributes, shifted data). Coverage failure is underreported. Multiple evaluated systems return no output on 19% to 21% of samples. Raw accuracy numbers without coverage disclosure favor selection bias. Thank you to Dushyanth Sekhar and Mohammed Hadi of S&P Global's Enterprise Data Organization for their academic contributions to the benchmark methodology. Dataset: huggingface.co/datasets/puls… Evaluation: github.com/Pulse-Software-Co… Blog: runpulse.com/blog/pulsebench… Research methodology: benchmark.runpulse.com/resea… Viewer: benchmark.runpulse.com
4
3
18
4,731
sid retweeted
Pulse just open-sourced PulseBench-Tab, a frontier benchmark for table extraction across 9 languages, alongside T-LAG, a new scoring metric that evaluates structure and content in one unified score. After processing billions of pages for Fortune 50 enterprises, the largest investment firms, and the leading AI startups, @Pulse__AI is opening up the evaluation methodology behind how they train and measure their models. Most existing benchmarks miss the structural relationships that determine whether a table was actually understood, and T-LAG fixes that. Pulse now sets the standard for what production-grade extraction actually requires. Congrats on the launch, @sid_mnk and @ritvikpandey21! runpulse.com/blog/pulsebench…
5
7
45
8,041
Apr 21
Ultra 2 from @Pulse__AI, our highest-accuracy document extraction model and the one now running in production at some of the largest banks and Fortune 100 enterprises. A year of work went into a new training corpus, a new annotation pipeline, and an entirely new architecture built from the ground up. The previous generation routed every document through a chain of independent models: OCR, layout detection, table detection, cell extraction, structure reconstruction. Each hop was a new failure point, and upstream mistakes cascaded downstream. Pulse Ultra 2 replaces that entire pipeline with a unified end-to-end architecture. One model, one forward pass, jointly solving layout understanding, text recognition, cell detection, and structure prediction. The accuracy gains show up most on the hardest document layouts: merged cells, borderless tables, multilingual documents across 100 languages, and complex multi-table pages. Given the high computational demand, we're rolling out with specific rate limits for non-enterprise accounts while we scale capacity. Reach out to our team with the link below for higher rate limits or dedicated capacity.
5
3
10
269
Apr 21
Read more below: runpulse.com/blog/introducin… Documentation: docs.runpulse.com/api-refere… Reach out for higher limits: runpulse.com/pulse-ultra-2-a…
2
50
Apr 14
👀
Replying to @kepano
I just tried it this morning on the 245-page Mythos pdf and it failed badly and the outputs were all mangled. Converting pdfs is really hard, I think it has to probably be a Skill not a program, for a SOTA LLM for it to work properly.
2
106
sid retweeted
Replying to @karpathy @kepano
this is what we do at @pulse__ai day to day. ran a cool page from the mythos system card on pulse, the important bit is properly parsing chart info into structured tabular formats so agents can work with it. i use this a good bit in my personal projects with obsidian!
1
5
533
sid retweeted
I am super excited to share that @Starcloud_ has raised a $170M Series A at a $1.1bn valuation to fuel our development of data centers in space 🚀 The round comes after the successful deployment of our first satellite, Starcould-1, a few months ago, which had the first @NVIDIA H100 on board and was the first to train an LLM in space. The funds will be used to develop our third satellite, which aims to be cost-competitive with Earth-based data centers in terms of AI inference cost. The round was led by @Benchmark and @EQT Ventures, and we are excited to welcome Benchmark GP, @Chetanp Puttagunta, to our board. We are also excited to welcome other new investors, including the world's largest infrastructure fund, @Macquarie Capital, @SevenSevenSix 7️⃣7️⃣6️⃣, Manhattan West, Adjacent, Carya, GSBackers, and Harpoon. We are very grateful for the continued support of existing investors, including @NFX@NebularVC@YCombinator@FUSE_VC@Soma_Capital, 3Capital Partners, Wyld VC, Tiny VC, and Taurus Ventures. Onwards!
222
164
1,773
472,717
sid retweeted
Exclusive: Andromeda started out as a project by AI-focused VC firm NFDG. Now it's a standalone startup with a run rate of $100M and a $1.5B valuation after raising $60M from Paradigm. I spoke to CEO @WMoushey about his plans to build a better market around global AI compute.
7
17
110
27,039