Production-grade unstructured document extraction

Joined June 2024
4 Photos and videos
Pulse retweeted
May 13
We co-authored a new post on the @awscloud ML Blog with Greg Fina, Jim Fratantoni, and ND Ngoka, walking through how @Pulse__AI extracts clean structured data from financial documents that break traditional OCR and turns it into a fine-tuning set for Amazon Nova Micro on Bedrock.
3
4
13
6,934
Pulse retweeted
May 6
Every enterprise that processes documents eventually hits the same wall. Extracting data from PDFs is one thing, but putting data back into a PDF is where the workflow breaks down. Most enterprise PDFs don't have defined form fields at all. Government forms arrive as scanned images, insurance claims are flat PDFs, and healthcare intake forms have been printed and photocopied beyond recognition. Traditional tools have nothing to work with. @Pulse__AI Form Fill uses our vision-language models to analyze the full document layout, detect where fillable regions exist based on visual context, and place values in the correct locations. You provide structured data or natural language instructions, and Pulse returns a filled PDF. Healthcare teams are already using this to automate pre-auth forms, CMS-1500 claims, and provider enrollment packets through a single API call. Full blog post: runpulse.com/blog/introducin…
2
2
10
374
Pulse retweeted
Apr 24
The @Pulse__AI research team just published a new blog: Why LLMs Cannot Own Enterprise Document Parsing. The core argument: reading a document and ingesting one are different problems. An LLM can read a PDF and give you something clean-looking. Ingestion means preserving exact text, numbers, table structure, and source evidence across millions of documents, deterministically and auditably. That's what enterprise workflows in finance, healthcare, legal, and insurance actually need. The dangerous failure mode of LLM-only parsing is silent: often markdown may look polished, tables look coherent, but underneath, rows are dropped and decimals are incorrectly parsed. One wrong field can corrupt a downstream model, claim, or contract review. The right architecture is hybrid: computer vision for layout, OCR for text, table-specific extraction for structure, and language models for schema mapping, with deterministic validation for business rules.
4
2
12
346
Pulse retweeted
Regardless, we ran our newest model, @Pulse__AI Ultra 2, on RD-TableBench. Pulse Ultra 2 is state-of-the-art on RD-TableBench across both mean and median similarity on the full 1,000-sample set. Pulse Ultra 2 (SOTA): 0.9172 mean / 0.9684 median Reducto (2024): 0.9024 mean / 0.9278 median Reducto Agentic (2026): 0.8842 mean / 0.9308 median Reducto (2026): 0.8665 mean / 0.8957 median Mean reflects aggregate performance across the dataset. Median reflects how the model handles a typical document, without being skewed by a handful of unusually easy or hard samples. Read our in-depth analysis here on other benchmarks in the space: runpulse.com/blog/table-extr…
3
15
1,346
Pulse retweeted

Regardless, we ran our newest model, @Pulse__AI Ultra 2, on RD-TableBench. Pulse Ultra 2 is state-of-the-art on RD-TableBench across both mean and median similarity on the full 1,000-sample set. Pulse Ultra 2 (SOTA): 0.9172 mean / 0.9684 median Reducto (2024): 0.9024 mean / 0.9278 median Reducto Agentic (2026): 0.8842 mean / 0.9308 median Reducto (2026): 0.8665 mean / 0.8957 median Mean reflects aggregate performance across the dataset. Median reflects how the model handles a typical document, without being skewed by a handful of unusually easy or hard samples. Read our in-depth analysis here on other benchmarks in the space: runpulse.com/blog/table-extr…
2
5
867
Pulse retweeted
🧵Our team at @Pulse__AI has been analyzing the table extraction benchmarks the industry is converging on. Most don't reflect how these systems get used in enterprise workflows, and in some cases the ground truth itself has issues that change how you'd interpret the leaderboard. The full analysis with examples is in the comments. One specific benchmark worth diving into deeper: We audited all 1,000 RD-TableBench ground truth files against their source images. 43 (4.3%) contain verifiable errors across three categories: scrambled text or wrong structure (28), garbled OCR including watermark bleed and hallucinated characters (13), and buffer artifacts with random digit sequences attached to real numeric values (2). On top of that, 89 of the 1,000 ground truth files are byte-for-byte identical to a single provider output shipped in the same dataset. In 15 of the 43 error cases, the ground truth and this provider share the exact same specific error while independent providers do not. Column headers in the same wrong word order ('tropical forest Arid scrub' instead of 'Arid tropical scrub forest'). Watermark text like pulled into cell content in both, and identical garbled character sequences on CJK samples. Benchmarks are becoming the basis for how document intelligence is being measured, and ground truth quality is the foundation they rest on. We think it's worth the community being able to evaluate that alongside the numbers.
17
15
52
8,250
Table extraction is one of the hardest unsolved problems in enterprise document intelligence, because tables encode relationships where position is meaning and a number landing in the wrong cell doesn't just look wrong, it breaks downstream analytics, compliance workflows, and financial models. As the space matures, benchmarks are how the industry measures progress, so we took a close look at several of the public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench. What we found is that every one of them has a structural issue that lets providers look better than they actually are. TEDS conflates formatting with structure so that HTML convention differences register as errors. Spatial tolerance parameters can hide missing headers while still reporting high content accuracy. Linearizing 2D tables into 1D sequences means column swaps can align well despite being structurally wrong. And when we audited the 1,000 ground truth files in RD-TableBench against their source images, we identified 43 with verifiable errors, with 89 of them byte-identical to one provider's output and 15 containing the same specific error as that provider while independent providers got it right. This is what motivated us to build PulseBench-Tab: multilingual by design with 2D-aware scoring, structure cleanly separated from formatting, and everything open from the dataset to the scoring code to the per-sample results. Our goal is to help move the industry forward, and we'd love the community to pitch in on how we can continue to improve. Full breakdown, with examples, scoring impact, and the corrected ground truth analysis, linked below. Read the full analysis: runpulse.com/blog/table-extr…
1
7
123
Pulse retweeted
Apr 23
We took a close look at several public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench, and found that every one of them has a structural/methodology issue that doesn’t properly evaluate document intelligence at scale. 1/ OmniDocBench's TEDS metric conflates formatting with structure. A 3x3 table with identical content scores differently based on whether it uses <thead> wrappers or plain <tr>. 2/ SCORE-Bench's spatial tolerance parameter can hide serious failures. Drop a financial table's header row, shift data up by one, and the benchmark reports high content accuracy while the headers are gone. 3/ ParseBench relies on frontier VLMs for ground truth, introducing model bias into the benchmark itself. Its table metric treats records as unordered bags, so column transposition and row reordering go unpenalized. The table set is also 503 pages, 54.5% from a single SERFF source, and English-only. 4/ RD-TableBench linearizes 2D tables into 1D sequences, so column swaps can align well despite being structurally wrong. We also audited all 1,000 ground truth files and found 43 with verifiable errors and 89 byte-identical to one provider's output. That's why we built PulseBench-Tab, a benchmark that is multilingual by design, uses 2D-aware scoring that preserves horizontal and vertical adjacency, cleanly separates structure from formatting, and is fully open from the dataset to the scoring code. Full breakdown by the @Pulse__AI team in comments.
4
2
12
317
Pulse retweeted
Great work by @sid_mnk and team!
Apr 22
We're open-sourcing PulseBench-Tab, a frontier benchmark for table extraction. Table parsing remains one of the hardest and most poorly measured problems in document intelligence. TEDS operates on DOM trees and conflates HTML formatting conventions with structural errors. Needleman-Wunsch linearizes a two-dimensional structure into a one-dimensional sequence, so column transpositions can still score well because values align with nearby cells. GriTS uses greedy grid matching rather than optimal assignment and does not distinguish edge directions. The upshot: existing metrics cannot reliably separate content errors from structural errors, which makes provider comparisons noisy and downstream reliability unknowable. Alongside the dataset, our research team developed T-LAG. It parses each table into a cell-position grid, emits directed RIGHT and BELOW adjacency edges (suppressed within spanning cells, deduplicated by source, target, and direction), weights each candidate edge pair by the product of Levenshtein-derived similarities on source and target text, and uses the Hungarian algorithm for globally optimal one-to-one assignment. The F1 over matched edge weight is the T-LAG score. Structure and content are evaluated in one unified pass. HTML formatting choices do not affect the result. Rankings are invariant to the similarity exponent across k ∈ {7, 8, 9, 11}. The dataset contains 1,820 human-annotated tables across 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells; 48.1% contain merged or spanning cells. Ground truth was produced through 8 annotation rounds with native speakers per language, independent cross-lingual review, and adversarial cell-by-cell audits against source images. We evaluated 9 commercial and open-source systems independently across the full dataset under exclude-missing scoring. Selected findings: @Pulse__AI Ultra 2 scores 0.9347 T-LAG; the next closest system scores 0.8155. Pulse Ultra 2 is the only provider with a median of 1.0, corresponding to perfect extraction on 57.9% of samples. Non-Latin scripts produce the widest cross-provider variance. On Arabic, the spread between top and bottom systems exceeds 75 percentage points. Structural hallucinations are pervasive. The second-ranked system achieves a perfect-extraction rate of 28.6%, meaning structural or content errors on 71.4% of tables (fabricated rows, invented content, incorrect span attributes, shifted data). Coverage failure is underreported. Multiple evaluated systems return no output on 19% to 21% of samples. Raw accuracy numbers without coverage disclosure favor selection bias. Thank you to Dushyanth Sekhar and Mohammed Hadi of S&P Global's Enterprise Data Organization for their academic contributions to the benchmark methodology. Dataset: huggingface.co/datasets/puls… Evaluation: github.com/Pulse-Software-Co… Blog: runpulse.com/blog/pulsebench… Research methodology: benchmark.runpulse.com/resea… Viewer: benchmark.runpulse.com
1
4
457
Pulse retweeted
Pulse just open-sourced PulseBench-Tab, a frontier benchmark for table extraction across 9 languages, alongside T-LAG, a new scoring metric that evaluates structure and content in one unified score. After processing billions of pages for Fortune 50 enterprises, the largest investment firms, and the leading AI startups, @Pulse__AI is opening up the evaluation methodology behind how they train and measure their models. Most existing benchmarks miss the structural relationships that determine whether a table was actually understood, and T-LAG fixes that. Pulse now sets the standard for what production-grade extraction actually requires. Congrats on the launch, @sid_mnk and @ritvikpandey21! runpulse.com/blog/pulsebench…
5
7
45
8,041
Pulse retweeted
Apr 21
Ultra 2 from @Pulse__AI, our highest-accuracy document extraction model and the one now running in production at some of the largest banks and Fortune 100 enterprises. A year of work went into a new training corpus, a new annotation pipeline, and an entirely new architecture built from the ground up. The previous generation routed every document through a chain of independent models: OCR, layout detection, table detection, cell extraction, structure reconstruction. Each hop was a new failure point, and upstream mistakes cascaded downstream. Pulse Ultra 2 replaces that entire pipeline with a unified end-to-end architecture. One model, one forward pass, jointly solving layout understanding, text recognition, cell detection, and structure prediction. The accuracy gains show up most on the hardest document layouts: merged cells, borderless tables, multilingual documents across 100 languages, and complex multi-table pages. Given the high computational demand, we're rolling out with specific rate limits for non-enterprise accounts while we scale capacity. Reach out to our team with the link below for higher rate limits or dedicated capacity.
5
3
10
269
Pulse retweeted
Mar 12
Footnote extraction is now available on the @Pulse__AI platform. Every footnote is returned as a structured object with its reference marker, full text, and positional metadata linking it back to exactly where it was cited in the document. Available now here: platform.runpulse.com/
1
5
19
6,662
Pulse retweeted
Feb 18
The dirty secret of document AI: most accuracy problems aren't model problems. They're scoping problems. If your schema is running against the full document, you're asking your model to find signal across cover pages, appendices, boilerplate, and blank pages. It doesn't know what matters. You do. A billion pages taught us the fix is upstream: define which pages are relevant before extraction runs. Clean inputs, clean outputs. @Pulse__AI is launching Split for public access today.
6
4
31
7,549
Pulse retweeted
Agentic OCR is everywhere in document AI conversations right now. We published a breakdown of what it actually means, the tradeoffs that matter for production, and why hybrid architectures usually win. More below!
4
2
11
959
Pulse retweeted
Jan 14
shoutout @NotionHQ
3
3
15
2,927
Pulse retweeted
We assumed structured outputs had solved document extraction. Then we tried complex schemas at scale. New post on the computational complexity of schema-guided extraction: why JSON schemas require pushdown automata, how state explosion happens, and why tighter constraints can actually hurt accuracy.
2
2
8
319
Pulse retweeted
Most spreadsheet errors are not obvious failures. They are structural ones. Flatten the grid and you lose meaning before the model ever reasons. After millions of XLSX pages in production, the takeaway was clear. Representation was the bottleneck, not scale. Advanced spreadsheet parsing is now generally available in @Pulse__AI - due to high demand, message our team to get started. Better structure. Fewer silent errors. Real business impact at production scale.
4
3
20
7,838
Pulse retweeted
Accuracy is only part of the document AI problem. In production systems, meaning depends on layout. Tables, hierarchy, and proximity determine how values should be interpreted and verified. Layout segmentation and bounding boxes preserve that structure. They enable citations, traceability, and review long after extraction. We wrote a technical post on why layout needs to be treated as a first-class primitive in document AI systems, with examples from finance, healthcare, and legal workflows. Link below.
1
2
9
370
Pulse retweeted
Merry Christmas! A common request we get is how to share extraction results with someone who doesn't have a @Pulse__AI account. Now you can. Shared links with configurable expiration and org-level visibility controls. Simple feature, but it removes friction in a lot of workflows.
1
2
8
364
Pulse retweeted
Webhooks are now live in @Pulse__AI . Real-time job notifications, auto-retries, signed payloads. If you're running async extraction at scale, no more polling. Small addition, but it came up a lot.
2
2
6
326