We took a close look at several public table extraction benchmarks, including OmniDocBench, SCORE-Bench, ParseBench, and RD-TableBench, and found that every one of them has a structural or methodological issue that keeps it from properly evaluating document intelligence at scale.
1/ OmniDocBench's TEDS metric conflates formatting with structure. A 3x3 table with identical content scores differently based on whether it uses <thead> wrappers or plain <tr>.
2/ SCORE-Bench's spatial tolerance parameter can hide serious failures. Drop a financial table's header row, shift data up by one, and the benchmark reports high content accuracy while the headers are gone.
3/ ParseBench relies on frontier VLMs for ground truth, introducing model bias into the benchmark itself. Its table metric treats records as unordered bags, so column transposition and row reordering go unpenalized. The table set is also small and narrow: 503 pages, 54.5% of them from a single SERFF source, and English-only.
4/ RD-TableBench linearizes 2D tables into 1D sequences, so column swaps can align well despite being structurally wrong. We also audited all 1,000 ground truth files and found 43 with verifiable errors and 89 byte-identical to one provider's output.
That's why we built PulseBench-Tab, a benchmark that is multilingual by design, uses 2D-aware scoring that preserves horizontal and vertical adjacency, cleanly separates structure from formatting, and is fully open from the dataset to the scoring code. Full breakdown by the @Pulse__AI team in comments.
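
To make the unordered-bag point concrete, here is a toy sketch in Python. It is not the scoring code of ParseBench, PulseBench-Tab, or any other benchmark named above; the functions `bag_of_records_score`, `adjacency_pairs`, and `adjacency_score` are hypothetical names invented for this illustration, and the table data is made up. It only shows why a metric that ignores order can miss a column swap plus a row reorder, while a metric that compares horizontal and vertical neighbors does not.

```python
from collections import Counter


def bag_of_records_score(pred, gold):
    """Toy order-blind metric: each data row becomes a set of
    (header, value) pairs, and rows are compared as an unordered bag."""
    def records(table):
        header, *rows = table
        return Counter(frozenset(zip(header, row)) for row in rows)

    p, g = records(pred), records(gold)
    overlap = sum((p & g).values())
    total = max(sum(p.values()), sum(g.values()))
    return overlap / total if total else 1.0


def adjacency_pairs(table):
    """Collect directed neighbor pairs: ('h', cell, right neighbor)
    and ('v', cell, neighbor below)."""
    pairs = Counter()
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            if c + 1 < len(row):
                pairs[("h", cell, row[c + 1])] += 1
            if r + 1 < len(table) and c < len(table[r + 1]):
                pairs[("v", cell, table[r + 1][c])] += 1
    return pairs


def adjacency_score(pred, gold):
    """Toy 2D-aware metric: fraction of neighbor pairs shared by both tables."""
    p, g = adjacency_pairs(pred), adjacency_pairs(gold)
    overlap = sum((p & g).values())
    total = max(sum(p.values()), sum(g.values()))
    return overlap / total if total else 1.0


# Made-up example data (an assumption for illustration only).
gold = [
    ["Item", "Q1", "Q2"],
    ["Revenue", "10", "20"],
    ["Cost", "4", "7"],
]
# Same cells, but the Q1/Q2 columns are swapped and the data rows reordered.
shuffled = [
    ["Item", "Q2", "Q1"],
    ["Cost", "7", "4"],
    ["Revenue", "20", "10"],
]

print("bag of records:", bag_of_records_score(shuffled, gold))
print("2D adjacency:  ", adjacency_score(shuffled, gold))
```

On this example the order-blind score treats the shuffled table as a perfect match, while the adjacency score drops because the neighbor relationships no longer line up. A real 2D-aware metric would also need to handle merged cells, fuzzy text matching, and partial credit, none of which this sketch attempts.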