🎉Congratulations to our partner @databricks on the launch of the OfficeQA Benchmark.
Enterprises can use the OfficeQA Benchmark to measure whether AI systems can handle the messy, high-precision tasks found in real business workflows. Teams can now more easily identify gaps, compare models, and make informed decisions about when AI is ready for deployment.
The benchmark was developed using a large dataset: nearly 89,000 pages of historical U.S. Treasury Bulletins (documents spanning decades, with scanned pages, PDFs, complex tables, charts, figures, and a mix of unstructured and structured data).
📣SuperAnnotate is proud to have powered the dataset and annotation rubrics behind this benchmark and to collaborate with the incredible Databricks team: Arnav Singhvi, Krista Opsahl-Ong, Jasmine Collins, @ivanzhouyq, @cindyxinyiwang, Ashutosh Baheti, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, @matei_zaharia, and Xing Chen.
Today we’re introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face.
Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing dense tables, navigating scanned PDFs, and retrieving facts across decades of documents.
Even strong agents reach only ~45% accuracy, showing how far the field has to go. The benchmark is now open to the community, and the Databricks Grounded Reasoning Cup in Spring 2026 will challenge teams to push these capabilities forward.
databricks.com/blog/introduc…