Resume parsing is the foundation of every ATS filtering tool. So I looked at the benchmarks.
The results aren't great.
ResumeBench, an EMNLP 2025 study from November, tested 24 LLMs on structured resume extraction: 2,500 resumes, 50 templates, 5 languages, real JSON schema scoring. The models are a bit out of date, but here is what it found:
→ GPT-4o struggles with multi-column layouts and cross-lingual structure
→ Code-specialized models actually perform worse than generalists
→ JSON mode helps schema compliance but doesn't fix semantic errors
→ Smaller models collapse nested job histories: merging roles, dropping bullets
→ Reasoning models performed worse than their base counterparts
→ Most of the LLMs use pypdf/pytext, which can have issues with PDF resume formats or image-based resumes
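To make the JSON-mode point concrete, here is a minimal sketch of a record that passes a schema check but is still wrong. Everything in it is an assumption for illustration: the schema, the sample resume text, and both helper functions are hypothetical and are not taken from ResumeBench.

```python
from datetime import date

# Hypothetical minimal schema: required keys and their expected types.
JOB_SCHEMA = {"title": str, "company": str, "start": str, "end": str}

def is_schema_valid(job: dict) -> bool:
    """True if every required key is present with the expected type."""
    return all(isinstance(job.get(key), typ) for key, typ in JOB_SCHEMA.items())

def semantic_problems(job: dict, source_text: str) -> list[str]:
    """Checks a schema validator cannot do: compare values to the source."""
    problems = []
    lines = [ln.lower() for ln in source_text.splitlines()]
    title, company = job["title"].lower(), job["company"].lower()
    # A real role should have its title and employer on the same source line
    # (a simplifying assumption about the layout of this sample text).
    if not any(title in ln and company in ln for ln in lines):
        problems.append("title and company never appear on the same source line")
    if date.fromisoformat(job["start"]) > date.fromisoformat(job["end"]):
        problems.append("start date is after end date")
    return problems

resume_text = (
    "Senior Analyst, Acme Corp, 2019-03 to 2021-06\n"
    "Data Engineer, Initech, 2021-07 to 2023-01"
)

# Two roles merged into one record: well-formed output, wrong content.
merged = {
    "title": "Senior Analyst",
    "company": "Initech",
    "start": "2019-03-01",
    "end": "2023-01-31",
}

print(is_schema_valid(merged))                # schema check passes
print(semantic_problems(merged, resume_text)) # semantic check does not
```

The schema check only sees shapes and types, so a merged job history sails through. Catching it takes a comparison against the source document.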
If the parsing is unreliable, every decision on matching, ranking, and screening inherits that unreliability. And now it can become a legal liability, after the recent Eightfold class-action lawsuit (which alleges using AI to covertly score and rank candidates: scraping social media, location data, and browsing activity without disclosure).
I'm also watching Google's new open-source tool LangExtract closely. Different approach: every piece of extracted data maps back to its exact location in the original document, with a visual verification layer. That kind of traceability matters a lot more now, especially for resumes.
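The idea of mapping each extracted value back to its source location can be sketched in a few lines. This is not LangExtract's API; `GroundedField`, `ground`, and the sample data are hypothetical names made up for this example, and the lookup is a plain case-insensitive substring search.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundedField:
    name: str
    value: str
    start: Optional[int]  # character offset in the source, None if not found
    end: Optional[int]

def ground(name: str, value: str, source: str) -> GroundedField:
    """Attach a source span to an extracted value, or mark it ungrounded."""
    # Case-insensitive search; offsets stay valid for ASCII text, which is
    # all this sketch assumes.
    idx = source.lower().find(value.lower())
    if idx == -1:
        return GroundedField(name, value, None, None)
    return GroundedField(name, value, idx, idx + len(value))

source = (
    "Jane Doe\n"
    "Certified Kubernetes Administrator (2022)\n"
    "Acme Corp, Platform Engineer"
)

# Hypothetical model output: one real value, one that is not in the resume.
extracted = {
    "certification": "Certified Kubernetes Administrator",
    "employer": "Globex",
}

fields = [ground(name, value, source) for name, value in extracted.items()]
ungrounded = [f.name for f in fields if f.start is None]
print(ungrounded)
```

Any field that cannot be tied to a span in the original document is a candidate for human review instead of silent acceptance.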
At @classet_ai, we've taken a different stance on this entirely. We treat the resume as optional supplemental material rather than the source of truth. Our AI-powered phone interviews let candidates give context that a resume never captures: why they left a role, what tools they actually used daily, whether they're open to relocation. The conversation is the core data.
On the parsing side, we don't trust any single approach. We blend multiple AI models with traditional parsing engines like Senseloaf to make sure we're not missing structured data. If one model drops a certification or merges two jobs together, another catches it.
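As a rough illustration of that cross-checking idea (not our production pipeline and not Senseloaf's API), here is a sketch that merges the outputs of several parsers and flags anything they disagree on. The parser names, the sample outputs, and `reconcile` are all hypothetical, and exact string matching stands in for whatever fuzzy matching a real system would need.

```python
def reconcile(
    outputs: dict[str, dict[str, list[str]]]
) -> tuple[dict[str, list[str]], list[str]]:
    """Union the items each parser found and flag items not seen by all."""
    merged: dict[str, list[str]] = {}
    flags: list[str] = []
    fields = sorted({field for out in outputs.values() for field in out})
    for field in fields:
        seen: dict[str, list[str]] = {}  # item -> parsers that reported it
        for parser, out in outputs.items():
            for item in out.get(field, []):
                seen.setdefault(item, []).append(parser)
        merged[field] = list(seen)
        for item, parsers in seen.items():
            if len(parsers) < len(outputs):
                flags.append(f"{field}: {item!r} only from {', '.join(parsers)}")
    return merged, flags

# Hypothetical outputs: llm_a drops a certification, llm_b merges two jobs.
outputs = {
    "llm_a": {"jobs": ["Analyst", "Engineer"], "certifications": []},
    "llm_b": {"jobs": ["Analyst / Engineer"], "certifications": ["PMP"]},
    "rules_engine": {"jobs": ["Analyst", "Engineer"], "certifications": ["PMP"]},
}

merged, flags = reconcile(outputs)
for flag in flags:
    print(flag)
```

Taking the union means a dropped certification is recovered from whichever parser kept it, while the flags surface the merged-jobs disagreement for review instead of picking a winner silently.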
Resume parsing is not a solved problem. Stop making resume filtering the entire decision.
[Image: 24 LLMs tested on 2,500 resumes across 50 templates, 30 career fields, and 5 languages. Even the best models struggle with structural accuracy, and every hiring decision downstream inherits that error.]