GitHub: A Comprehensive Toolkit for High-Quality PDF Content Extraction
- Integrates leading document parsing models for layout detection, formula detection, formula recognition, OCR, and table recognition.
- Achieves high-quality parsing across diverse document types because the models are fine-tuned on varied document annotation data.
- Provides comprehensive PDF evaluation benchmarks that help users select suitable models based on measured performance.
- Includes pre-trained models for each of these tasks.
- Supports the core document parsing tasks: layout detection, formula detection, formula recognition, OCR, and table recognition. Reading order functionality is planned.
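
The tasks listed above are typically chained: a detector finds regions on a page, and each region is passed to the recognizer for its type. The sketch below illustrates that flow only. It does not use the toolkit's API; every name in it (`Region`, `detect_layout`, `RECOGNIZERS`, `parse_page`) is hypothetical, and the detector and recognizers are stubs standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """A detected page region (hypothetical type, not part of the toolkit)."""
    kind: str                      # e.g. "text", "formula", "table"
    bbox: Tuple[int, int, int, int]
    content: str = ""


def detect_layout(page: dict) -> List[Region]:
    # Stub: a real layout/formula detector would predict regions from a page image.
    return [Region(kind=kind, bbox=bbox) for kind, bbox in page["regions"]]


# Stub recognizers standing in for OCR, formula recognition, and table recognition.
RECOGNIZERS: Dict[str, Callable[[Region], str]] = {
    "text": lambda r: f"<ocr text at {r.bbox}>",
    "formula": lambda r: f"<latex at {r.bbox}>",
    "table": lambda r: f"<table markup at {r.bbox}>",
}


def parse_page(page: dict) -> List[Region]:
    """Detect regions, then route each one to the recognizer for its kind."""
    regions = detect_layout(page)
    for region in regions:
        recognizer = RECOGNIZERS.get(region.kind)
        if recognizer is not None:  # unknown kinds are kept but left unrecognized
            region.content = recognizer(region)
    return regions


if __name__ == "__main__":
    page = {
        "regions": [
            ("text", (0, 0, 100, 20)),
            ("formula", (0, 30, 100, 50)),
            ("table", (0, 60, 100, 120)),
        ]
    }
    for region in parse_page(page):
        print(region.kind, "->", region.content)
```

Keeping the recognizers in a lookup table keyed by region type means a planned step such as reading order could be added as a separate stage after recognition without changing the dispatch logic.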