A Java library for extracting tables from text-based and scanned PDFs (github.com)

1 point by mehulimukherjee 11 hours ago | 1 comment

mehulimukherjee 11 hours ago
Hi HN,
Over the past year I’ve been working on ExtractPDF4J, an open-source Java library for extracting tables from real-world PDFs.
Many document processing pipelines rely on PDFs like bank statements, financial reports, or invoices. In practice these files are inconsistent: some are text-based, others are scanned images, and many contain irregular layouts or multi-page tables.
Most existing tools in this space are Python-based (like Camelot or tabula-py). In JVM-heavy environments this often means running a separate Python service or building a hybrid stack.
ExtractPDF4J was designed to solve this problem directly in Java.
Key ideas behind the project:
• Hybrid parsing strategies (stream + lattice detection)
• OCR fallback for scanned documents
• CLI and service modules for production workflows
• Maven Central distribution for easy integration
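For anyone unfamiliar with the "stream" idea, here is a rough sketch in plain Java. To be clear, this is not ExtractPDF4J's API or its actual implementation; the class `StreamSketch` and its methods are hypothetical names made up for illustration. It assumes the page text has already been extracted line by line, and it treats any run of two or more whitespace characters as a column boundary. Lattice detection, by contrast, looks for ruling lines drawn on the page and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StreamSketch {

    // Hypothetical helper: split one line of extracted text into cells
    // wherever two or more whitespace characters appear in a row.
    // A single space is kept inside a cell (e.g. "Coffee shop").
    static List<String> splitRow(String line) {
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    // Turn the non-blank lines of a page into rows of cells.
    static List<List<String>> parse(List<String> lines) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.isBlank()) {
                rows.add(splitRow(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample text, standing in for one page of a statement.
        List<String> page = List.of(
            "Date        Description        Amount",
            "2024-01-03  Coffee shop        -4.50",
            "",
            "2024-01-05  Salary deposit     2500.00");
        for (List<String> row : parse(page)) {
            System.out.println(row);
        }
    }
}
```

A whitespace heuristic like this breaks down quickly on real documents (wrapped cells, uneven spacing, scanned pages with no text layer), which is the kind of case where combining strategies and falling back to OCR becomes useful.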
The latest release also introduced a BOM module to simplify dependency management and a full documentation site.
Project: https://github.com/ExtractPDF4J/ExtractPDF4J
Docs: https://extractpdf4j.github.io/ExtractPDF4J/
I’d really appreciate feedback from people who have dealt with messy PDF extraction problems. Suggestions and contributions are welcome, and if you find the project useful, a star on the repo helps it reach more of the Java community. Thank you!