Show HN: I just built a scanned PDF text extractor for public PDFs (1-300 pages) (readplace.com)

1 point by fagnerbrack an hour ago | 1 comment

fagnerbrack an hour ago
For comparison: Claude only uses OCR for the first 100 pages, then falls back to text-only extraction. Here it's public URL in, HTML page out, with AI throughout for up to 300 pages (spartaaaaa!).
Conveniently, 300 pages is also roughly where the cost math stops working for a free tool. Scanned PDFs get best-effort OCR. Tables that span multiple pages are still a weak spot.
Here's a link you can check:
https://people.math.harvard.edu/~ctm/home/text/others/shanno...
Feel free to try it with your own PDF links to see what breaks. It will help improve the crawl logic and the parser (I still need to set up some rate limits).
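
To make the page-limit comparison concrete, here is a minimal sketch of the two policies described above. It is not the tool's actual code: the function names, the "ocr"/"text" labels, and the choice to reject documents over 300 pages are all assumptions for illustration. Only the 100 and 300 figures come from the comment.

```python
# Assumed limits, taken from the figures quoted in the comment above.
OCR_EVERY_PAGE_LIMIT = 300  # OCR on every page, up to 300 pages
OCR_PREFIX_LIMIT = 100      # comparison case: OCR for the first 100 pages only


def plan_full_ocr(page_count: int, limit: int = OCR_EVERY_PAGE_LIMIT) -> list[str]:
    """Hypothetical policy: OCR every page, reject documents over the limit."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > limit:
        # Assumption: over-limit documents are rejected rather than truncated.
        raise ValueError(f"document has {page_count} pages, limit is {limit}")
    return ["ocr"] * page_count


def plan_prefix_ocr(page_count: int, ocr_pages: int = OCR_PREFIX_LIMIT) -> list[str]:
    """Hypothetical policy: OCR the first ocr_pages pages, text-only after that."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return ["ocr" if i < ocr_pages else "text" for i in range(page_count)]


if __name__ == "__main__":
    pages = 250
    full = plan_full_ocr(pages)
    prefix = plan_prefix_ocr(pages)
    print("full OCR pages:", full.count("ocr"))
    print("prefix OCR pages:", prefix.count("ocr"), "text-only pages:", prefix.count("text"))
```

For a 250-page scanned PDF, the first policy marks every page for OCR, while the second leaves the last 150 pages as text-only, which yields nothing useful if those pages are images.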