2 points by rllearner 7 hours ago | 1 comment
  • AlbertoGP 6 hours ago
    I might be missing something, but parsing the HTML, even with the different formats, should be much simpler than the PDF form.

    Over 20 years I'd guess they used no more than 20 formats, which is doable even writing XPath (perhaps CSS selectors would suffice) by hand.
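    A minimal sketch of what hand-written per-format extraction could look like, using only the Python standard library. Everything here (the class name, the sample markup) is invented for illustration; a real parser would be keyed to each filing's actual structure:

    ```python
    # Hypothetical sketch: pull cell text out of HTML tables with the
    # stdlib html.parser, one such routine per known filing format.
    from html.parser import HTMLParser

    class TableRows(HTMLParser):
        """Collect the text of every <td>/<th> cell, grouped by <tr> row."""
        def __init__(self):
            super().__init__()
            self.rows, self._row, self._in_cell = [], None, False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th"):
                self._in_cell = True
                self._row.append("")

        def handle_endtag(self, tag):
            if tag == "tr" and self._row:
                self.rows.append(self._row)
                self._row = None
            elif tag in ("td", "th"):
                self._in_cell = False

        def handle_data(self, data):
            if self._in_cell and self._row:
                self._row[-1] += data.strip()

    def extract_rows(doc: str):
        p = TableRows()
        p.feed(doc)
        return p.rows

    doc = ("<table><tr><th>Holding</th><th>Shares</th></tr>"
           "<tr><td>Acme</td><td>1,200</td></tr></table>")
    print(extract_rows(doc))  # [['Holding', 'Shares'], ['Acme', '1,200']]
    ```

    With ~20 of these (or equivalent XPath/CSS selector expressions via lxml), a dispatch table keyed on filing year would cover the whole archive.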

    Do you mean that the mutual fund complex includes many funds and you get as many different formats for the same time period?

    • rllearner 6 hours ago
      Thanks for the response!

      For sure I could write heuristics for parsing each format. I was kind of hoping that ML algorithms had advanced to the state where they could handle messy tables in documents. (By the way, if they have, that could be big for the companies with good structuring models. Financial data is unbelievably expensive, and a lot of it is publicly available but badly organized, so structuring companies could conceivably eat those markets as just one application of their tools. Starting with cheap stuff for hobbyists/students who can't afford the commercial solutions.)

      The complex includes 20 or so funds, so each file includes a "hot spot" with the data I'd like to extract. Within a filing the holdings tables all look the same; the format of the document changes from year to year. Unfortunately the tables aren't really formatted as tables in the HTML, so I thought rendering to PDF and passing off to an LLM might be the best thing to do. I posted links to a few examples below.
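      Before reaching for the PDF/LLM route, one cheap heuristic worth trying when the "tables" are really just whitespace-aligned text: split each line on runs of two or more spaces. The sample text below is invented, not from an actual EDGAR filing:

      ```python
      # Hedged sketch: recover columns from whitespace-aligned holdings text
      # by splitting on runs of 2+ spaces. Lines that yield only one cell
      # (headers, prose) are dropped. Sample data is made up.
      import re

      def split_columns(text: str):
          rows = []
          for line in text.splitlines():
              cells = re.split(r"\s{2,}", line.strip())
              if len(cells) > 1:  # keep only lines that look tabular
                  rows.append(cells)
          return rows

      sample = """
      Example Fund Holdings
      Common Stocks - 98.2%
      Acme Corp           12,000     1,234,567
      Widget Inc           3,400       456,789
      """
      print(split_columns(sample))
      # [['Acme Corp', '12,000', '1,234,567'], ['Widget Inc', '3,400', '456,789']]
      ```

      It won't survive cells that themselves contain double spaces, but for fixed-width filing layouts it often gets you most of the way without any ML.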

      https://www.sec.gov/Archives/edgar/data/36405/00011046592508...

      https://www.sec.gov/Archives/edgar/data/36405/00009324710500...