I also wrote an extraction tool that:
extracts all pictures to files named after the PDF they came from
OCRs each JPG and, if it finds more than 8 characters, writes the text to a .txt file
leaves the original files in place should you need to revisit them
makes the dump searchable locally.
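
Roughly, the steps above could look like this in Python. This is a simplified sketch, not the actual tool: the names `image_name`, `save_ocr_text`, and `search` are made up for illustration, the file-name pattern is an assumption, and the OCR engine is passed in as a plain function (something like `pytesseract.image_to_string` could be wrapped to fit, but that is also an assumption).

```python
from pathlib import Path
from typing import Callable, List, Optional

MIN_CHARS = 8  # from the post: only keep text longer than 8 characters


def image_name(pdf_path: Path, page: int, index: int, ext: str = "jpg") -> str:
    """Build a file name that records which PDF and page an image came from.

    The naming pattern is hypothetical; any scheme that keeps the PDF stem works.
    """
    return f"{pdf_path.stem}_p{page:04d}_img{index:03d}.{ext}"


def save_ocr_text(image_path: Path, ocr: Callable[[Path], str]) -> Optional[Path]:
    """Run OCR on one image and write a .txt next to it if enough text was found.

    `ocr` is any function that takes an image path and returns a string.
    The image itself is never modified or deleted.
    """
    text = ocr(image_path).strip()
    if len(text) <= MIN_CHARS:
        return None  # too little text to be worth keeping
    txt_path = image_path.with_suffix(".txt")
    txt_path.write_text(text, encoding="utf-8")
    return txt_path


def search(dump_dir: Path, term: str) -> List[Path]:
    """Case-insensitive search over every extracted .txt file in the dump."""
    needle = term.lower()
    return sorted(
        p
        for p in dump_dir.rglob("*.txt")
        if needle in p.read_text(encoding="utf-8", errors="ignore").lower()
    )
```

Because each .txt sits next to its image and the image name carries the PDF name, a search hit points straight back to the source document.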
I'll link the repo if anyone from the media is interested but didn't have the manpower to do this manually.