83 points by DuffJohnson 2 hours ago | 9 comments
  • waynenilsen an hour ago
    > Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.

    hopefully someone is independently archiving all documents

    my understanding is that some are being removed

    • some_random 25 minutes ago
      Are they being removed or replaced with more heavily redacted documents? There were definitely some victim names that slipped through the cracks that have since been redacted.
      • VeninVidiaVicii 13 minutes ago
        EFTA01660679 was completely removed. It was a table with witness tips against Donald Trump.
    • embedding-shape an hour ago
      Initially under "Epstein Files Transparency Act (H.R.4405)" on https://www.justice.gov/epstein/doj-disclosures, all datasets had .zip links. I first saw that page when all but dataset 11 (or 10) had a .zip link. At one point this morning, all the .zip links were removed; now it seems like most are back again.
    • littlecorner 18 minutes ago
      I think some of the released documents included images of victims, which were redacted. So it's not necessarily malicious removals.
  • originalvichy 42 minutes ago
    Any guesses why some of the newest files seem to have random ”=” characters in the text? My first thought was OCR, but it seemed to not be linked to characters like ”E” that could be mistakenly interpreted by an OCR tool. My second guess is just making it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart in case a ”=” character is found (although making this work for long search queries would make the search slower).
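A minimal sketch of the kind of tolerant search described above, assuming a stray "=" can appear between any two characters of a word. The names `tolerant_pattern` and `find_all` are hypothetical and not taken from any tool mentioned in the thread.

```python
import re

def tolerant_pattern(query: str) -> re.Pattern:
    # Allow an optional "=" between every pair of characters in the query.
    # re.escape keeps punctuation in the query from being read as regex syntax.
    parts = [re.escape(ch) for ch in query]
    return re.compile("=?".join(parts), re.IGNORECASE)

def find_all(query: str, text: str) -> list[str]:
    # Return every matching substring, with or without stray "=" characters.
    return [m.group(0) for m in tolerant_pattern(query).finditer(text)]

hits = find_all("witness", "A wit=ness tip; another Witness tip")
print(hits)  # ['wit=ness', 'Witness']
```

The pattern grows with the query length, which is consistent with the concern about long queries being slower. Stripping "=" from the text before indexing would be a simpler alternative, at the cost of also removing legitimate equals signs.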
    • torh 41 minutes ago
      Was on the frontpage yesterday: https://news.ycombinator.com/item?id=46868759
    • ripe 31 minutes ago
      The equal characters are due to poor handling of quoted-printable in email.

      The author of gnus, Lars Ingebrigtsen, wrote a blog post explaining this. His post was on the HN front page today.
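For readers unfamiliar with the encoding, here is a small illustration using Python's standard `quopri` module. In quoted-printable, "=" at the end of a line marks a soft line break and "=3D" encodes a literal "=". The `naive_unfold` function is a hypothetical buggy step written for this example; it is an assumption about how such artifacts could arise, not a reconstruction of how the released files were actually processed.

```python
import quopri

def decode_qp(raw: bytes) -> str:
    # Correct handling: soft line breaks ("=" + newline) are removed
    # and "=XX" escapes are turned back into bytes.
    return quopri.decodestring(raw).decode("utf-8", errors="replace")

def naive_unfold(raw: bytes) -> bytes:
    # Hypothetical buggy handling: joins wrapped lines but leaves
    # the "=" marker and the "=XX" escapes in place.
    return raw.replace(b"\r\n", b"").replace(b"\n", b"")

sample = b"The quick brown fox jumps over the la=\nzy dog; 1 + 1 =3D 2"

print(decode_qp(sample))              # The quick brown fox jumps over the lazy dog; 1 + 1 = 2
print(naive_unfold(sample).decode())  # The quick brown fox jumps over the la=zy dog; 1 + 1 =3D 2
```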

      • originalvichy 20 minutes ago
        He explained the newline thing that confused me. Good read!
  • embedding-shape an hour ago
    Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the results with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
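A rough sketch of how two OCR outputs for the same page could be compared, using only the standard library's `difflib`. The page dictionary layout, the 0.9 threshold, and the function names are assumptions made for illustration; the comment does not say how the comparison is actually done, and no olmocr API is used here.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so layout differences do not count.
    return " ".join(text.lower().split())

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the normalized texts are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def flag_mismatches(pages: dict[str, tuple[str, str]], threshold: float = 0.9) -> list[str]:
    # pages maps a page id to (provided_ocr_text, model_ocr_text).
    return sorted(
        page_id
        for page_id, (provided, model) in pages.items()
        if similarity(provided, model) < threshold
    )

pages = {
    "p1": ("Hello world", "hello   world"),
    "p2": ("xq#@ zz!! 9", "Meeting notes from Tuesday"),
}
print(flag_mismatches(pages))  # ['p2']
```

With roughly 500K pages, a character-level ratio like this may be slow; a cheaper first pass (for example comparing word sets) could narrow down the pages worth a full diff.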
    • originalvichy 39 minutes ago
      Did you take any steps to reduce the dimensions of the images, in case that improves performance? I have not tried this, as I have not performed an OCR task like this with an LLM. I would be interested to know at what size the VLM can no longer make out the details in text reliably.
      • embedding-shape 37 minutes ago
        The performance is OK; it takes a couple of seconds at most on my GPU. It's just the number of documents to get through that takes time, even with parallelism. The dimensions seem fine as they are, as far as I can tell.
    • helterskelter 37 minutes ago
      [flagged]
      • embedding-shape 36 minutes ago
        Haven't seen anything in particular about that, but lots of the documents with half-redacted names contain OCR'd text that is completely garbled, yet olmocr-2-7b seems to handle it just fine. Unsure if they just had sucky processes or if there is something else going on.
        • helterskelter 32 minutes ago
          Might be a good fit for uploading a git repo and crowdsourcing
  • _def 11 minutes ago
    I can't even download the archive; the transmission always terminates just before it's finished. Spooky.
  • bugeats 36 minutes ago
    Somebody ought to train an LLM exclusively on this text, just for funsies.
    • pc86 21 minutes ago
      DeepSeek-V4-JEE
  • corygarms an hour ago
    These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.
  • nkozyra 43 minutes ago
    > DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata

    Maybe I'm underestimating the issue at full, but isn't this a very lightweight problem to solve? Is converting the images to lower DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of and doing this for decades at this point, right?
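To make the "just strip the metadata" option concrete, here is a simplified sketch that drops the JPEG segments usually carrying EXIF/XMP (APP1), IPTC (APP13), and comments (COM), using only the standard library. The function name and the choice of which segments to drop are assumptions for illustration. It does not handle every JPEG variant, and it does nothing about information carried in the pixel data itself.

```python
import struct

# Markers dropped in this sketch: APP1 (EXIF/XMP), APP13 (IPTC/Photoshop), COM.
STRIP_MARKERS = {0xE1, 0xED, 0xFE}

def strip_jpeg_metadata(data: bytes) -> bytes:
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a segment marker")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything after this is image data, copy verbatim.
            out += data[pos:]
            return bytes(out)
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker not in STRIP_MARKERS:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")
```

A typical use would be `clean = strip_jpeg_metadata(open(path, "rb").read())`, where `path` is a hypothetical file name. Re-rendering to a bitmap, as the quoted article describes, avoids having to enumerate every metadata container, which may be one reason to prefer it.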

    • originalvichy 34 minutes ago
      Maybe they know more than we do. It may be possible to tamper with files at a deeper level. I wonder if it is also possible to use some sort of tampered compression algorithm that could mark images much like printers do with paper.

      Another guess is that perhaps the step is part of a multi-step sanitization process, and the last step(s) perform the bitmap operation.

      • normalaccess 18 minutes ago
        I'm not sure about computer image generation but you can (relatively) easily fingerprint images generated by digital cameras due to sensor defects. I'll bet there is a similar problem with PC image generation where even without the EXIF data there is probably still too much side channel data leakage.
  • meidan_y 2 hours ago
    (2025) just follow hn guideline, impressive voter ring though
    • alain94040 2 hours ago
      We're in early February 2025 [edit:2026] and the article was written on Dec 23, 2025, which makes it less than two months old. I think it's ok not to include a year in the submission title in that case.

      I personally understand a year in the submission as a warning that the article may not be up to date.

      • petepete 2 hours ago
        We're in Feb 2026.

        I'm not used to typing it yet, either.

      • embedding-shape an hour ago
        Less about the age, and more about confusing what they are analyzing, for the files that were just released like a week ago.
      • GlitchRider47 2 hours ago
        Generally, I'd agree with you. However, the recent Epstein file dump was in 2026, not 2025, so I would say it is relevant in this case.
      • michaelmcdonald 2 hours ago
        "We're in early February ~2025~ *2026*"
  • tibbon 2 hours ago
    That's a lot of PeDoFiles!

    (But seriously, great work here!)

    • ted_bunny a few seconds ago
      Elite PDF File ring