3 points by jfyi 2 hours ago | 3 comments
  • jfyi 2 hours ago
    I'm not associated with the project. I just think they are doing amazing work as of the recent document drops.
  • u1hcw9nx 2 hours ago
    Just because the site says it is comprehensive does not mean it is. Multiple names that other databases find are not mentioned. Start from Joscha Bach...

    DOJ has more comprehensive search functionality.

    https://www.justice.gov/epstein/search

  • randlet 2 hours ago
    /r/epstein post from the creator:

    https://reddit.com/r/Epstein/comments/1r3joqr/i_mapped_every...

    -------

    A week ago I posted about an open database I’ve been building to cross reference Epstein case material. That post did way better than I expected (568k views, 4.6k upvotes) and it hugged my server to death twice.

    Since then I basically did nothing but ingest, clean, and index more data. The database is now big enough that “just read the docs” is not advice, it’s a cry for help.

    What it was last week

        ~6,000 documents
        1,708 flights
        2,700 emails
        1,438 people
    
    What it is now

        1,522,060 documents (all DOJ releases we have access to so far), full text searchable
        1,708 flights (1997 to 2019) with manifests where available
        10,000+ emails indexed with threading
        1,350 people (cleaned: removed duplicates + nuked a bunch of false connections)
        638,000 docs run through redaction analysis
            ~1.8M individual redactions detected
            ~616k flagged by our tooling as “looks questionable, take a closer look”
            ~39,500 pages of text recovered from under black bars (you can see examples on the site)
        107,000 named entities pulled out via NLP (people, orgs, places, dates)
        1,530 audio/video transcripts
        4,300+ photos/media (raid photos, exhibits, property shots, government releases)
    
    That’s not a typo: 1.5 million documents. If you search a phrase, it searches inside the actual pages (OCR where needed) and email bodies, not just titles.
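
    For anyone curious what "searches inside the actual pages" can look like in practice, here is a minimal sketch of a positional inverted index with exact-phrase lookup. It is an illustration only, not the site's implementation: `tokenize`, `build_index`, `phrase_search`, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional index: token -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the phrase as consecutive tokens."""
    tokens = tokenize(phrase)
    if not tokens:
        return []
    hits = []
    # Only documents containing the first token can match.
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            # Every following token must sit at the next position.
            if all(
                start + i in index.get(tok, {}).get(doc_id, ())
                for i, tok in enumerate(tokens)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical sample data: page text (OCR output where needed) keyed by id.
docs = {
    "a": "Flight manifest for March 1998",
    "b": "The manifest was redacted; flight logs missing",
    "c": "flight manifest",
}
index = build_index(docs)
matches = phrase_search(index, "flight manifest")
```

    A real deployment at 1.5M documents would more likely lean on a dedicated search engine, but the idea of indexing body text rather than titles is the same.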

    So what changed, besides “everything is bigger”?

    1) The redaction stuff is getting hard to ignore

    I’m not saying “every redaction is evil.” Some of them obviously protect victims, minors, addresses, etc. But the patterns are weird, and the volume is insane.

    I also worked with u/Sea_Doughnut_8853, who independently processed 519k PDFs with their own pipeline. That let us sanity check a lot of what we’re seeing across the corpus.

    We’re flagging ~616k redactions as “potentially improper” based on patterns (context, repetition, surrounding text). That does not mean “definitely corrupt.” It means “this is the pile worth human eyes.”

    We also recovered a lot of hidden text. If you want to judge it yourself, the doc pages show the redaction density and any recovered text we can reliably extract.

    2) Entity extraction is the only way to deal with this scale

    107,000 entities means you can stop playing whack a mole with PDFs. It’s still not “truth,” it’s just structure. But structure beats drowning.

    3) This week’s real world developments are in there too

    If you missed the news cycle, Congress has been pressuring DOJ about redactions, and Rep. Ro Khanna read six previously redacted names on the House floor:

        Leslie Wexner
        Salvatore Nuara
        Zurab Mikeladze
        Leonic Leonov
        Nicola Caputo
        Sultan Ahmed bin Sulayem
    
    Important caveat: being named in a document is not proof of wrongdoing. People show up in emails, contact lists, forwarded threads, or because someone mentioned them.

    Related:

        Reporting says Wexner’s name appeared in an internal FBI document as “co conspirator,” but he has not been charged.
        Maxwell invoked the Fifth in a House Oversight deposition and her lawyer floated testimony in exchange for clemency.
        House Oversight depositions are scheduled: Wexner (Feb 18), Richard Kahn (Feb 25), Darren Indyke (Mar 5), plus Hillary Clinton (Feb 26) and Bill Clinton (Feb 27).
    
    All of those items are indexed, with the underlying documents linked where available.

    New tools since last week

        Full text search: search inside 1.5M documents, 28k OCR entries, and 10k emails
        AI research assistant: ask a question in plain English, get an answer with citations back to the source docs so you can verify it yourself
        Degrees of separation: shortest documented path between two people, with the supporting flights/docs shown at each hop
        Redaction analysis on every doc page: how heavy, what got flagged, what got recovered
        Investigation Dossiers (new today): community made evidence boards
            pin any person/doc/flight/email
            add notes
            upvotes + comments
            “community notes” style fact checks
            sorting like hot/new/top
            I put up 14 starter dossiers so it’s not an empty ghost town
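
    The "degrees of separation" tool above is described as a shortest documented path with supporting evidence at each hop. A minimal sketch of that idea, assuming an undirected graph where every edge carries a source id, is a breadth-first search that remembers which record justified each step. The function name, the placeholder people, and the evidence ids below are hypothetical, not taken from the site.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph of documented links.

    edges: iterable of (person_a, person_b, evidence_id) tuples.
    Returns a list of (from, to, evidence_id) hops, [] if start == goal,
    or None when no documented path exists.
    """
    graph = {}
    for a, b, evidence in edges:
        graph.setdefault(a, []).append((b, evidence))
        graph.setdefault(b, []).append((a, evidence))

    if start == goal:
        return []

    # parent[node] = (previous node, evidence for that hop)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, evidence in graph.get(node, []):
            if neighbor in parent:
                continue  # already reached by a path at least as short
            parent[neighbor] = (node, evidence)
            if neighbor == goal:
                # Walk back from the goal to rebuild the hops in order.
                hops = []
                cur = goal
                while parent[cur] is not None:
                    prev, ev = parent[cur]
                    hops.append((prev, cur, ev))
                    cur = prev
                return hops[::-1]
            queue.append(neighbor)
    return None


# Placeholder data: each link is backed by a (made-up) flight, email, or doc id.
edges = [
    ("Person A", "Person B", "flight-001"),
    ("Person B", "Person C", "email-042"),
    ("Person A", "Person D", "doc-7"),
    ("Person D", "Person E", "doc-8"),
    ("Person E", "Person C", "doc-9"),
]
path = shortest_path(edges, "Person A", "Person C")
```

    Keeping the evidence id on each hop is what lets a reader check the link against the source instead of trusting the graph.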
    
    What still bugs me

    The government didn’t just withhold whole documents. In a lot of places, it looks like they blacked out specific names or transactions inside documents they did release. Maybe there are legit reasons for some of it. But at this volume, it needs scrutiny.

    Also, the 2013 to 2019 passenger manifest gap is still a thing in the public record. Tons of flights, but not the corresponding names.

    The database

    Everything is at EpsteinExposed.com. Free. No ads. No paywall. You can browse without logging in. Accounts are only for making dossiers and posting notes.

    There’s also a community forum for collab research: https://board.epsteinexposed.com

    If you find errors, call them out. If you want a specific thread turned into a dossier, say the name and I’ll help you get it set up.

    TL;DR

    The database went from ~6k docs to 1.5M in a week. Full text searchable. We ran redaction analysis at scale, flagged a huge pile for human review, recovered a lot of hidden text, and the current Congress/DOJ redaction fight is now fully indexed in the same place.

    Update:

    I went to sleep thinking this would be a normal update post and woke up to it hitting r/popular / r/all.

    Thank you. Seriously.

    In ~4 hours this hit ~750k views and people have already donated ~$800. That is wild, and it genuinely helps keep the lights on while I keep ingesting and cleaning data. Everything goes toward making the site better!

    A quick housekeeping thing because it needs to be said on posts like this:

    Being named in a document is not proof of wrongdoing. People show up in emails, contact lists, forwarded threads, or because someone mentioned them.

    Please don’t dox, harass, or post “I found their address” type stuff. If you want this taken seriously by journalists and agencies, it has to stay clean and source-based.

    If you spot bad OCR, duplicates, broken links, or a false connection, call it out. That kind of boring cleanup work is how this gets stronger.

    If you want to help, the best thing is still commenting and sharing. Second best is reporting errors or building a dossier on a specific thread so the research is organized and verifiable.

    Also, small but important technical update: Semantic / Smart search is going live soon. Keyword search is great, but it misses anything that is phrased differently. Smart search uses a hybrid approach so you can search by meaning, not just exact words. It’s already wired up; I’m generating the embeddings now and will seed them into the database next.
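
    The post does not say how the hybrid approach merges keyword and embedding results. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of the site's code. The function name, the constant `k=60`, and the document ids are illustrative only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both keyword and semantic search rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers, best match first.
keyword_hits = ["d1", "d2", "d3"]
semantic_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

    RRF only needs ranks, not comparable scores, which makes it a simple way to combine an exact-match index with embedding similarity.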