1 point by frumu 4 hours ago | 1 comment
  • frumu 4 hours ago
    I ran into Cloudflare’s Markdown for Agents and thought it was exactly what I needed for LLM web research. Then I realized it only helps when a site is on Cloudflare and has it enabled, so it doesn’t solve “open web” extraction.

    I built a simple HTML→Markdown pipeline in Rust that works on any public URL (strip scripts/styles/boilerplate, preserve structure + links). On a 100-URL set it reduced input size by ~70–80% (often close to 80%).
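
To make the "strip scripts/styles" step and the size-reduction figure concrete, here is a minimal std-only Rust sketch. It is not the code from the repo: `strip_element`, `strip_boilerplate`, and `reduction_percent` are hypothetical names, and the naive text scan stands in for whatever HTML parser the real pipeline uses. It only shows the idea of dropping `<script>`/`<style>` elements while leaving structure and links untouched.

```rust
/// Remove every `<tag ...>...</tag>` element from `html`, case-insensitively.
/// Hypothetical helper: a naive text scan, not a real HTML parser.
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;

    while let Some(rel) = lower[pos..].find(open.as_str()) {
        let start = pos + rel;
        let name_end = start + open.len();
        // The tag name must end here, so "<styled>" is not mistaken for "<style>".
        let boundary = matches!(
            lower.as_bytes().get(name_end).copied(),
            Some(b'>') | Some(b' ') | Some(b'\t') | Some(b'\n') | Some(b'\r') | Some(b'/')
        );
        if !boundary {
            // Not our tag: keep the text and continue scanning after it.
            out.push_str(&html[pos..name_end]);
            pos = name_end;
            continue;
        }
        // Keep everything before the element, then skip past its closing tag.
        out.push_str(&html[pos..start]);
        match lower[start..].find(close.as_str()) {
            Some(end_rel) => pos = start + end_rel + close.len(),
            None => {
                // Unclosed element: drop the remainder of the input.
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Strip the two element types named in the post (assumed minimal set).
fn strip_boilerplate(html: &str) -> String {
    strip_element(&strip_element(html, "script"), "style")
}

/// Size reduction in percent, given byte counts before and after conversion.
fn reduction_percent(before: usize, after: usize) -> f64 {
    if before == 0 {
        return 0.0;
    }
    100.0 * (before as f64 - after as f64) / before as f64
}
```

Calling `reduction_percent(html.len(), strip_boilerplate(html).len())` gives the kind of per-page number that could be aggregated over a URL set. A production pipeline would want a proper tokenizer, since a scan like this can be fooled by `</script>` appearing inside a string literal or comment.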

    Benchmark on the same 100 URLs:

    Rust server mode: p50 ~0.4s, p95 ~1.3s, memory ~100MB stable

    Node baseline (JSDOM + Turndown): p50 ~1.2s, p95 ~50s, memory grew from hundreds of MB into the GB range
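
For readers comparing numbers, here is one way p50/p95 could be computed from per-URL latencies. This is a sketch using the nearest-rank definition; the benchmark scripts in the repo may use a different definition (for example linear interpolation), and `percentile` is a hypothetical helper name.

```rust
/// Nearest-rank percentile of `samples` for `p` in 0..=100.
/// Returns None for empty input or an out-of-range (or NaN) `p`.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: ceil(p/100 * n), clamped to at least 1, then made 0-based.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

With only 100 URLs, the nearest-rank p95 is the 95th slowest-sorted sample, so a handful of pathological pages is enough to move it by an order of magnitude while p50 barely changes.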

    Scripts + methodology are in the repo: <link>

    Curious what others use for boilerplate removal and how you keep p95 tails under control when parsing nasty pages.