4 points by ashwathstephen 3 hours ago | 1 comment
  • ashwathstephen 2 hours ago
    Hi, I'm the author. Some context on why I built this:

    I manage ~160 TB of Apache Iceberg table data across multiple S3-compatible backends (Leaseweb object storage, not AWS). The AWS console and mc CLI were the only options for browsing, and both are painfully slow for large buckets — 14 seconds to search in the console, 3 minutes to enumerate with mc.

    The core idea is simple: a background crawler indexes every object key into SQLite FTS5 (about 1,300 objects/sec), and then search is just a local full-text query. No external database needed — each bucket gets its own SQLite file in WAL mode.
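    A minimal sketch of that scheme using Python's stdlib sqlite3 (the table and column names here are illustrative, not the tool's actual schema; this demo uses an in-memory database, whereas the real tool keeps one on-disk SQLite file per bucket in WAL mode):

```python
import sqlite3

# Per-bucket index: object keys go into an FTS5 virtual table, so search
# becomes a local full-text query with no S3 round trip. In the real tool
# this would be an on-disk file opened with PRAGMA journal_mode=WAL.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE objects USING fts5(key)")

# The background crawler would batch-insert keys as it pages through
# the bucket listing (ListObjectsV2). Sample keys for illustration:
keys = [
    "warehouse/db/events/data/part-00001.parquet",
    "warehouse/db/events/metadata/v3.metadata.json",
    "warehouse/db/users/data/part-00042.parquet",
]
con.executemany("INSERT INTO objects(key) VALUES (?)", [(k,) for k in keys])
con.commit()

# FTS5's default tokenizer splits keys on '/' and '.', so path segments
# are searchable terms.
rows = con.execute(
    "SELECT key FROM objects WHERE objects MATCH ?", ("metadata",)
).fetchall()
print(rows)  # -> [('warehouse/db/events/metadata/v3.metadata.json',)]
```

    Batching inserts inside a single transaction is what makes rates like ~1,300 objects/sec feasible in SQLite; per-row commits would be far slower.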

    A few things I'm particularly happy with:

    - Parquet/ORC/Avro schema preview without downloading the file (reads just the footer bytes via range requests)
    - Version scanner that finds hidden delete markers and ghost objects that the S3 API doesn't surface in normal listings
    - Works the same across AWS, MinIO, R2, Wasabi, B2, Ceph — tested against all of them
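    To illustrate the footer trick for the Parquet case: a Parquet file ends with a 4-byte little-endian length of its Thrift FileMetaData block followed by the magic bytes `PAR1`, so two small ranged reads are enough to locate and fetch the schema. The `fetch_range` callable below is a stand-in for whatever ranged GET your client provides (e.g. boto3's `get_object` with a `Range` header); the in-memory "object" is fabricated for the demo:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def parquet_footer_range(fetch_range, file_size):
    """Return (offset, length) of the Thrift FileMetaData block.

    fetch_range(offset, length) -> bytes is a ranged read against the object.
    """
    tail = fetch_range(file_size - 8, 8)        # last 8 bytes: length + magic
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file")
    meta_len = struct.unpack("<I", tail[:4])[0]
    return file_size - 8 - meta_len, meta_len   # a second ranged read fetches this

# Fake object: 100 bytes of "data", an 8-byte stand-in metadata block,
# then the 4-byte length and the magic trailer.
fake = b"\x00" * 100 + b"METADATA" + struct.pack("<I", 8) + PARQUET_MAGIC
offset, length = parquet_footer_range(lambda off, n: fake[off:off + n], len(fake))
print(offset, length)  # -> 100 8
```

    The same two-read pattern works for ORC (postscript at the tail) and Avro (schema in the header, so the ranged read comes from the front instead); in each case you transfer kilobytes rather than the whole file.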

    What I'm still figuring out: how to handle buckets with 10M+ objects efficiently. The current crawler works well up to ~500K objects, but I'd love ideas on scaling the indexing beyond that.

    Happy to answer questions about the architecture or S3 provider quirks.