I manage ~160 TB of Apache Iceberg table data across multiple S3-compatible backends (Leaseweb object storage, not AWS). The AWS console and the mc CLI were my only options for browsing, and both are painfully slow for large buckets: 14 seconds per search in the console, 3 minutes to enumerate a bucket with mc.
The core idea is simple: a background crawler indexes every object key into SQLite FTS5 (about 1,300 objects/sec), and search becomes a local full-text query. No external database is needed; each bucket gets its own SQLite file in WAL mode.
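The per-bucket index can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual schema; the table and column names are placeholders, and it assumes your SQLite build ships with FTS5 (standard in recent CPython builds).

```python
import sqlite3

def build_index(db_path, keys):
    # One SQLite file per bucket, WAL mode so readers don't block the crawler.
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS objects USING fts5(key)")
    with conn:  # one transaction per batch keeps inserts fast
        conn.executemany("INSERT INTO objects(key) VALUES (?)",
                         ((k,) for k in keys))
    return conn

def search(conn, query):
    # Search is a local full-text query -- no S3 round trip.
    return [row[0] for row in conn.execute(
        "SELECT key FROM objects WHERE objects MATCH ? ORDER BY rank",
        (query,))]
```

A nice side effect of FTS5's default unicode61 tokenizer is that it splits keys on `/`, `.`, and `-`, so `search(conn, "parquet")` matches `data/events/part-0001.parquet` without any extra path handling.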
A few things I'm particularly happy with:

- Parquet/ORC/Avro schema preview without downloading the file (reads just the footer bytes via range requests)
- Version scanner that finds hidden delete markers and ghost objects that the S3 API doesn't surface in normal listings
- Works the same across AWS, MinIO, R2, Wasabi, B2, Ceph (tested against all of them)
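The footer-only Parquet read works because a Parquet file always ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so two suffix-range GETs are enough to pull the metadata without touching the data pages. A hedged sketch, where `fetch_range(n)` stands in for an S3 GET with a `Range: bytes=-n` header (e.g. via boto3):

```python
import struct

def parquet_footer(fetch_range):
    # First ranged GET: the 8-byte tail (footer length + "PAR1" magic).
    tail = fetch_range(8)
    if tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    # Second ranged GET: the footer itself, ending just before the tail.
    blob = fetch_range(footer_len + 8)
    return blob[:footer_len]  # Thrift-encoded FileMetaData, ready for a parser
```

The returned bytes are the Thrift-serialized FileMetaData, which a Parquet library can decode into the schema; ORC and Avro have analogous tail/header structures that allow the same trick.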
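The version scanner relies on S3's ListObjectVersions call, which returns DeleteMarkers that a normal ListObjectsV2 listing never shows. A sketch of the core loop, written against a boto3-style client (the pagination shape below mirrors boto3's `get_paginator("list_object_versions")`; function and variable names are mine, not the tool's):

```python
def find_hidden_deletes(client, bucket):
    # Keys whose latest "version" is a delete marker: invisible in normal
    # listings, but older versions still exist and still bill storage.
    hidden = []
    paginator = client.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        for marker in page.get("DeleteMarkers", []):
            if marker.get("IsLatest"):
                hidden.append((marker["Key"], marker["VersionId"]))
    return hidden
```

Provider quirks show up exactly here: some S3-compatible backends paginate versions differently or omit fields, which is why testing against MinIO, R2, Wasabi, B2, and Ceph separately matters.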
What I'm still figuring out: how to handle buckets with 10M+ objects efficiently. The current crawler works well up to ~500K objects, but I'd love ideas on scaling the indexing beyond that.
Happy to answer questions about the architecture or S3 provider quirks.