93 points by jandrewrogers 5 days ago | 2 comments
  • eisa01 a day ago
    A lot of upvotes, but no discussion :)

    How would you approach migrating a ~5 TB OLTP database that fundamentally contains analytical time-series data? I’d think something like Apache Iceberg could be a better data store, and it would make writing much easier (almost just dump a Parquet file in there; rough sketch at the end of this comment)

    It’s exposed to the outside world via APIs
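
    Roughly the write path I have in mind (a sketch assuming PyIceberg against some catalog; the catalog, namespace and table names are made up):

        import pyarrow.parquet as pq
        from pyiceberg.catalog import load_catalog

        # Catalog connection details come from ~/.pyiceberg.yaml or environment variables
        catalog = load_catalog("default")

        # A batch exported from the OLTP database as Parquet
        batch = pq.read_table("exports/measurements_2024_06.parquet")

        # Append the batch to an existing Iceberg table; Iceberg takes care of
        # file layout, snapshots and schema checks
        table = catalog.load_table("timeseries.measurements")
        table.append(batch)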

    • marklit a day ago
      That 5 TB of data will probably come to 300–400 GB in Parquet. Try to denormalise the data into a few datasets, or just one dataset if you can.
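
      A rough sketch of that flattening step with DuckDB's Python client, assuming the source is Postgres and using made-up table and column names:

          import duckdb

          con = duckdb.connect()

          # Read straight from the OLTP database via DuckDB's postgres extension
          con.sql("INSTALL postgres")
          con.sql("LOAD postgres")
          con.sql("ATTACH 'dbname=metrics host=10.0.0.5 user=readonly' AS oltp (TYPE postgres)")

          # Join the fact table with its dimensions into one wide, denormalised
          # dataset and write it out as a single Parquet file
          con.sql("""
              COPY (
                  SELECT m.ts, m.value, s.sensor_name, s.site, s.unit
                  FROM oltp.public.measurements m
                  JOIN oltp.public.sensors s USING (sensor_id)
                  ORDER BY m.ts  -- sorting keeps the row-group min/max stats tight
              ) TO 'measurements.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
          """)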

      DuckDB querying the data should be able to return results in milliseconds if the smaller columns are being used, and better still if the row-group stats can be used to answer queries.
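
      For example, a filter on the column the file is sorted on lets DuckDB skip entire row-groups based on their min/max statistics (column names are the same made-up ones as above):

          import duckdb

          # Only row-groups whose ts min/max overlap the predicate get read;
          # the rest are skipped using the Parquet row-group statistics
          duckdb.sql("""
              SELECT sensor_name, avg(value) AS avg_value
              FROM 'measurements.parquet'
              WHERE ts BETWEEN TIMESTAMP '2024-06-01' AND TIMESTAMP '2024-06-02'
              GROUP BY sensor_name
          """).show()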

      You can host those Parquet files on a local disk or S3. A local disk might be cheaper if this is exposed to the outside world, and it gives you a price ceiling on hosting.
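
      If you do go with S3, DuckDB can query the files in place through its httpfs extension; the bucket, region and credentials below are placeholders:

          import duckdb

          con = duckdb.connect()
          con.sql("INSTALL httpfs")
          con.sql("LOAD httpfs")

          # Placeholder credentials handed to the S3 reads via DuckDB's secrets manager
          con.sql("""
              CREATE SECRET s3_read (
                  TYPE S3,
                  KEY_ID 'AKIA...',
                  SECRET '...',
                  REGION 'eu-west-1'
              )
          """)

          con.sql("""
              SELECT count(*)
              FROM read_parquet('s3://example-bucket/measurements/*.parquet')
          """).show()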

      If you have a Parquet file with billions of records and thousands of row-groups, then hosting on something like Cloudflare, where there is a per-request charge, could get a bit expensive if this is a popular dataset. At a minimum, DuckDB will look at the stats for each row-group for any column involved in a query. It might be cheaper just to pay for 400 GB of storage with your hosting provider.
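
      You can check how many row-groups a file has (and so roughly how many ranged requests a cold remote query will trigger) from DuckDB itself:

          import duckdb

          # parquet_metadata() returns one row per column chunk per row-group,
          # so the distinct row_group_id count is the number of row-groups
          duckdb.sql("""
              SELECT count(DISTINCT row_group_id) AS row_groups
              FROM parquet_metadata('measurements.parquet')
          """).show()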

      There is a project that converts OSM to Parquet every week, and we had to look into some of these issues there: https://github.com/osmus/layercake/issues/22

  • victor106 2 days ago
    Great article.

    Any other resources that provide a comprehensive treatment of open table formats?