82 points by mustaphah 7 hours ago | 10 comments
  • n_u 30 minutes ago
    Cool! I'd love to know a bit more about the replication setup. I'm guessing they are doing async replication.

    > We added nearly 50 read replicas, while keeping replication lag near zero

    I wonder what those replication lag numbers are exactly and how they deal with stragglers. It seems likely that at any given moment at least one of the 50 read replicas is lagging because of a CPU/memory usage spike. Presumably that would then slow down the primary, since it has to wait for TCP acks before sending more of the WAL.
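
    For context, per-replica lag is visible from the primary; a minimal sketch of the standard query (pg_stat_replication exposes both a wall-clock replay_lag and the LSNs for computing lag in bytes):

        -- Replication lag per replica, as seen from the primary.
        SELECT application_name,
               replay_lag,  -- wall-clock lag
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication;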

  • winterrx an hour ago
    First OpenAI Engineering blog? I'm definitely interested in seeing more, and in how they handled the rapid growth.
  • everfrustrated 5 hours ago
    This is why I love Postgres. Just by throwing CPU and disk at it, it can get you to being one of the largest websites before you need to reconsider your architecture. At that point you can well afford to hire people who are deep experts in sharding etc.
    • zozbot234 3 hours ago
      PostgreSQL actually supports sharding out of the box: it's just a matter of setting up the right table partitioning and using a Foreign Data Wrapper (FDW) to forward queries to remote databases, as in the sketch below. I'm not sure what the post is referencing when they say that sharding requires leaving Postgres altogether.
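
      A minimal sketch of that setup, with made-up server and table names:

          -- Hash-partition a table and back one partition with a foreign
          -- table, so its queries are forwarded to another Postgres server.
          CREATE EXTENSION IF NOT EXISTS postgres_fdw;

          CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
              OPTIONS (host 'shard1.internal', dbname 'app');
          CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
              OPTIONS (user 'app');

          CREATE TABLE events (
              user_id bigint NOT NULL,
              payload jsonb
          ) PARTITION BY HASH (user_id);

          -- Remote shard: rows and queries for this bucket go to shard1,
          -- which needs a matching events_r0 table.
          CREATE FOREIGN TABLE events_r0 PARTITION OF events
              FOR VALUES WITH (MODULUS 2, REMAINDER 0)
              SERVER shard1 OPTIONS (table_name 'events_r0');

          -- Local shard for the other bucket.
          CREATE TABLE events_r1 PARTITION OF events
              FOR VALUES WITH (MODULUS 2, REMAINDER 1);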
      • dmix 2 hours ago
        This is specifically what they said about sharding:

        > The primary rationale is that sharding existing application workloads would be highly complex and time-consuming, requiring changes to hundreds of application endpoints and potentially taking months or even years

        • simonw an hour ago
          Genuinely sounds like the kind of challenge that could be solved with a swarm of Codex coding agents. I'm surprised they aren't treating this as an ideal use-case to show off their stack!
          • csto12 18 minutes ago
            I read your message, guessed the author, and I’m happy to announce I guessed correctly.
          • aisuxmorethanhn 4 minutes ago
            It wouldn’t work.
        • zozbot234 2 hours ago
          I know they said that, but in fact sharding is entirely a database-level concern. The application need not be aware of it at all.
          • EB66 2 hours ago
            Sharding can be made mostly transparent, but it's not purely a DB-level concern in practice. Once data is split across nodes, join patterns, cross-shard transactions, global uniqueness, hot keys that attract a lot of traffic, etc. all matter. Even if partitioning handles routing, the application's query patterns and its consistency/latency requirements can still force application-level changes.
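
            Global uniqueness is a concrete example: Postgres won't enforce a unique index across hash partitions unless it includes the partition key. A toy sketch (hypothetical table):

                CREATE TABLE users (
                    user_id bigint NOT NULL,
                    email   text NOT NULL
                ) PARTITION BY HASH (user_id);

                -- Fails: a unique constraint on a partitioned table must
                -- include all the partitioning columns.
                CREATE UNIQUE INDEX ON users (email);

                -- Allowed, but uniqueness of (email, user_id) no longer makes
                -- email globally unique, so the application must enforce that.
                CREATE UNIQUE INDEX ON users (email, user_id);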
  • dbuser99 an hour ago
    I don’t get it. This whole thing says: the single writer does not scale, so we stopped writing as much and moved reads away from it; now it works OK and we decided that’s enough. I guess that’s great.
  • bzmrgonz 4 hours ago
    Someone ask Microsoft what it feels like to be bested by an open source project on their very own cloud platform!!! Lol.
    • doodlesdev 5 minutes ago
      They don't care. Azure's revenue is higher than GCP's, trailing only AWS. It's Microsoft's new baby, and they love it, no matter what you want to run there. Also, they're still the 4th largest company by market cap.

      Honestly, only us nerds on Hacker News care about this kind of stuff :) (and that's why I love it here).

    • beoberha 32 minutes ago
      Are you saying this because OpenAI didn't choose SQL Server?
      • csto12 17 minutes ago
        In 2026, is SQL Server ever the answer?
    • DLA 2 hours ago
      And the same goes for Linux boxes on Azure - they dominate Windows servers by a huge margin.
    • esjeon 3 hours ago
      Azure offers a Postgres “DBaaS”, so I’m pretty sure they are nowhere near that stage. It’s more likely that we should watch out for Microsoft’s embrace-extend-extinguish (E-E-E) strategy.
  • ahmetozer 43 minutes ago
    AI-written blog; it's very generic and the same content is repeated many times.
  • QuiCasseRien 3 hours ago
    I like the way of thinking. Instead of migrating to another database, they kept that awesome one running and found smart workarounds to push its limits.
    • hahahahhaah an hour ago
      It is what mature engineering does. Migrations are not fun.
  • ed_mercer 5 hours ago
    Why does the [Azure PostgreSQL flexible server instance] link point to Chinese Azure?
    • noxs 3 hours ago
      All the names are Asian, and mostly Chinese:

      > Author Bohan Zhang

      > Acknowledgements Special thanks to Jon Lee, Sicheng Liu, Chaomin Yu, and Chenglong Hao, who contributed to this post, and to the entire team that helped scale PostgreSQL. We’d also like to thank the Azure PostgreSQL team for their strong partnership.

    • Natfan 4 hours ago
      Bohan Zhang, the article's author, is likely Chinese.

      e: and the link points to en-us at time of writing. I frankly don't see the value in your comment.

  • resdev an hour ago
    They could’ve used MongoDB, which is a web-scale NoSQL database, because SQL is 1990s-era legacy technology.

    /s

  • hu3 5 hours ago
    From what I understand, they basically couldn't scale writes in PostgreSQL to their needs and had to offload what they could to Azure's NoSQL database.

    I wonder, is there another popular OLTP database solution that does this better?

    > For write traffic, we’ve migrated shardable, write-heavy workloads to sharded systems such as Azure CosmosDB.

    > Although PostgreSQL scales well for our read-heavy workloads, we still encounter challenges during periods of high write traffic. This is largely due to PostgreSQL’s multiversion concurrency control (MVCC) implementation, which makes it less efficient for write-heavy workloads. For example, when a query updates a tuple or even a single field, the entire row is copied to create a new version. Under heavy write loads, this results in significant write amplification. It also increases read amplification, since queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one. MVCC introduces additional challenges such as table and index bloat, increased index maintenance overhead, and complex autovacuum tuning.
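
    The row-copy behavior they describe is easy to see in a toy example:

        -- Each UPDATE writes a whole new row version, even for a one-field
        -- change, and leaves the old version behind as a dead tuple.
        CREATE TABLE counters (id bigint PRIMARY KEY, n bigint, note text);
        INSERT INTO counters VALUES (1, 0, 'large unrelated column');

        UPDATE counters SET n = n + 1 WHERE id = 1;  -- copies the entire row
        UPDATE counters SET n = n + 1 WHERE id = 1;  -- and again

        SELECT n_tup_upd, n_dead_tup
        FROM pg_stat_user_tables
        WHERE relname = 'counters';  -- dead tuples linger until (auto)vacuum

    (HOT updates can at least avoid index churn when no indexed column changes and the new version fits on the same page, which is why fillfactor tuning comes up, but the full row copy still happens.)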

    • 0xdeafbeef 5 hours ago
      TiDB should handle it nicely. I've written 200k inserts/sec at peak for an hour. The underlying LSM works better for writes.
      • anonzzzies 4 hours ago
        That would mean it has improved somewhat. We always got better write performance from MySQL than Postgres, though that was a while ago; we then tried TiDB to go further, but it was rather slow. Again, a while ago.

        When did you get your results? It might be time to re-evaluate.