Pg_lake: Postgres with Iceberg and data lake access (github.com)

371 points by plaur782 3 months ago | 38 comments

boshomi3 months ago
Why not just use Ducklake?[1] That reduces complexity[2] since only DuckDB and PostgreSQL with pg_duckdb are required.
[1] https://ducklake.select/
[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o
- mslot3 months ago
  DuckLake is pretty cool, and we obviously love everything the DuckDB team is doing. It's what made pg_lake possible, and what motivated part of our team to step away from Microsoft/Citus.
  DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.
  Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.
  Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
  There's also the interoperability aspect of Iceberg being supported by other query engines.
  - jabr3 months ago
    How does this compare to https://www.mooncake.dev/pgmooncake? It seems there are several projects like this now, with each taking a slightly different approach optimized for different use cases?
    mslot3 months ago
    Definitely similar goals, from the Mooncake author: https://news.ycombinator.com/item?id=43298145
    I think pg_mooncake is still relatively early stage.
    There's a degree of maturity to pg_lake resulting from our team's experience working on extensions like Citus, pg_documentdb, pg_cron, and many others in the past.
    For instance, in pg_lake all SQL features and transactions just work, the hybrid query engine can delegate different fragments of the query into DuckDB if the whole query cannot be handled, and having a robust DuckDB integration with a single DuckDB instance (rather than 1 per session) in a separate server process helps make it production-ready. It is used in heavy production workloads already.
    No compromise on Postgres features is especially hard to achieve, but after a decade of trying to get there with Citus, we knew we had to get that right from day 1.
    Basically, we could speed run this thing into a comprehensive, production-ready solution. I think others will catch up, but we're not sitting still either. :)
    j_kao3 months ago
    FYI the mooncake team was acquired by Databricks so it's basically vendors trying to compete on features now :)
  - mritchie7123 months ago
    > For instance, it can handle >100k single row inserts/sec.
    DuckLake already has data-inlining for the DuckDB catalog, seems this will be possible once it's supported in the pg catalog.
    > Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.
    This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.
    derefr3 months ago
    > This is true, but it's not clear where I'd use this in practice. e.g. if I need to run a complex ETL job, I probably wouldn't do it in pg_cron.
    Think "tiered storage."
    See the example under https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...:
    select cron.schedule('flush-queue', '* * * * *', $$
      with new_rows as (
        delete from measurements_staging returning *
      )
      insert into measurements select * from new_rows;
    $$);
    The "continuous ETL" process the GP is talking about would be exactly this kind of thing, and just as trivial. (In fact it would be this exact same code, just with your mental model flipped around from "promoting data from a staging table into a canonical iceberg table" to "evicting data from a canonical table into a historical-archive table".)
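To make the staging-to-archive move concrete, here is a minimal sketch of the same idea outside Postgres. It uses Python's built-in sqlite3 purely as a stand-in, so nothing here is pg_lake or pg_cron API; the table names are borrowed from the cron job above and `flush_staging` is a hypothetical helper. The Postgres version does the move in a single `delete ... returning` statement, which also stays correct under concurrent writers; this single-connection sketch copies and deletes inside one transaction instead.

```python
import sqlite3


def flush_staging(conn: sqlite3.Connection) -> int:
    """Move every staged row into the main table in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute(
            "INSERT INTO measurements SELECT * FROM measurements_staging"
        )
        moved = cur.rowcount
        conn.execute("DELETE FROM measurements_staging")
    return moved


# In-memory database standing in for the heap (staging) and Iceberg (main) tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements_staging (sensor_id INTEGER, value REAL)")
conn.execute("CREATE TABLE measurements (sensor_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements_staging VALUES (?, ?)",
    [(1, 20.5), (2, 21.0), (1, 20.7)],
)
conn.commit()

moved = flush_staging(conn)
staged_left = conn.execute("SELECT COUNT(*) FROM measurements_staging").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```

Either both tables change or neither does, which is the property that makes the once-a-minute flush safe to retry.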
  - ijustlovemath3 months ago
    Not to mention one of my favorite tools for adding a postgres db to your backend service: PostgREST. Insanely powerful DB introspection and automatic REST endpoint. Pretty good performance too!
  - anktor3 months ago
    What does data frames mean in this context? I'm used to them in spark or pandas but does this relate to something in how duckDB operates or is it something else?
    dunefox3 months ago
    It's a python data frame.
- pgguru3 months ago
  Boils down to design decisions; see: https://news.ycombinator.com/item?id=45813631
- swasheck3 months ago
  i have so desperately wanted to love and use ducklake, but have come across some issues with it in practice (pg catalog). they seem to have to do with maintenance activities and ducklake suddenly throwing http/400 errors on files it created. i’m not sure if it’s due to my write patterns (gather data from sources into a polars dataframe and insert into the ducklake table from the df) into the partitioned tables or something else.
  it’s ok in dev/test and for me as the person in the team who’s enamored with duckdb, but it’s made the team experience challenging and so i’ve just kinda reverted to hive partitioned parquet files with a duckdb file that has views created on top of the parquet. attach that file as read only and query away.
  i may work up a full example to submit as an issue but up until now too many other things are dominating my time.
ozgune3 months ago
This is huge!
When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”
Crunchy’s Postgres extension is by far the most ahead solution in the market.
Huge congrats to Snowflake and the Crunchy team on open sourcing this.
- gigatexal3 months ago
  Honestly, just pay Snowflake for the amazing DB and ecosystem it is, and then go build cool stuff. Unless your value-add to customers is infra, let them handle all that.
  - thejosh3 months ago
    Sounds great until you're locked into Snowflake - so glad iceberg is becoming the standard, anything is great.
    The trap you end up in is you have to pay snowflake to access your data, iceberg and other technology help with the walled garden.
    Not just snowflake, any pay on use provider.
    (Context - have spent 5+ years working with Snowflake, it's great, have built drivers for various languages, etc).
    gigatexal3 months ago
    Locked in? I mean they’re your partner. As long as you’re deriving value from them the partnership is still valuable no?
    thejosh3 months ago
    Every time you want to query your data, you need to pay the compute cost.
    If instead you can write to something like Parquet/Iceberg, you're not paying to access your data.
    Snowflake is great at aggregations and other stuff (seriously, huge fan of Snowflake's SQL capabilities), but let's say you have a visualisation tool: you're paying for pulling data out.
    If you write the data to something like S3 instead, you can hook up your tools to that.
    It's expensive to pull data out of Snowflake otherwise.
    gigatexal3 months ago
    You people can’t be serious, right?
    Ok so I build my data lake on s3 using all open tech. I’m still paying for S3 for puts and reads and lists.
    Ok I put it on my own hardware. In my own colo. you’re still paying electricity and other things. Everything is lock in.
    On top of that you’re beholden to an entire community of people and volunteers to make your tech work. Need a feature? Sponsor it. Or write it and fight to upstream it. On top of that if you do this at scale at a company what about the highly paid team of engineers you have to have to maintain all this?
    With snowflake I alone could provide an entire production ready bi stack to a company. And I can do so and sleep well at night knowing it’s managed and taken care of and if it fails entire teams of people are working to fix it.
    Are you going to build your own roads, your own power grid, your own police force?
    Again my point remains. The vast majority of times people build on a vendor as a partner and then go on to build useful things.
    Apple uses cloud vendors for iCloud storage. You think they couldn’t do it themselves? They couldn’t find, pay for, and support all the tech on their own? Of course they could. But they have better things to do than reinvent the wheel, i.e. building value on top of dumb compute, and that’s iCloud.
    thejosh3 months ago
    After running Snowflake in production for 5+ years I would rather have my data on something like Parquet/Iceberg (which Snowflake fully supports...) than in the table format Snowflake has.
    It's not that deep
    gigatexal3 months ago
    Ok. And this flexibility is only really possible since they did a lot of work to make external and internal tables roughly equivalent in performance.
    thejosh3 months ago
    Yeah, performance depends.
    I think a hybrid approach works best (store on Snowflake native and iceberg/tables where needed), and allows you the benefit of Snowflake without paying the cost for certain workloads (which really adds up).
    We're going to see more of this (either open or closed source), since Snowflake has acquired Crunchydata, and the last major bastion is "traditional" database <> Snowflake.
    gigatexal3 months ago
    I had no idea they did. This pg lake announcement dropped that nugget and i was surprised.
    gigatexal3 months ago
    Agreed btw.
    nxm3 months ago
    They didn't do it out of good will. They realized that's where the market was going and if their query engine didn't perform as well as others on top of iceberg, then they'd be another Oracle in the long-term.
    kortilla3 months ago
    Yes, don’t be obtuse. “Vendor lock-in” is not some foreign unheard of concept.
    repeekad3 months ago
    Teams of the smartest people on earth make these kinds of big vendor decisions, and vendor lock-in is top of mind. I tell anyone who will listen to avoid Databricks live tables and their sleazy sales reps pushing it over cheaper, less locked-in solutions.
    nojvek3 months ago
    Not all vendors are same. Snowflake charges an arm and leg for compute.
    It’s 36x more expensive than equivalent EC2 compute.
    enether3 months ago
    yeah, this exchange reads like a sales ad
    kdazzle3 months ago
    Snowflake is expensive, even compared to Databricks, and you pay their pre-AWS discount storage price while they get the discount and pocket the difference as profit
anentropic3 months ago
When Snowflake bought Crunchy Data I was hoping they were going to offer a managed version of this
It's great that I can run this locally in a Docker container, I'd love to be able to run a managed instance on AWS billed through our existing Snowflake account
gajus3 months ago
Man, we are living in the golden era of PostgreSQL.
NeutralCrane3 months ago
I’m not a data engineer but work in an adjacent role. Is there anyone here who could dumb the use case down? Maybe an example of a problem this solves. I am struggling to understand the value proposition here.
- ggregoire3 months ago
  > Maybe an example of a problem this solves.
  Some service writes a lot of data in parquet files stored on S3 (e.g. logs), and now you want that data to be queryable from your application as if it was in postgres (e.g. near real-time analytics dashboard). pg_lake allows you to load these parquet files into postgres and query the data. You can also join that data with existing tables in postgres.
  - smithclay3 months ago
    Been experimenting with OpenTelemetry->Parquet conversion lately for logs, metrics, and traces. Lots of related projects popping up in this area. It's powerful and cheap.
    - https://github.com/smithclay/otlp2parquet (shameless plug, based on Clickhouse's Otel schema) - https://github.com/Mooncake-Labs/moonlink (also has OTLP support) - https://github.com/open-telemetry/otel-arrow (official community project under early dev)
  - NeutralCrane3 months ago
    I guess my confusion is that there already are ways to query this data with DuckDB or something like that. So is the magic here that it’s Postgres? What makes being able to query something in Postgres special? And when we say it’s now queryable by Postgres, does this mean that it takes that data and stores it in your PG db? Or it remains in S3 and this is a translation layer for querying with PG?
    ch71r223 months ago
    Not sure if I have this right but this is how I understand it
    > So is the magic here that it's Postgres? What makes being able to query something in Postgres special?
    There are a bunch of pros and cons to using Postgres vs. DuckDB. The basic difference is OLTP vs. OLAP. It seems pg_lake aims to give you the best of both. You can combine analytics queries with transactional queries.
    pg_lake also stores and manages the Iceberg catalog. If you use DuckDB you'll need to have an external catalog to get the same guarantees.
    I think if you're someone who was happy using Postgres, but had to explore alternatives like DuckDB because Postgres couldn't meet your OLAP needs, a solution like pg_lake would make your life a lot simpler. Instead of deploying a whole new OLAP system, you basically just install this extension and create the tables you want OLAP performance from with `create table ... using iceberg`
    > when we say it’s now queryable by Postgres, does this mean that it takes that data and stores it in your PG db?
    Postgres basically stores pointers to the data in S3. These pointers are in the Iceberg catalog that pg_lake manages. The tables managed by pg_lake are special tables defined with `create table ... using iceberg` which stores the data in Iceberg/Parquet files on S3 and executes queries partially with the DuckDB engine and partially with the Postgres engine.
    It looks like there is good support for copying between the Iceberg/DuckDB/Parquet world and the traditional Postgres world.
    > Or it remains in S3 and this is a translation layer for querying with PG?
    Yes I think that's right -- things stay in S3 and there is a translation layer so Postgres can use DuckDB to interact with the Iceberg tables on S3. If you're updating a table created with `create table ... using iceberg`, I think all the data remains in S3 and is stored in Parquet files, safely/transactionally managed via the Iceberg format.
    https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...
ayhanfuat3 months ago
With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.
fifilura3 months ago
How do you use your data lake? For me it is much more than just storing data, it is just as much for crunching numbers in unpredictable ways.
And this is where postgres does not cut it.
You need some more CPU and RAM than what you pay for in your postgres instance. I.e. a distributed engine where you don't have to worry about how big your database instance is today.
- wodenokoto3 months ago
  The point of a data lake is to separate compute and storage. Postgres isn’t a compute layer; it’s an access layer.
  Your compute asks Postgres “what is the current data for these keys?” Or “what was the current data as of two weeks ago for these keys?” And your compute will then download and aggregate your analytics query directly from the parquet files.
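The "as of two weeks ago" lookup can be sketched as a search over a snapshot log. This is only an illustration of the idea, not pg_lake's or Iceberg's actual catalog code: `Snapshot`, `files_as_of`, and the file names are hypothetical, and the log is assumed to be sorted by commit time.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    committed_at: int   # commit time, seconds since epoch
    data_files: tuple   # Parquet files that make up the table at this snapshot


def files_as_of(snapshots, as_of):
    """Return the data files of the latest snapshot committed at or before as_of."""
    times = [s.committed_at for s in snapshots]  # assumed sorted ascending
    idx = bisect_right(times, as_of)
    if idx == 0:
        raise LookupError("table did not exist at that time")
    return snapshots[idx - 1].data_files


log = [
    Snapshot(100, ("a.parquet",)),
    Snapshot(200, ("a.parquet", "b.parquet")),
    Snapshot(300, ("c.parquet",)),  # e.g. after files were rewritten
]
```

The catalog only answers "which files?"; the compute engine then reads those Parquet files directly from object storage.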
  - enether3 months ago
    but most serious compute engines already speak Iceberg, what do they gain from interfacing with PG now?
    My understanding is the opposite - PG cuts it as a compute layer for small amounts of data, and this is where it excels.
    I also assume `pg_lake` was built mainly with the intention of creating/writing tables, and the ability to read comes "for free" as an extra, since Iceberg integration is already written.
  - fifilura3 months ago
    Sounds more like you need postgres as a backend than vice versa.
hamasho3 months ago
I like data lakes and their SQL-like query languages. They feel like an advanced version of the "Everything is a file" philosophy.
Under "Everything is a file", you can read or manipulate a wide variety of information via simple, open/read/write() APIs. Linux provides APIs to modify system settings via filesystem. Get the screen brightness setting? `cat /sys/class/backlight/device0/brightness`. Update? `echo 500 > /sys/class/backlight/device0/brightness`. No need for special APIs, just generic file operations and the kernel handles everything.
FUSE (Filesystem in Userspace) provides even more flexibility by allowing user space programs to build their own drivers that handle any data operation via the filesystem. You can mount remote systems (via SSH) and google drive, and copying files is as easy as `cp /mnt/remote/data/origin /mnt/googledrive/data/`. Or, using more unusual FUSE filesystems like pgfuse and redisfs, updating a redis value from postgres DB data is just `cat /mnt/postgres/users/100/full_name > /mnt/redis/user_100_full_name`.
But filesystems are only good for hierarchical data, while a lot of real world data is relational. A lot of FUSE software tries hard to represent inherently non-hierarchical data in a filesystem. Data lakes let you use SQL, the elegant abstraction for relational data, across different underlying data sources. They can be physically distant and have different structures. A lot of real world applications are just CRUD on relational data. You can accomplish much more, much more easily, if that data is just one big database.
dharbin3 months ago
Why would Snowflake develop and release this? Doesn't this cannibalize their main product?
- barrrrald3 months ago
  One thing I admire about Snowflake is a real commitment to self-cannibalization. They were super out front with Iceberg even though it could disrupt them, because that's what customers were asking for and they're willing to bet they'll figure out how to make money in that new world
  Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195
  - qaq3 months ago
    Have you interacted with Snowflake teams much? We are using external iceberg tables with snowflake. Every interaction pretty much boils down to you really should not be using iceberg you should be using snowflake for storage. It's also pretty obvious some things are strategically not implemented to push you very strongly in that direction.
    barrrrald3 months ago
    Not surprised - this stuff isn’t fully mature yet. But I interact with their team a lot and know they have a commitment to it (I’m the other guy in that video)
    ozkatz3 months ago
    Out of curiosity - can you share a few examples of functionality currently not supported with Iceberg but that works well with their internal format?
    qaq3 months ago
    Even partition elimination is pretty primitive. For the query optimizer, Iceberg is really not a primary target. The overall interaction with even technical people gives a strong "this is a sales org that happens to own an OLAP db product" vibe.
    andiz3 months ago
    I have to very much disagree on that. All pruning techniques in Snowflake work equally well both on their proprietary format as well for Iceberg tables. Iceberg is nowadays a first-class citizen in Snowflake, with pruning working at the file level, row group level, and page level. Same is true for other query optimization techniques. There is even a paper on that: https://arxiv.org/abs/2504.11540
    Where pruning differences might arise for Iceberg tables is the structure of Parquet files and the availability of metadata. Both depend on the writer of the Parquet files. Metadata might be completely missing (e.g., no per column min/max), or partially missing (e.g., no page indexes), which will indeed impact the perf. This is why it's super important to choose a writer that produces rich metadata. The metadata can be backfilled / recomputed after the fact by the querying engine, but it comes at a cost.
    Another aspect is storage optimization: The ability to skip / prune files is intrinsically tied to the storage optimization quality of the table. If the table is neither clustered nor partitioned, or if the table has sub-optimally sized files, then all of these things will severely impact any engine's ability to skip files or subsets thereof.
    I would be very curious if you can find a query on an Iceberg table that shows a better partition elimination rate in a different system.
    qaq3 months ago
    Sure: `select distinct customer_id ...` where customer_id is the first part of the partition key. You really don't need to do a table scan to resolve that, do you?
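A rough sketch of the shortcut being described: if a column is an identity partition field, its distinct values are already recorded per data file in the table metadata. The manifest layout and the `distinct_partition_values` helper below are hypothetical, and the sketch ignores row-level deletes, which a real engine would have to account for before trusting the metadata alone.

```python
def distinct_partition_values(manifest_entries, field):
    """Distinct values of an identity-partitioned field, from file metadata only."""
    return sorted(
        {
            entry["partition"][field]
            for entry in manifest_entries
            if entry["record_count"] > 0  # skip files that hold no rows
        }
    )


# Hypothetical manifest entries: one per data file; no Parquet data is opened.
manifest = [
    {"path": "f1.parquet", "partition": {"customer_id": 7}, "record_count": 120},
    {"path": "f2.parquet", "partition": {"customer_id": 7}, "record_count": 80},
    {"path": "f3.parquet", "partition": {"customer_id": 42}, "record_count": 15},
    {"path": "f4.parquet", "partition": {"customer_id": 99}, "record_count": 0},
]
```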
  - blef3 months ago
    Supporting Iceberg eventually means people can leave you because they find something better elsewhere, but this is bidirectional: it also means you can welcome people from Databricks because you have feature parity.
- kentm3 months ago
  It's not going to scale as well as Snowflake, but it gets you into an Iceberg ecosystem which Snowflake can ingest and process at scale. Analytical data systems are typically trending to heterogenous compute with a shared storage backend -- you have large, autoscaling systems to process the raw data down to something that is usable by a smaller, cheaper query engine supporting UIs/services.
  - hobs3 months ago
    But if you are used to this type of compute per dollar what on earth would make you want to move to Snowflake?
    kentm3 months ago
    Different parts of the analytical stack have different performance requirements and characteristics. Maybe none of your stack needs it and so you never need Snowflake at all.
    More likely, you don't need Snowflake to process queries from your BI tools (Mode, Tableau, Superset, etc), but you do need it to prepare data for those BI tools. It's entirely possible that you have hundreds of terabytes, if not petabytes, of input data that you want to pare down to < 1 TB datasets for querying, and Snowflake can chew through those datasets. There's also third party integrations and things like ML tooling that you need to consider.
    You shouldn't really consider analytical systems the same as a database backing a service. Analytical systems are designed to funnel large datasets that cover the entire business (cross cutting services and any sharding you've done) into subsequently smaller datasets that are cheaper and faster to query. And you may be using different compute engines for different parts of these pipelines; there's a good chance you're not using only Snowflake but Snowflake and a bunch of different tools.
- mslot3 months ago
  When we first developed pg_lake at Crunchy Data and defined GTM we considered whether it could be a Snowflake competitor, but we quickly realised that did not make sense.
  Data platforms like Snowflake are built as a central place to collect your organisation's data, do governance, large scale analytics, AI model training and inference, share data within and across orgs, build and deploy data products, etc. These are not jobs for a Postgres server.
  Pg_lake foremost targets Postgres users who currently need complex ETL pipelines to get data in and out of Postgres, and accidental Postgres data warehouses where you ended up overloading your server with slow analytical queries, but you still want to keep using Postgres.
- 9999000009993 months ago
  It'll probably be really difficult to set up.
  If it's anything like Supabase, you'll question the existence of God when trying to get it to work properly.
  You pay them to make it work right.
  - pgguru3 months ago
    For testing, we at least have a Dockerfile to automate the setup of the pgduck_server and a minio instance so it Just Works™ with the extensions installed in your local Postgres cluster (after installing the extensions).
    The configuration mainly involves just defining the default iceberg location for new tables, pointing it to the pgduck_server, and providing the appropriate auth/secrets for your bucket access.
enether3 months ago
Do I understand it correctly that DuckDB would run embedded on the machine running Postgres (i.e through the extension), and this limits query processing ability to whatever that machine can comfortably handle?
What are the deployment implications if one wants to integrate this in production? Surely they'd need a much larger Postgres machine at a minimum.
Is there concern re: "hot neighbour" problems if the DuckDB queries get too heavy? How is that sort of issue potentially handled? I understood from another query that DuckDB is run in a separate process, so there is room to potentially throttle it.
darth_avocado3 months ago
This is so cool! We have files in Iceberg that we then move data to/from to a PG db using a custom utility. It always felt more like a workaround that didn’t fully use the capabilities of both the technologies. Can’t wait to try this out.
max_streese3 months ago
Two questions:
(1) Are there any plans to make this compatible with the ducklake specification? Meaning: Instead of using Iceberg in the background, you would use ducklake with its SQL tables? My knowledge is very limited but to me, besides leveraging duckdb, another big point of ducklake is that it's using SQL for the catalog stuff instead of a confusing mixture of files, thereby offering a bunch of advantages like not having to care about number of snapshots and better concurrent writes.
(2) Might it be possible that pg_duckdb will achieve the same thing in some time or do things not work like that?
- mslot3 months ago
  (1) We've thought about it, no current plans. We'd ideally reimplement DuckLake in Postgres directly such that we can preserve Postgres transaction boundaries, rather than reuse the Ducklake implementation that would run in a separate process. The double-edged sword is that there's a bunch of complexity around things like inlined data and passing the inlined data into DuckDB at query time, though if we can do that then you can get pretty high transaction performance.
  (2) In principle, it's a bit easier for pg_duckdb to reuse the existing Ducklake implementation because DuckDB sits in every Postgres process and they can call into each other, but we feel that architecture is less appropriate in terms of resource management and stability.
oulipo23 months ago
Interesting! How does it compare with ducklake?
- mslot3 months ago
  You could say
  With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.
  With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.
  Of course, they also use different table formats (though similar in data layer) with different pros and cons, and the query frontends differ in significant ways.
  An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.
dkdcio3 months ago
I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! very interesting development in the data lake space alongside DuckLake and things
- pgguru3 months ago
  You create foreign tables in postgres using either the pg_lake_table wrapper or pg_lake_iceberg.
  Once those tables exist, queries against them either push down entirely to the remote tables, using a Custom Scan to execute and pull results back into postgres, or we transform/extract the pieces that can be executed remotely using a FDW and then treat it as a tuple source.
  In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.
  - spenczar53 months ago
    I think I don't understand postgres enough, so forgive this naive question, but what does pushing down to the remote tables mean? Does it allow parallelism? If I query a very large iceberg table, will this system fan the work out to multiple duckdb executors and gather the results back in?
    pgguru3 months ago
    In any query engine you can execute the same query in different ways. The more restrictions that you can apply on the DuckDB side the less data you need to return to Postgres.
    For instance, you could compute a `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning all the rows, and letting Postgres itself count the number of results, but this is extremely inefficient, since that same value can be computed remotely.
    In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.
    Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.
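As a toy illustration of that analysis step (not pg_lake's actual planner logic), one can split AND-ed filter conditions into those that are safe to run remotely and those that must stay local. The `PUSHDOWN_SAFE` allow-list, the `Predicate` type, and the `bio` column are all assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical allow-list: operators assumed to behave identically in both engines.
PUSHDOWN_SAFE = {"=", "<", ">", "<=", ">="}


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str
    value: object


def split_conjuncts(predicates):
    """Split AND-ed predicates into (remote, local) lists."""
    remote = [p for p in predicates if p.op in PUSHDOWN_SAFE]
    local = [p for p in predicates if p.op not in PUSHDOWN_SAFE]
    return remote, local


where = [
    Predicate("first_name", "=", "David"),
    Predicate("bio", "~*", "postgres"),  # regex match: treated as not safe here
]
remote, local = split_conjuncts(where)
```

Because the conditions are AND-ed, applying the remote ones first and the local ones afterwards gives the same rows as applying them all in one place; the remote side just returns fewer rows to filter.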
    There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.
    How DuckDB executes the portion of the query it gets is up to it; it often will involve parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` in the incoming query and just skip any files which do not have a min_value/max_value that would contain that.
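The min/max file skipping mentioned here can be sketched in a few lines. This is a simplified stand-in for what a Parquet-aware engine does with column statistics; `FileStats`, `files_to_scan`, and the sample files are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: str  # smallest first_name stored in the file
    max_value: str  # largest first_name stored in the file


def files_to_scan(files, value):
    """Keep only files whose [min, max] range could contain value."""
    return [f.path for f in files if f.min_value <= value <= f.max_value]


files = [
    FileStats("part-0.parquet", "Aaron", "Carl"),
    FileStats("part-1.parquet", "Charles", "Edgar"),
    FileStats("part-2.parquet", "Frank", "Zoe"),
]
```

A file that survives the check may still contain no match, so the statistics only rule files out; they never replace the actual scan of the remaining files.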
    spenczar53 months ago
    Thanks for the detailed answer!
    I use DuckDB today to query Iceberg tables. In some particularly gnarly queries (huge DISTINCTs, big sorts, even just selects that touch extremely heavy columns) I have sometimes run out of memory in that DuckDB instance.
    I run on hosts without much memory because they are cheap, and easy to launch, giving me isolated query parallelism, which is hard to achieve on a single giant host.
    To the extent that it's possible, I dream of being able to spread those gnarly OOMing queries across multiple hosts; perhaps the DISTINCTs can be merged for example. But this seems like a pretty complicated system that needs to be deeply aware of Iceberg partitioning ("hidden" in pg_lake's language), right?
    Is there some component in the postgres world that can help here? I am happy to continue over email, if you prefer, by the way.
    pgguru3 months ago
    Well, dealing with large analytics queries will always perform better with larger amounts of memory... :D You can perhaps tune things to perform based on the amount of system memory (IME 80% is what DuckDB targets if not otherwise configured). Your proposed system does sound like it introduces quite a bit of complexity that would be better served just by using hosts with more memory.
    As far as Iceberg is concerned, DuckDB has its own implementation, but we do not use that; pg_lake has its own iceberg implementation. The partitioning is "hidden" because it is separated out from the schema definition itself and can be changed gradually without the query engine needing to care about the details of how things are partitioned at read time. (For writes, we respect the latest partitioning spec and always write according to that.)
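One way to picture "hidden" partitioning: the partition value is derived from a regular column by a transform, so queries filter on the column itself and never mention a partition column. The sketch below assumes a day-granularity transform (days since 1970-01-01 UTC, in the style of Iceberg's day transform); the function names are hypothetical and this is not pg_lake code.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Days since 1970-01-01 (UTC) for a timezone-aware timestamp."""
    return (ts - EPOCH).days


def partitions_for_range(start: datetime, end: datetime) -> range:
    """Partition values a reader must visit for start <= ts <= end."""
    return range(day_transform(start), day_transform(end) + 1)


start = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 3, 1, 0, tzinfo=timezone.utc)
needed = list(partitions_for_range(start, end))
```

Since the mapping lives in table metadata rather than in the schema, the transform can change for new data while old files keep the values they were written with.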
    enether3 months ago
    What does "remotely" mean in this context? My understanding is that all of this runs on the same machine - your Postgres server machine runs DuckDB on the same machine via the extension.
    I assume you simply mean DuckDB, being a columnar engine, is more efficient in doing this work than PG is
whalesalad3 months ago
RDS really needs to make it easy to install your own PG modules.
- anentropic3 months ago
  110% this!
harisund19903 months ago
This is cool to see! Looks like a competitor to pg_mooncake, which Databricks acquired. But how is this different from pg_duckdb?
lysecret3 months ago
One use case I see for this: I have encountered a lot of “hot cache for some time, then offload for historical queries” use cases, which I have built by hand multiple times. This should be a great fit. E.g. write to Postgres, then periodically offload to a lakehouse and even query the two together (if needed). Very cool!
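A rough sketch of the hot-then-offload pattern, using an in-memory SQLite database purely as a stand-in so the example runs anywhere. The table names and the `offload` helper are hypothetical. With pg_lake the archive would presumably be an Iceberg table and the hot table a regular heap table; that mapping is an assumption here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_hot (id INTEGER PRIMARY KEY, day TEXT, payload TEXT);
    CREATE TABLE events_archive (id INTEGER PRIMARY KEY, day TEXT, payload TEXT);
    INSERT INTO events_hot (day, payload) VALUES
        ('2024-01-01', 'a'), ('2024-01-02', 'b'), ('2024-01-03', 'c');
""")


def offload(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move rows older than `cutoff` to the archive in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO events_archive SELECT * FROM events_hot WHERE day < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM events_hot WHERE day < ?", (cutoff,)
        ).rowcount
    return moved


moved = offload(conn, "2024-01-03")

# Query hot and archived rows together when needed.
all_days = [row[0] for row in conn.execute(
    "SELECT day FROM events_hot UNION ALL SELECT day FROM events_archive ORDER BY day"
)]
```

Doing the insert and delete in one transaction is the point: a reader never sees a row in both places or in neither.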
spenczar53 months ago
Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to iceberg directly, without going through pg_lake, will pg_lake's catalog correctly reflect things right away?
- pgguru3 months ago
  We have some level of external iceberg table read-only support, but it is limited at the moment. See this example/caveat: https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/fil...
- mslot3 months ago
  You can use it as a read layer for a specific metadata JSON URL or a table in a REST catalog. The latter got merged quite recently and is not yet in the docs.
flarco3 months ago
For anyone looking to easily ingest data into a Postgres Wire compatible database, check out https://github.com/slingdata-io/sling-cli. Use CLI, YAML or Python to run etl jobs.
drchaim3 months ago
More integrations are great. Anyway, the "this is awesome" moment (for me) will be when you could mix row- and column-oriented tables in Postgres, a bit like Timescale but native Postgres and well done. Hopefully one day.
- gregw23 months ago
  I want MPP HTAP where SQL inserts/COPYs store data in three(!) formats:
    - row-based (low latency insert, fast row-based indexed query for single-row OLTP)
    - columnar-based (slow inserts/updates, fast aggregates/projections)
    - iceberg-columnar-based (better OLAP price/performance and less lockin than native columnar)
  And for SELECTs the query engine picks which storage engine satisfies the query using some SQL extension like DB2 "WAITFORDATA" or TiDB @@tidb_read_staleness or MemSQL columnstore_latency and/or similar signalling for performance-vs-cost preference.
  And a common permissioning/datasharing layer so I can share data to external and internal parties who can in turn bring their own compute to make their own latency choices.
- pgguru3 months ago
  Hypertables definitely had the arrays columns auto-expanding with the custom node type. Not sure what else it would look like for what you describe.
  That said, don't sleep on the "this is awesome" parts in this project... my personal favorite is the automatic schema detection:
```
CREATE TABLE my_iceberg_table ()
USING iceberg
WITH (definition_from = 's3://bucket/source_data.parquet');
```
lysecret3 months ago
Nice, does this also allow me to write to Parquet from my Postgres table?
- mslot3 months ago
  Yes, just COPY table TO 's3://mybucket/data.parquet'
  Or COPY table TO STDOUT WITH (format 'parquet') if you need it on the client side.
- lysecret3 months ago
  Omg yes, it works. This would have made my past job so much easier.
pjd73 months ago
This is awesome, I will be trying this out in the coming months. It's just made it to the top of my R&D shortlist for things that could massively simplify our data stack for a b2b saas.
inglor3 months ago
This is really nice, though looking at the code, a lot of the Postgres types are missing, as well as a lot of the newer Parquet logical types. But this is a great start and a nice use of FDW.
- pgguru3 months ago
  Hi, what types are you expecting to see that aren't supported? I believe we had support for most/all builtin postgres types.
  - inglor3 months ago
    Postgres has like 300+ types, but mostly stuff like decimals should work the same way it does with Postgres (with edge cases like NaN, which exists in Postgres but not in Parquet, handled accordingly)
    mslot3 months ago
    In principle, Postgres has an infinite number of possible types :).
    pg_lake maps types into their Parquet equivalent and otherwise stores them as their text representation. There are a few limitations, like very large numerics.
    https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/ice...
- inglor3 months ago
  Also, any planned support for more catalogs?
  - pgguru3 months ago
    I think we have recently merged (or are getting ready to merge) REST catalog support, so that will open some things up in this department.
fridder3 months ago
I love this. There are definitely shops where the data is a bit too much for postgres but something like Snowflake would be overkill. Wish this was around a couple years ago lol
apexalpha3 months ago
I’m not super into the Data sphere but my company relies heavily on Snowflake which is becoming an issue.
This announcement seems huge to me, no?!
Is this really an open source Snowflake covering most use cases?
- taude3 months ago
  there's also plenty of other options for warehouse/compute processing of iceberg data storage.
  I think this is a pretty big deal, though.
  Snowflake does a lot more, though, especially around sharing data across company boundaries.
iamcreasy3 months ago
Very cool! Was there any inherent limitation with postgresql or its extension system that forced pg_lake to use duckdb as query engine?
- mslot3 months ago
  I gave a talk on that at Data Council, then still discussing the pg_lake extensions as part of Crunchy Data Warehouse.
  https://youtu.be/HZArjlMB6W4?si=BWEfGjMaeVytW8M1
  Also, nicer recording from POSETTE: https://youtu.be/tpq4nfEoioE?si=Qkmj8o990vkeRkUa
  It comes down to the trade-offs made by operational and analytical query engines being fundamentally different at every level.
- pgguru3 months ago
  DuckDB provided a lot of infrastructure for reading/writing Parquet files and other common formats here. It also was inherently multi-threaded and supported being embedded in a larger program (similar to SQLite), which made it a good basis for something that could work outside of the traditional process model of Postgres.
  Additionally, the postgres extension system supports most of the current project, so wouldn't say it was forced in this case, it was a design decision. :)
claudeomusic3 months ago
Can someone dumb this down a bit for a non data-engineer? Hard to fully wrap my head around who this is/isn’t best suited for.
- lysecret3 months ago
  One use case we have (we built it ourselves) is to periodically offload data from Postgres to lakehouse-partitioned data on GCS. The way I see it, this can now be done with a single query. Another one is the other way around: to use Postgres as a query engine, or to merge offloaded data with your live data.
  - claudeomusic3 months ago
    What services do people primarily use to accomplish these tasks today? All custom work?
mberning3 months ago
Does anyone know how access control works to the underlying s3 objects? I didn’t see anything regarding grants in the docs.
- pgguru3 months ago
  Hi, one of the developers here. You define credentials that can access the S3 buckets and use those as DuckDB secrets, usually in an init script for pgduck_server. (You can see some examples of this in the testing framework.)
  I'll see if we can improve the docs or highlight that part better, if it is already documented—we did move some things around prior to release.
  - mberning3 months ago
    Interesting. I am working on a project to integrate access management to iceberg/parquet files for sagemaker. Controlling what users logged into sagemaker studio have access to in s3. It’s fine using static policies for mvp, but eventually it needs to be dynamic and integrated into enterprise iam tools. Those tools generally have great support for managing sql grants. Not so much for s3 bucket policies.
    pgguru3 months ago
    DuckDB secrets management supports custom IAM roles and the like; at this point we are basically treating the pgduck_server external system as a black box.
    For the postgres grants themselves, we provide privs to allow read/write to the remote tables, which is done via granting the `pg_lake_read`, `pg_lake_write` or `pg_lake_read_write` grants. This is a blanket all-or-nothing grant, however, so would need some design work/patching to support per-relation grants, say.
    (You could probably get away with making roles in postgres that have the appropriate read/write grant, then only granting those specific roles to a given relation, so it's probably doable though a little clunky at the moment.)
  - onderkalaci3 months ago
    Maybe this could help: https://github.com/Snowflake-Labs/pg_lake?tab=readme-ov-file...
- mslot3 months ago
  There are Postgres roles for read/write access to the S3 object that DuckDB has access to. Those roles can create tables from specific files or at specific locations, and can then assign more fine-grained privileges to other Postgres roles (e.g. read access on a specific view or table).
chaps3 months ago
I love postgres and have created my own "data lake" sorta systems -- what would this add to my workflows?
iamcreasy3 months ago
If anyone from Supabase is reading, it would be awesome to have this extension!
logicartisan3 months ago
It’s amazing to see Postgres growing into something this powerful
scirob3 months ago
Crunchydata did it first :) but nice to get more options
- mslot3 months ago
  It's the same team and same project :). Crunchy Data was acquired by Snowflake.
  - scirob3 months ago
    Holy shit that's amazing!! Congrats to the team
beoberha3 months ago
Curious why pgduck_server is a totally separate process?
- rmnclmnt3 months ago
  The README explains it:
  > This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
- pgguru3 months ago
  What has been pointed out from the README; also:
    - Separation of concerns, since with a single external process we can share object store caches without complicated locking dances between multiple processes.
    - Memory limits are easier to reason about with a single external process.
    - Postgres backends end up being more robust, as you can restart the pgduck_server process separately.
- dkdcio3 months ago
  from the README:
  > This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
  - beoberha3 months ago
    Thanks! Didn’t scroll down far enough
hexo3 months ago
Oh datalakes. The most ridiculous idea in data processing, right after data frames in python.
We had this discussion like a week ago about how stupid it is to use a filesystem for this kind of data storage, and here we go again. Actually I had to implement this "idea" in practice. What nonsense.