I'm aware of Pandera [1], which supports Polars as well, but while nice, it doesn't cause the code to fail to compile; it only fails at runtime. To me this is the Achilles' heel of analysis in both Python and R.
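For instance, a minimal sketch using Pandera's pandas-flavored API (the column names are just illustrative):

    import pandas as pd
    import pandera as pa

    # Schema says the frame must have float "ask" and "bid" columns.
    schema = pa.DataFrameSchema({
        "ask": pa.Column(float),
        "bid": pa.Column(float),
    })

    df = pd.DataFrame({"asks": [1.1], "bid": [1.0]})  # "ask" misspelled as "asks"
    schema.validate(df)  # raises a SchemaError, but only once this line actually runs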
Does anybody have ideas on how this situation could be improved?
https://www.empirical-soft.com
It can infer the column names and types from a CSV file at compile time.
Here's an example that misspells the "ask" column as if it were plural:
let quotes = load("quotes.csv")
sort quotes by (asks - bid) / bid
The error is caught before the script is run: Error: symbol asks was not found
I had to use a lot of computer-science techniques to get this working, like type providers and compile-time function evaluation. I'm really proud of the novelty of it and even won Y Combinator's Startup School grant for it. Unfortunately, it didn't go anywhere as a project. Turns out that static typing isn't enough of a selling point for people to drop Python. I haven't touched Empirical in four years, but my code and my notes are still publicly available on the website.
I love how you really expanded on the idea of executing code at compile time. You should be proud.
You probably already know this, but for people like me to switch, "all" it would take would be:
1. A plotting library like ggplot2 or plotnine
2. A machine learning library, like scikit
3. A dashboard framework like streamlit or shiny
4. Support for Empirical in my cloud workspace environment, which is Jupyter-based and where I have to execute all the code, because that's where the data is and has to stay due to security
Just like how Polars is written in Rust and has Python bindings, I wonder if there's a market for 1 and 2 written in Rust and then having bindings to Python, Empirical, R, Julia etc. I feel like 4 is just a matter of time if Empirical becomes popular, but I think 3 would have to be implemented specifically for Empirical.
I think the idea of statically typed dataframes is really useful and you were ahead of your time. Maybe one day the time will be right.
There is no magic here. No language can guess the type of anything without seeing what the thing is.
A tantalising idea i have not explored, is to try and hook up polars' lazy query planner to a static typing plugin. The planner already has basically complete knowledge of the schema at every point, right?
So in theory this could be used to give the really good inference abilities that a static typing system needs to be nice to use.
Otherwise, it feels so broken to just pass a dataframe around. It’s like typing everything as a “dict” and hoping for the best. It’s awful.
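As a rough sketch of what the planner already knows (assuming a recent Polars version where LazyFrame.collect_schema() is available):

    import polars as pl

    lf = (
        pl.LazyFrame({"bid": [1.0], "ask": [1.1]})
        .with_columns(spread=(pl.col("ask") - pl.col("bid")) / pl.col("bid"))
    )

    # The output schema of the plan is known without executing anything:
    print(lf.collect_schema())  # bid, ask and spread with their dtypes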
let df = df![
"name" => ["Alice", "Bob", "Charlie"],
"age" => [25, 30, 35]
]?;
let ages = df.column("age")?;
There's no Rust type-level knowledge of what type the "age" or "name" column is, for example. The result of df.column is a Series, which has to be cast to a Rust type based on the developer's knowledge of what the column is expected to contain. You can do things like this:
let oldies = df.filter(&df.column("age")?.gt(30)?)?;
So the casting can be automatic, but this will fail at runtime if the age column doesn't contain numeric values. One type-related feature that Polars does have: because the contents of a Series are represented as a typed Rust value, all values in a Series must have the same type. This is a constraint compared to traditional dataframes, but it provides a performance benefit when processing large series. You can cast an entire Series to a typed Rust value efficiently, and then operate on the result in a typed fashion.
But as you said, you can’t use Python libraries directly with Polars dataframes. You’d need conversion and foreign function interfaces. If you need that, you’d probably be better off just using Python.
And yeah, my big "???" with all of these is: lacking dependent typing or an equivalent for row types, it's hard for mypy and friends to statically track which columns exist and what types they have. And even if we're willing to be explicit about wrapping each DF with a manual definition, basically an Arrow schema, I don't think any of these libraries make that convenient? (And is that natively supported by any?)
In louie.ai, we generate Python for users, so we can have it generate the types as well... But we haven't found a satisfactory library for that so far...
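A minimal sketch of the "manual definition" workaround mentioned above; the EXPECTED schema and the check helper are hypothetical names, not a library feature, and this still only catches problems at runtime:

    import polars as pl

    # Pin the expected schema next to the code...
    EXPECTED = {"name": pl.String, "age": pl.Int64}

    def check(df: pl.DataFrame) -> pl.DataFrame:
        # ...and verify it at function boundaries, since nothing tracks it statically.
        if dict(df.schema) != EXPECTED:
            raise TypeError(f"schema mismatch: {dict(df.schema)} != {EXPECTED}")
        return df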
But to do it right you'd need a pretty good type system because these applications implicitly use a lot of isomorphisms between different mathematical objects. The current solution is just to ignore types and treat everything as a bag of floats with some shape. If you start tracking types you need a way to handle these isomorphisms.
Many also add an additional layer of data validation on top of schema validation, using frameworks like Great Expectations [1]. For example, it's not enough to know 'age' is an integer; it should be an integer in the range 0..150.
Disclaimer: I work for Hopsworks.
[1] https://docs.greatexpectations.io/docs/core/introduction/gx_...
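The range-check idea, expressed here with a plain Polars filter rather than the Great Expectations API (data is illustrative):

    import polars as pl

    df = pl.DataFrame({"age": [25, 30, 200]})

    # Schema validation only says "age" is an integer; this extra layer
    # also requires it to fall in the range 0..150.
    out_of_range = df.filter(~pl.col("age").is_between(0, 150))
    if out_of_range.height > 0:
        raise ValueError(f"{out_of_range.height} rows have 'age' outside 0..150")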
The "diagonal scaling" approach seems particularly clever - dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.
I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.
The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.
Anyway, I would love to apply for the early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.
The other is that we focus only on Polars and honor the Polars semantics and data model. Switching backends via Ibis doesn't honor this, as many engines have different semantics regarding NaNs, missing data and their ordering, decimal arithmetic behavior, regex engines, type upcasting, overflow, etc.
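One concrete example of those semantic differences is Polars' distinction between missing values (null) and floating-point NaN, which other engines may conflate:

    import polars as pl

    s = pl.Series("x", [1.0, float("nan"), None])

    print(s.null_count())    # 1 -> only the None counts as missing
    print(s.is_nan().sum())  # 1 -> only the NaN is NaN (the null is ignored)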
And lastly, we will ensure it works seamlessly with the Polars landscape; that means Polars plugins and IO plugins will also be first-class citizens.
I did this, rather than use Snowflake, because our custom Python "user defined functions" that process the data are not deployable on Snowflake out of the gate, and the ergonomics of shipping custom code to Modal are great, so I'm willing to pay a bit more complexity to ship data to Modal in exchange for these great dev ergonomics.
All of that is to say: what does it look like to have custom python code running on my polars cloud in a distributed fashion? Is that a solved problem?
polars cloud might have an advantage here since they're optimizing directly around polars' own rust-based engine. i've done a fair bit of work lately using polars locally (huge fan of the lazy api), and if they can translate that speed and ergonomics smoothly into the cloud, it could be a real winner. the downside is obviously potential lock-in, but if it makes my day-to-day data wrangling faster, it might be worth the tradeoff.
curious to see benchmarks soon against dask, ray, and spark for some heavy analytics workloads.
I am admittedly a tough sell when the workstation under my desk has 192GB of RAM.
If you have a very beefy desktop machine and no giant datasets, there isn't a strong reason to use Polars Cloud.
Are you a data scientist running a Polars data pipeline against a subsampled dataset in a notebook on your laptop? By changing just a couple of lines of code, you can run that same pipeline against your full dataset on a beefy cloud machine which is automatically spun up and spun down for you. If you have so much data that one machine doesn't cut it, you can start running distributed.
In a nutshell, the pitch is very similar to Dask/Ray/Spark, except that it's Polars. A lot of our users say that they came for the speed but stayed for the API, and with Polars Cloud they can use our API and semantics on the cloud. No need to translate it to Dask/Ray/Spark.
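As a rough illustration of the "couple of lines of code" point with plain Polars (the file paths are made up, and the actual Polars Cloud submission call is not shown here):

    import polars as pl

    def pipeline(lf: pl.LazyFrame) -> pl.LazyFrame:
        # The transformation logic is identical regardless of where it runs.
        return (
            lf.filter(pl.col("amount") > 0)
            .group_by("customer_id")
            .agg(pl.col("amount").sum().alias("total"))
        )

    # Notebook run against a subsample:
    sample = pipeline(pl.scan_parquet("sample.parquet")).collect()

    # Full run: the pipeline function is reused unchanged; only the scan
    # target and the execution/submission step differ.
    # full = pipeline(pl.scan_parquet("s3://my-bucket/full/*.parquet"))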
This is exactly how I would describe my experience. When I talk to others about Polars now, I usually quickly mention that it's fast up front, but then mostly talk about how the API, its composability, its small surface area, etc. are really what make it great to work with. Having these same semantics backed by eager execution, a query-optimized lazy API, a streaming engine, a GPU engine, and now a distributed auto-magical ephemeral-boxes-in-the-sky engine just makes it that much better of a tool.
I think the best selling point speaks to your workstation size: just start with vanilla Polars. It'll work great for ages, and if you do need to scale, you can use Polars Cloud.
That solves what I see as one of the big issues with a lot of these types of projects, which is the really poor performance at smaller sizes. Practically, that means you end up using completely different frameworks depending on data size, which is a big hassle if you ever need to rewrite in either direction.
And sure, databricks has an idle shutdown feature, but suppose it takes ~6 hours to process the deal report, and only the first hour needs the scaled up power to compute one table, and the rest of the jobs only need 1/10th the mem and cores. Polars could save these firms a lot of money.
I've been quite impressed with the Polars team and after using Pandas for years, Polars feels like a much needed fresh wind. Very excited to give this a go sometime soon!
Polars itself is FOSS and will remain FOSS.
Self-hosted/on-site Polars Cloud is something we intend to develop, as there is quite a bit of demand, but it is unlikely to be FOSS. It will most likely involve licensing of some sort. Ultimately we do have to make money, and we intend to do that through Polars Cloud, self-hosted or not (as well as through other ventures such as offering training, commercial support, etc.).
Would love to test this out and do benchmarks against us / Dask / Spark / Ray etc., which have been our primary testing ground. Full disclosure: I work at Bodo, which has similar-ish aspirations (https://github.com/bodo-ai/Bodo), but FOSS all the way.
People complain about R, but compared to the multitude of import lines and unergonomic APIs in Pandas, R always felt like living in the future.
Polars is a much, much saner API, but expressions are very clunky for doing basic computation. Or at least I can't find anything less clunky than pl.col("x") or pl.lit(2), where in R it's just x or 2.
Still, I'm using Python a ton more now that polars has enough steam for others to be able to understand the code.
IMO R is really slept on because it's limited to certain corners of academia, and that makes it seem scary and outdated to compsci folks. It's really a lovely language for data analysis.
In many cases you can pass a string or numeric literal to a Polars function instead of pl.col (e.g. select()/group_by()).
Overall I agree it's less convenient than dplyr in the cases where pl.col is required, sure, but it's not terrible, and it has the benefit of making the code less ambiguous, which reduces bugs.
For anything production though, I just stick to pl.col and pl.lit as it's widely used.
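For example (illustrative frame):

    import polars as pl

    df = pl.DataFrame({"x": [1, 2, 3], "grp": ["a", "a", "b"]})

    # Plain strings work where a bare column reference is expected:
    df.select("x")
    df.group_by("grp").agg(pl.col("x").sum())

    # Inside an expression, the explicit forms are needed:
    df.select((pl.col("x") + pl.lit(2)).alias("x_plus_2"))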
C.blah
Though, I guess they're not on this site :')
Pandas supports so many use cases and is still more feature-rich than Polars. But you always have the polars.DataFrame.to_pandas() function in your back pocket, so realistically you can always at least start with Polars.
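A minimal sketch of that escape hatch (requires pandas, and pyarrow for the conversion):

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3]})

    pdf = df.to_pandas()       # drop down to pandas for a pandas-only feature
    # ... use any pandas-based API on pdf here ...
    df2 = pl.from_pandas(pdf)  # and come back to Polars afterwards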
It looks like their offering runs on the same cloud provider as the client, so no bandwidth fees. Right now it looks to be AWS, but mentions Azure/GCP/self-hosted.
The Polars name and a hint to the .rs file extension.