I'm aware of Pandera [1], which supports Polars as well, but while nice, it doesn't cause the code to fail to compile; it only fails at runtime. To me this is the Achilles' heel of analysis in both Python and R.
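For instance, a minimal sketch using Pandera's pandas-flavored API (the column names are just illustrative):

    import pandas as pd
    import pandera as pa

    # Schema says the frame must have float "ask" and "bid" columns.
    schema = pa.DataFrameSchema({
        "ask": pa.Column(float),
        "bid": pa.Column(float),
    })

    df = pd.DataFrame({"asks": [1.1], "bid": [1.0]})  # "ask" misspelled as "asks"
    schema.validate(df)  # raises a SchemaError, but only once this line actually runs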
Does anybody have ideas on how this situation could be improved?
https://www.empirical-soft.com
It can infer the column names and types from a CSV file at compile time.
Here's an example that misspells the "ask" column as if it were plural:
let quotes = load("quotes.csv")
sort quotes by (asks - bid) / bid
The error is caught before the script is run: Error: symbol asks was not found
I had to use a lot of computer-science techniques to get this working, like type providers and compile-time function evaluation. I'm really proud of the novelty of it and even won Y Combinator's Startup School grant for it. Unfortunately, it didn't go anywhere as a project. Turns out that static typing isn't enough of a selling point for people to drop Python. I haven't touched Empirical in four years, but my code and my notes are still publicly available on the website.
I love how you really expanded on the idea of executing code at compile time. You should be proud.
You probably already know this, but for people like me to switch, "all" it would take would be:
1. A plotting library like ggplot2 or plotnine
2. A machine learning library, like scikit
3. A dashboard framework like streamlit or shiny
4. Support for Empirical in my cloud workspace environment, which is Jupyter-based and where I have to execute all the code, because that's where the data is and has to stay due to security
Just like how Polars is written in Rust and has Python bindings, I wonder if there's a market for 1 and 2 written in Rust and then having bindings to Python, Empirical, R, Julia etc. I feel like 4 is just a matter of time if Empirical becomes popular, but I think 3 would have to be implemented specifically for Empirical.
I think the idea of statically typed dataframes is really useful and you were ahead of your time. Maybe one day the time will be right.
There is no magic here. No language can guess the type of anything without seeing what the thing is.
A tantalising idea i have not explored, is to try and hook up polars' lazy query planner to a static typing plugin. The planner already has basically complete knowledge of the schema at every point, right?
So in theory this could be used to give the really good inference abilities that a static typing system needs to be nice to use.
Otherwise, it feels so broken to just pass a dataframe around. It’s like typing everything as a “dict” and hoping for the best. It’s awful.
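As a rough sketch of what the planner already knows (assuming a recent Polars version where LazyFrame.collect_schema() is available):

    import polars as pl

    lf = (
        pl.LazyFrame({"bid": [1.0], "ask": [1.1]})
        .with_columns(spread=(pl.col("ask") - pl.col("bid")) / pl.col("bid"))
    )

    # The output schema of the plan is known without executing anything:
    print(lf.collect_schema())  # bid, ask and spread with their dtypes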
let df = df![
"name" => ["Alice", "Bob", "Charlie"],
"age" => [25, 30, 35]
]?;
let ages = df.column("age")?;
There's no Rust type-level knowledge of what type the "age" or "name" column is, for example. The result of df.column is a Series, which has to be cast to a Rust type based on the developer's knowledge of what the column is expected to contain. You can do things like this:
let oldies = df.filter(&df.column("age")?.gt(30)?)?;
So the casting can be automatic, but this will fail at runtime if the age column doesn't contain numeric values. One type-related feature that Polars does have: because the contents of a Series are represented as a typed Rust value, all values in a Series must have the same type. This is a constraint compared to traditional dataframes, but it provides a performance benefit when processing large series. You can cast an entire Series to a typed Rust value efficiently, and then operate on the result in a typed fashion.
But as you said, you can’t use Python libraries directly with Polars dataframes. You’d need conversion and foreign function interfaces. If you need that, you’d probably be better off just using Python.
And yeah, my big "???" with all of these is: lacking dependent typing or an equivalent for row types, it's hard for mypy and friends to statically track which columns exist and what types they have. And even if we're willing to be explicit about wrapping each DF with a manual definition, basically an Arrow schema, I don't think any of these libraries make that convenient? (And is that natively supported by any?)
In louie.ai, we generate Python for users, so we can have it generate the types as well... But we haven't found a satisfactory library for that so far...
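A minimal sketch of the "manual definition" workaround mentioned above; the EXPECTED schema and the check helper are hypothetical names, not a library feature, and this still only catches problems at runtime:

    import polars as pl

    # Pin the expected schema next to the code...
    EXPECTED = {"name": pl.String, "age": pl.Int64}

    def check(df: pl.DataFrame) -> pl.DataFrame:
        # ...and verify it at function boundaries, since nothing tracks it statically.
        if dict(df.schema) != EXPECTED:
            raise TypeError(f"schema mismatch: {dict(df.schema)} != {EXPECTED}")
        return df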
But to do it right you'd need a pretty good type system because these applications implicitly use a lot of isomorphisms between different mathematical objects. The current solution is just to ignore types and treat everything as a bag of floats with some shape. If you start tracking types you need a way to handle these isomorphisms.
Many also add an additional layer of data validation on top of schema validation, using frameworks like Great Expectations [1]. For example, it's not enough to know 'age' is an integer; it should be an integer in the range 0..150.
Disclaimer: I work for Hopsworks.
[1] https://docs.greatexpectations.io/docs/core/introduction/gx_...
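The range-check idea, expressed here with a plain Polars filter rather than the Great Expectations API (data is illustrative):

    import polars as pl

    df = pl.DataFrame({"age": [25, 30, 200]})

    # Schema validation only says "age" is an integer; this extra layer
    # also requires it to fall in the range 0..150.
    out_of_range = df.filter(~pl.col("age").is_between(0, 150))
    if out_of_range.height > 0:
        raise ValueError(f"{out_of_range.height} rows have 'age' outside 0..150")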
The "diagonal scaling" approach seems particularly clever - dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.
I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.
The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.
Anyway, I would love to apply for the early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.
The other is that we focus only on Polars and honor the Polars semantics and data model. Switching backends via Ibis doesn't honor this, as many engines have different semantics regarding NaNs, missing data and their ordering, decimal arithmetic behavior, regex engines, type upcasting, overflow, etc.
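One concrete example of those semantic differences is Polars' distinction between missing values (null) and floating-point NaN, which other engines may conflate:

    import polars as pl

    s = pl.Series("x", [1.0, float("nan"), None])

    print(s.null_count())    # 1 -> only the None counts as missing
    print(s.is_nan().sum())  # 1 -> only the NaN is NaN (the null is ignored)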
And lastly, we will ensure it works seamlessly with the Polars landscape; that means Polars plugins and IO plugins will also be first-class citizens.
I did this, rather than use Snowflake, because our custom Python "user defined functions" that process the data are not deployable on Snowflake out of the gate, and the ergonomics of shipping custom code to Modal are great, so I'm willing to pay a bit more complexity to ship data to Modal in exchange for these great dev ergonomics.
All of that is to say: what does it look like to have custom python code running on my polars cloud in a distributed fashion? Is that a solved problem?
polars cloud might have an advantage here since they're optimizing directly around polars' own rust-based engine. i've done a fair bit of work lately using polars locally (huge fan of the lazy api), and if they can translate that speed and ergonomics smoothly into the cloud, it could be a real winner. the downside is obviously potential lock-in, but if it makes my day-to-day data wrangling faster, it might be worth the tradeoff.
curious to see benchmarks soon against dask, ray, and spark for some heavy analytics workloads.
I am admittedly a tough sell when the workstation under my desk has 192GB of RAM.
If you have a very beefy desktop machine and no giant datasets, there isn't a strong reason to use Polars Cloud.
Are you a data scientist running a Polars data pipeline against a subsampled dataset in a notebook on your laptop? By changing just a couple of lines of code, you can run that same pipeline against your full dataset on a beefy cloud machine which is automatically spun up and spun down for you. If you have so much data that one machine doesn't cut it, you can start running distributed.
In a nutshell, the pitch is very similar to Dask/Ray/Spark, except that it's Polars. A lot of our users say that they came for the speed but stayed for the API, and with Polars Cloud they can use our API and semantics on the cloud. No need to translate it to Dask/Ray/Spark.
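As a rough illustration of the "couple of lines of code" point with plain Polars (the file paths are made up, and the actual Polars Cloud submission call is not shown here):

    import polars as pl

    def pipeline(lf: pl.LazyFrame) -> pl.LazyFrame:
        # The transformation logic is identical regardless of where it runs.
        return (
            lf.filter(pl.col("amount") > 0)
            .group_by("customer_id")
            .agg(pl.col("amount").sum().alias("total"))
        )

    # Notebook run against a subsample:
    sample = pipeline(pl.scan_parquet("sample.parquet")).collect()

    # Full run: the pipeline function is reused unchanged; only the scan
    # target and the execution/submission step differ.
    # full = pipeline(pl.scan_parquet("s3://my-bucket/full/*.parquet"))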
This is exactly how I would describe my experience. When I talk to others about Polars now, I usually quickly mention that it's fast up front, but then mostly talk about how the API, its composability, its small surface area, etc. are really what make it great to work with. Having these same semantics backed by eager execution, a query-optimized lazy API, a streaming engine, a GPU engine, and now a distributed auto-magical ephemeral-boxes-in-the-sky engine just makes it that much better of a tool.
I think the best selling point speaks to your workstation size: just start with vanilla Polars. It'll work great for ages, and if you do need to scale, you can use Polars Cloud.
That solves what I see as one of the big issues with a lot of these types of projects, which is the really poor performance at smaller sizes. Practically, that means you end up using completely different frameworks depending on data size, which is a big hassle if you ever need to rewrite in either direction.
And sure, databricks has an idle shutdown feature, but suppose it takes ~6 hours to process the deal report, and only the first hour needs the scaled up power to compute one table, and the rest of the jobs only need 1/10th the mem and cores. Polars could save these firms a lot of money.
I've been quite impressed with the Polars team and after using Pandas for years, Polars feels like a much needed fresh wind. Very excited to give this a go sometime soon!
Polars itself is FOSS and will remain FOSS.
Self-hosted/on-site Polars Cloud is something we intend to develop, as there is quite a bit of demand, but it is unlikely to be FOSS. It will most likely involve licensing of some sort. Ultimately we do have to make money, and we intend to do that through Polars Cloud, self-hosted or not (as well as through other ventures such as offering training, commercial support, etc.).
Would love to test this out and do benchmarks against us / Dask / Spark / Ray etc., which have been our primary testing ground. Full disclosure: I work at Bodo, which has similar-ish aspirations (https://github.com/bodo-ai/Bodo), but FOSS all the way.
People complain about R, but compared to the multitude of import lines and unergonomic APIs in Pandas, R always felt like living in the future.
Polars is a much, much saner API, but expressions are very clunky for doing basic computation. Or at least I can't find anything less clunky than pl.col("x") or pl.lit(2), where in R it's just x or 2.
Still, I'm using Python a ton more now that polars has enough steam for others to be able to understand the code.
IMO R is really slept on because it's limited to certain corners of academia, and that makes it seem scary and outdated to compsci folks. It's really a lovely language for data analysis.
In many cases you can pass a string or numeric literal to a Polars function instead of pl.col (e.g. select()/group_by()).
Overall I agree it's less convenient than dplyr in the cases where pl.col is required, sure, but it's not terrible, and it has the benefit of making the code less ambiguous, which reduces bugs.
For anything production though, I just stick to pl.col and pl.lit as it's widely used.
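For example (illustrative frame):

    import polars as pl

    df = pl.DataFrame({"x": [1, 2, 3], "grp": ["a", "a", "b"]})

    # Plain strings work where a bare column reference is expected:
    df.select("x")
    df.group_by("grp").agg(pl.col("x").sum())

    # Inside an expression, the explicit forms are needed:
    df.select((pl.col("x") + pl.lit(2)).alias("x_plus_2"))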
C.blah
Though, I guess they're not on this site :')
Pandas supports so many use cases and is still more feature-rich than Polars. But you always have the polars.DataFrame.to_pandas() function in your back pocket, so realistically you can always at least start with Polars.
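A minimal sketch of that escape hatch (requires pandas, and pyarrow for the conversion):

    import polars as pl

    df = pl.DataFrame({"a": [1, 2, 3]})

    pdf = df.to_pandas()       # drop down to pandas for a pandas-only feature
    # ... use any pandas-based API on pdf here ...
    df2 = pl.from_pandas(pdf)  # and come back to Polars afterwards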
It looks like their offering runs on the same cloud provider as the client, so no bandwidth fees. Right now it looks to be AWS, but mentions Azure/GCP/self-hosted.
The Polars name and a hint to the .rs file extension.