It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back.
For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place.
Prepare some data:

import pandas as pd
import polars as pl

df_pandas = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
df_polars = pl.from_pandas(df_pandas)

And then overwrite a slice of a column in place:

df_pandas.loc[1:3, 'b'] += 1
df_pandas
a b
0 1 10
1 2 21
2 3 31
3 4 41
4 5 50
Polars comes from a more modern data engineering philosophy, where data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column:

df_polars = df_polars.with_columns(
pl.when(pl.int_range(0, pl.len()).is_between(1, 3))
.then(pl.col("b") + 1)
.otherwise(pl.col("b"))
.alias("b")
)
If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be temporarily convert the data to Pandas or even to a NumPy array, manipulate it, and then convert back, as sketched below.
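A minimal sketch of that round trip, reusing df_polars from the example above (the .copy() is there because the exported NumPy array may be read-only):

# Round trip through pandas: convert, mutate in place, convert back.
tmp = df_polars.to_pandas()
tmp.loc[1:3, 'b'] += 1
df_polars = pl.from_pandas(tmp)

# Or round trip a single column through NumPy (slice 1:4 here, since NumPy slices exclude the end).
b = df_polars['b'].to_numpy().copy()
b[1:4] += 1
df_polars = df_polars.with_columns(pl.Series('b', b))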
P.S. Polars has an optimization to overwrite a single value:

df_polars[4, 'b'] += 5
df_polars
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 10 │
│ 2 ┆ 21 │
│ 3 ┆ 31 │
│ 4 ┆ 41 │
│ 5 ┆ 55 │
└─────┴─────┘
But as far as I know, it doesn't allow slicing or anything.

Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary.
Pandas created the modern Python data stack when there were not really any alternatives (except R and closed-source tools). The original split-apply-combine paradigm was well thought out, simple, and effective, and the built-in tools to read pretty much anything (including all of your awful CSV files and Excel tables) and deal with timestamps easily made it fit into tons of workflows. It pioneered a lot, and it basically still serves as the foundation and common format for the industry.
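To illustrate the split-apply-combine idea for anyone who hasn't used it, a minimal sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.3, 5.1, 5.9],
})

# Split into groups, apply aggregations, combine into one result frame.
summary = df.groupby('species')['petal_length'].agg(['mean', 'max'])
print(summary)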
I always recommend every member of my teams read Modern Pandas by Tom Augspurger when they start, as it covers all the modern concepts you need to get data work done fast and with high quality. The concepts carry over to polars.
And I have to thank the pandas team for being a very open and collaborative bunch. They’re humble and smart people, and every PR or issue I’ve interacted with them on has been great.
Polars is undeniably great software, and it’s my standard tool today. But they did benefit from the failures and hard edges of pandas, pyspark, dask, the tidyverse, and xarray. It’s an advantage pandas didn’t have, and one it still pays for.
I’m not trying to take away from polars at all. It’s damn fast — the benchmarks are hard to beat. I’ve been working on my own library and basically every optimization I can think of is already implemented in polars.
I do have a concern with their VC funding/commercialization with cloud. The core library is MIT licensed, but knowing they’ll always have this feature wall when you want to scale is not ideal. I think it limits the future of the library a lot, and I think long term someone will fill that niche and the users will leave.
Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind.
I maintain one of those libraries and everything is polars internally.
Where I certainly disagree is with the "frame as a dict of time series" setting, and with general time series analysis.
The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round.
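To make the "frame as a dict of time series" point above concrete, a small sketch of the pandas idiom, with invented column names and dates:

import pandas as pd

# Each column is a time series over a shared DatetimeIndex.
idx = pd.date_range('2024-01-01', periods=6, freq='h')
prices = pd.DataFrame({'AAPL': [1, 2, 3, 4, 5, 6], 'MSFT': [6, 5, 4, 3, 2, 1]}, index=idx)

# The index drives alignment, rolling windows, and resampling.
rolling = prices['AAPL'].rolling(3).mean()
two_hourly = prices.resample('2h').last()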
Like jQuery, which hasn’t fundamentally changed since I was a wee lad doing web dev. They didn’t make major changes despite their approach to web dev being replaced by newer concepts found in Angular, Backbone, Mustache, and eventually React. And that is a good thing.
What I personally don’t want is something like Angular, which radically changed between 1.0 and 2.0. Might as well just call 2.0 something new.
Note: I’d never heard of Polars until this comment thread. Can’t wait to try it out.
I work with chemical datasets, and this always involves converting SMILES strings to RDKit Molecule objects. Polars cannot do this as simply as calling .map on a pandas Series.
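For context, this is roughly the pandas one-liner being referred to; a hedged sketch that assumes a 'smiles' column and RDKit installed:

import pandas as pd
from rdkit import Chem

df = pd.DataFrame({'smiles': ['CCO', 'c1ccccc1', 'CC(=O)O']})

# pandas columns can hold arbitrary Python objects, so .map is enough.
df['mol'] = df['smiles'].map(Chem.MolFromSmiles)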
Pandas is also much better for EDA, so calling it worse in every instance is not true. If you are doing pure data manipulation, then go ahead with Polars.
They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source.
import polars as pl
from concurrent.futures import ProcessPoolExecutor

pl.DataFrame({"a": [1,2,3], "b": [4,5,6]}).write_parquet("test.parquet")

def read_parquet():
    x = pl.read_parquet("test.parquet")
    print(x.shape)

# Can hang when the executor forks the already multi-threaded parent (the default on Linux).
with ProcessPoolExecutor() as executor:
    futures = [executor.submit(read_parquet) for _ in range(100)]
    r = [f.result() for f in futures]
Using a thread pool or the "spawn" start method works, but it makes Polars a pain to use inside e.g. a PyTorch DataLoader.

import polars as pl
pl.DataFrame({"a": [1, 2, 3]}).write_parquet("test.parquet")
def print_shape(df: pl.DataFrame) -> pl.DataFrame:
print(df.shape)
return df
lazy_frames = [
pl.scan_parquet("test.parquet")
.map_batches(print_shape)
for _ in range(100)
]
pl.collect_all(lazy_frames, comm_subplan_elim=False)
(comm_subplan_elim is important.)

However, this is not a Polars issue. Using "fork" can leave ANY mutex in the forked child process invalid (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and assumes that none of the libraries in your process hold a lock at that time. That's not an assumption for PyTorch dataloaders to make.
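For completeness, a hedged sketch of the "spawn" workaround mentioned above, using only standard-library APIs (the worker function must live at module level so the spawned interpreter can import it):

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

import polars as pl

def read_parquet():
    x = pl.read_parquet("test.parquet")
    print(x.shape)

if __name__ == "__main__":
    # Start fresh interpreters instead of forking the multi-threaded parent.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(mp_context=ctx) as executor:
        futures = [executor.submit(read_parquet) for _ in range(100)]
        results = [f.result() for f in futures]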
Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?
OT, but I can’t imagine data science being a job category for too long. It’s got to be one of the first to go in the AI age, especially since the market is so saturated with mediocre talent.
This is interesting. I wanted to dig into it a little since I am not sure I am following the logic of that statement.
Do you mean that AI would take over the field, because by default most people there are already not producing anything that a simple 'talk to data' LLM won't deliver?
I used to work on teams where DS would put a ton of time into building quality models, gating production with defensible metrics. Now, my DS counterparts are writing prompts and calling it a day. I'm not at all convinced that the results are better, but I guess if you don't spend time (=money) on the work, it's hard to argue with the ROI?
It's funny to look back at the tricks that were needed to get gpt3 and 3.5 to write SQL (e.g. "you are a data analyst looking at a SQL database with table [tables]"). It's almost effortless now.
I have integrated Explorer https://github.com/elixir-explorer/explorer, which leverages it, into many Elixir apps, so happy to have this.
Like you said, perhaps the demise of phind was inevitable, with large models displacing them kind of like how Spotify displaced music piracy.
Is this everyone's experience?
From there, of course, you slowly start to learn about types etc., and slowly you start to appreciate libraries and IDEs. But I knew tables, and statistics and graphs, and Pandas (with the visual style of Notebooks) led me to programming via that familiar world. At first with some frustration about Pandas and needing to write to Excel, do stuff, and read it again, but quickly moving into the opposite flow, where Excel itself became the limiting factor and I was annoyed when having to use it.
I offered some "Programming for Biologists" courses, to teach people like me to do programming in this way, because it would be much less "dry" (pd.read.excel().barplot() and now you're programming). So far, wherever I offered the courses they said they prefer to teach programming "from the base up". Ah well! I've been told I'm not a programmer, I don't care. I solve problems (and that is the only way I am motivated enough to learn, I can't sit down solving LeetCode problems for hours, building exactly nothing).
(To be clear, I now do the Git, the Vim, the CI/CD, the LLM, the Bash, The Linux, the Nix, the Containers... Just like a real programmer, my journey was just different, and suited me well, I believe others can repeat my journey and find joy in programming, via a different route.)