close

DEV Community

Herbert Tzekian
Herbert Tzekian

Posted on

When pandas runs out of RAM, I change one import line

You know the moment. You've got a Parquet file or a fat CSV, you write the obvious pd.read_parquet(...), and Python sits there. Then either it finishes in its own sweet time, or you get the message everyone who's done data work in Python has seen at least once:

MemoryError: Unable to allocate ... for an array
Enter fullscreen mode Exit fullscreen mode

The file is bigger than your RAM, pandas wants the whole thing in memory at once, and your laptop says no.

I used to solve this by chunking, or by giving up and rewriting the whole notebook in SQL against some database just to run one GROUP BY. Both are annoying, and the second one means I now maintain two versions of the same analysis. What I do now is change one import line and keep writing the exact same pandas I already had. No server, no connection string, and crucially no SQL, because I never leave the DataFrame API.

The one-line change

The library is chDB. It's ClickHouse compiled as an in-process engine you pip install, basically the ClickHouse query engine as a Python library, with nothing to run in Docker and nothing to connect to. If you want the internals, it's ClickHouse compiled as an in-process engine, but the part that changed my day-to-day is its datastore module.

pip install chdb
Enter fullscreen mode Exit fullscreen mode
# Change this:
import pandas as pd
# To this:
import chdb.datastore as pd
# Everything else stays the same.
Enter fullscreen mode Exit fullscreen mode

That's it. Your existing pandas code keeps working, with the same method names, the same arguments, and the same DataFrames coming back. The one thing that changes is how you point at a file too big to open: instead of pd.read_parquet, you hand the file to a DataStore, which streams it instead of loading it whole. From there it's the pandas API you already know. The difference underneath is that operations are lazy and compile down to the ClickHouse engine, so nothing executes until you actually need a result (a print(), a len(), plotting). The giant file never gets slurped into RAM in one go.

The query I'd otherwise have crashed on

Here's a real example. Events Parquet, I want the top user actions in the last 30 days. This is the textbook read_parquet (boom) followed by a filter and a groupby. Written as plain pandas, except it doesn't fall over:

import chdb.datastore as pd

events = pd.DataStore.from_file("events.parquet")

recent = events[events["event_date"] > "2026-05-21"]
top = (recent
       .groupby("action")
       .size()
       .sort_values(ascending=False)
       .head(20))

print(top)
Enter fullscreen mode Exit fullscreen mode

That's the same code I'd have written against pandas. I didn't drop into SQL, I didn't chunk anything, I didn't think about memory. Because it's lazy and columnar underneath, reading three columns out of forty doesn't pay for the other thirty-seven, and the filter gets pushed down into the file scan so rows I don't want never get decoded. That's the difference between "loads 12 GB" and "reads the slice you asked for."

It's the same pandas all the way through

You don't get a cut-down subset that handles groupby and bails on everything else. The whole DataFrame API is here: filtering, column selection, computed columns, the string and datetime accessors, joins:

import chdb.datastore as pd

reqs = pd.DataStore.from_file("requests.parquet")

reqs = reqs.assign(hour=reqs["ts"].dt.floor("h"))   # datetime accessor, computed column
hourly = (reqs
          .groupby("hour")["latency_ms"]
          .mean()
          .sort_index())

hourly.plot()   # tiny result now, plots like any pandas Series
Enter fullscreen mode Exit fullscreen mode

The big file got crunched down to a few hundred rows before anything had to fit in memory.

Joining across files (and databases) without leaving pandas

The bit I didn't expect to use as much as I do: the same DataFrame can be backed by a file or a remote source, and you join them with a normal .join(). So a Parquet file on disk and a table in Postgres or MySQL line up in one expression.

import chdb.datastore as pd

orders = pd.DataStore.from_file("orders.parquet")
customers = pd.DataStore.uri("mysql://root:pass@db:3306/crm/customers")

by_country = (orders
              .join(customers, left_on="customer_id", right_on="id")
              .groupby("country")["amount"]
              .sum()
              .sort_values(ascending=False))

print(by_country)
Enter fullscreen mode Exit fullscreen mode

It also reads CSV, TSV, JSON and friends the same way, so the same approach works for a giant CSV you'd otherwise be afraid to open, and for Parquet as shown above.

Why this became my default

The honest reason I keep using it isn't the speed, though it is faster. It's that I stopped maintaining two versions of my analysis.

Before, a notebook experiment used pandas, and then if it needed to run for real on the full dataset it got rewritten into SQL, against some warehouse, with a different dialect and a deployment story. With chdb.datastore the exploration code is the production code. It's pandas while I'm poking at a file on my laptop, and it's still pandas when the file is too big to open the normal way. I scale up by changing what the DataFrame points at, not by rewriting it.

So my rule of thumb now: if a file is small and friendly, sure, regular pandas is fine. The second it's big enough to make Python sweat, I swap the import line. The code doesn't change, it doesn't run out of RAM, and the throwaway version turns out not to be throwaway.

Next time you hit MemoryError, try changing import pandas as pd to import chdb.datastore as pd and rerun the cell. Usually that's the whole fix.

Top comments (0)