I have a parquet folder created with dask that contains multiple files of about 100MB each. When I load the dataframe with df = dask.dataframe.read_parquet(path_to_parquet_folder) and run any sort of computation (such as df.describe().compute()), my kernel crashes.
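For reference, a minimal sketch of the workflow that triggers the crash (path_to_parquet_folder is just a placeholder for the actual directory):
import dask.dataframe as dd
# the folder contains multiple parquet files of about 100MB each
df = dd.read_parquet('path_to_parquet_folder')
df.describe().compute()  # the kernel dies somewhere in here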
Things I have noticed (a diagnostics sketch follows the list):
- CPU usage (about 100%) indicates that multithreading is not used
- memory usage shoots way past the size of a single file
- the kernel crashes after system memory usage approaches 100%
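For what it's worth, here is one way to confirm these observations with dask's own diagnostics rather than the system monitor (a sketch; the path is a placeholder, and I use a single partition because that is a computation that does complete, see the note at the end):
from dask.diagnostics import ResourceProfiler
import dask.dataframe as dd
df = dd.read_parquet('path_to_parquet_folder')  # placeholder path
# sample memory and CPU every 0.5 seconds while computing one partition
with ResourceProfiler(dt=0.5) as rprof:
    df.get_partition(0).compute()
# rprof.results holds the (time, mem, cpu) samples taken during the computation
print(rprof.results[-3:])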
EDIT:
I tried to create a reproducible example, without success, but I discovered some other oddities, seemingly all related to the newer pandas dtypes that I'm using:
import pandas as pd
from dask.diagnostics import ProgressBar
ProgressBar().register()
from dask.diagnostics import ResourceProfiler
rprof = ResourceProfiler(dt=0.5)  # sample resources every 0.5s (not used further in this snippet)
import dask.dataframe as dd
# generate a dataframe with 3 different nullable dtypes and 2*n rows
n = 10000000
test = pd.DataFrame({
    1: pd.Series(['a', pd.NA] * n, dtype=pd.StringDtype()),
    2: pd.Series([1, pd.NA] * n, dtype=pd.Int64Dtype()),
    3: pd.Series([0.56, pd.NA] * n, dtype=pd.Float64Dtype())
})
dd_df = dd.from_pandas(test, npartitions = 2) # convert to dask df
dd_df.to_parquet('test.parquet') # save as parquet directory
dd_df = dd.read_parquet('test.parquet') # load files back
dd_df.mean().compute() # compute something
dd_df.describe().compute() # compute something
dd_df.count().compute() # compute something
dd_df.max().compute() # compute something
Output, respectively:
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
Kernel appears to have died.
KeyError: "None of [Index(['2', '1', '3'], dtype='object')] are in the [columns]"
It seems that the dtypes are preserved even through the parquet IO, but dask has trouble actually doing anything with these columns.
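A quick way to check the first part of that claim, assuming the test.parquet directory written by the example above:
import dask.dataframe as dd
dd_df = dd.read_parquet('test.parquet')
# the nullable extension dtypes come back after the round trip,
# yet the reductions above fail on exactly these columns
print(dd_df.dtypes)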
Python version: 3.9.7
dask version: 2021.11.2
df.get_partition(0).compute() runs without problems; the memory usage of a single partition is about 500MB.