dask index not behaving like a column (and not like in pandas)

Question

While filing this bug report: https://github.com/dask/dask/issues/8319 I ran into trouble with a workaround for the problem below. Since the underlying problem seems out of scope for that report, I'm asking about it here:

import pandas as pd
import dask.dataframe  # "import dask" alone does not load the dataframe submodule

# some example dataframe
df = pd.DataFrame([{"a": "A", "b": "B"}, {"a": "@", "b": "β"}, {"a": "Aa", "b": "Bb"}, {"a": "aa", "b": "bb"}])

# pandas version
df2 = df.set_index("a")
df2[df2.index.str.endswith("a")]
# this works: pandas accepts any boolean array of the right length, even one with no index to align on

# dask version
ddf = dask.dataframe.from_pandas(df, npartitions=2)
ddf2 = ddf.set_index("a")

# this works with a regular column
ddf2[ddf2.b.str.endswith("b")].compute()
# selects the rows where column b ends with "b"

# indices don't behave like columns
ddf2[ddf2.index.str.endswith("a")].compute()
# TypeError: '<' not supported between instances of 'bool' and 'str'

I'm not sure whether this is a bug in dask or something inherently impossible: once the data is split across multiple partitions, dask may not know how to map an index-derived mask onto them. (It does work fine inside map_partitions, since there you are just working on pandas dataframes.)

Is there something I'm missing, or is this deeply ingrained in dask and not easily fixed?

Michael Delgado · Accepted Answer · 2021-11-03 01:26:55Z


Related: "BUG: Dask dataframe cannot handle string index" (#3269) and "Better handling for arrays/series of keys in dask.dataframe.loc" (#8254), both open.

I think the current fix is to build the boolean series and compute it before using it to index into the DataFrame. This raises a warning, but it seems to do the trick on this example:

In [19]: ddf2[ddf2.index.to_series().str.endswith('a').compute()].compute()
/.../lib/python3.9/site-packages/dask/dataframe/core.py:3703: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  meta = self._meta[_extract_meta(key)]
/.../lib/python3.9/site-packages/dask/core.py:121: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*(_execute_task(a, cache) for a in args))
Out[19]:
     b
a
Aa  Bb
aa  bb


3 Comments

Joran Dox Over a year ago

Thanks for the response and the links! This is indeed almost the workaround I used (see also the linked issue; in the latest dask you don't need the compute on the index), but I was mainly looking for the "correct" way. It seems there currently isn't one, so my workaround was already the most "correct" option. I'll leave this question open for a while in hopes of attracting more answers, but if nothing happens I'll accept this for the linked bug & pull request.

Michael Delgado Over a year ago

Yeah, I think the fact that the linked issue has been marked as a bug by the Dask team, and that they're working on a PR to fix it, is a good indication that there's not really a "correct" way to do it... yet ;)

Joran Dox Over a year ago

github.com/dask/dask/pull/8254 is merged now :D
