0

In this bug report: https://github.com/dask/dask/issues/8319 I had an issue with a workaround for the following. As it seems out of scope for that bug report, I'll ask the initial problem here:

import pandas as pd
import dask

# some example dataframe
df = pd.DataFrame([{"a": "A", "b": "B"}, {"a": "@", "b": "β"}, {"a": "Aa", "b": "Bb"}, {"a": "aa", "b": "bb"}])

# pandas version
df2 = df.set_index("a")
df2[df2.index.str.endswith("a")]
# this works, as pandas allows an "array" of the right length regardless of having the same index

# dask version
ddf = dask.dataframe.from_pandas(df, npartitions=2)
ddf2 = ddf.set_index("a")

# this works with a regular column
ddf2[ddf2.b.str.endswith("b")].compute()
# selects the rows where column b ends with "b"

# indices don't behave like columns
ddf2[ddf2.index.str.endswith("a")].compute()
# TypeError: '<' not supported between instances of 'bool' and 'str'

I'm not sure if this is a bug in dask, or just something impossible in dask since, once you're using multiple partitions, you can't know how to map an index on the partitions. (except this works fine in map_partitions as you're just working on pandas dataframes then)

Is there something I'm missing or is this something deeply ingrained in dask that can't be easily fixed?

1 Answer 1

1

related: BUG: Dask dataframe cannot handle string index #3269 and Better handling for arrays/series of keys in dask.dataframe.loc #8254 (both open).

I think the current fix is to create a boolean series and compute the result before using it to index into the DataFrame. This raises a warning but it seems to do the trick on this example:

In [19]: ddf2[ddf2.index.to_series().str.endswith('a').compute()].compute()
/.../lib/python3.9/site-packages/dask/dataframe/core.py:3703: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  meta = self._meta[_extract_meta(key)]
/.../lib/python3.9/site-packages/dask/core.py:121: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*(_execute_task(a, cache) for a in args))
Out[19]:
     b
a
Aa  Bb
aa  bb
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks for the response & the links! This is indeed almost the workaround I used (see also the linked issue, in the latest dask you don't need the compute on the index), but I was mainly looking for the "correct" way. It seems there currently is no correct way and my workaround was already the most "correct" way then. I'll leave this question open for a while in hopes of attracting more answers but if nothing happens I'll accept this for the linked bug & pull request.
yeah I think the fact that the linked issue has been marked a bug by the Dask team and they're working on a PR to fix it is a good indication that there's not really a "correct" way to do it... yet ;)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.