In this bug report (https://github.com/dask/dask/issues/8319) I ran into a problem with a workaround for the behaviour below. Since it seems out of scope for that report, I'm asking the original question here:
import pandas as pd
import dask.dataframe  # plain "import dask" does not make dask.dataframe available
# some example dataframe
df = pd.DataFrame([{"a": "A", "b": "B"}, {"a": "@", "b": "β"}, {"a": "Aa", "b": "Bb"}, {"a": "aa", "b": "bb"}])
# pandas version
df2 = df.set_index("a")
df2[df2.index.str.endswith("a")]
# this works: pandas accepts any boolean array of the right length, regardless of whether it carries a matching index
# dask version
ddf = dask.dataframe.from_pandas(df, npartitions=2)
ddf2 = ddf.set_index("a")
# this works with a regular column
ddf2[ddf2.b.str.endswith("b")].compute()
# selects the rows where column b ends with "b"
# indices don't behave like columns
ddf2[ddf2.index.str.endswith("a")].compute()
# TypeError: '<' not supported between instances of 'bool' and 'str'
I'm not sure whether this is a bug in dask, or simply something impossible in dask: once the data is split over multiple partitions, perhaps there is no way to map an operation over the index across them. (Except that this works fine inside map_partitions, since there you're operating on plain pandas dataframes.)
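For reference, this is the map_partitions workaround I mean — a sketch of the same filter applied per partition, where the index is an ordinary pandas Index:

import pandas as pd
import dask.dataframe

df = pd.DataFrame([{"a": "A", "b": "B"}, {"a": "@", "b": "β"}, {"a": "Aa", "b": "Bb"}, {"a": "aa", "b": "bb"}])
ddf2 = dask.dataframe.from_pandas(df, npartitions=2).set_index("a")

# filter inside each partition, where .index is a plain pandas Index
result = ddf2.map_partitions(lambda part: part[part.index.str.endswith("a")]).compute()
# result contains the rows indexed "Aa" and "aa"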
Is there something I'm missing, or is this so deeply ingrained in dask's design that it can't easily be fixed?