12

I want to find an alternative of pandas.dataframe.sort_value function in dask.
I came through set_index, but it would sort on a single column.

How can I sort multiple columns of Dask data frame?

1

1 Answer 1

10

So far Dask does not seem to support sorting by multiple columns. However, making a new column that concatenates the values of the sorted columns may be a usable work-around.

d['new_column'] = d.apply(lambda r: str([r.col1,r.col2]), axis=1)
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())

Edit: The above works if you want to sort by two strings. I recommend creating integer (or bytes) columns and then using struct.pack to create a new composite bytes column. For example, if col1_dt is a datetime and col2 is an integer:

import struct

# create a timedelta with seconds resolution. 
# i know this is the resolution is correct
d['col1_int'] = ((d['col1_dt'] -
                  d['col1_dt'].min())/np.timedelta64(1,'s')
                ).astype(int)

d['new_column'] = d.apply(lambda r: struct.pack("ll",r.col1_int,r.col2))
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())
Sign up to request clarification or add additional context in comments.

1 Comment

why do you recommend creating integer or bytes column? Does that make the sorting faster?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.