
I have bank account records where each row is the monthly balance of an account:

Acct.Number   Date             Balance
1111          2020-03-31        1000
1111          2020-04-30        1300
1111          2020-05-31        1100
1111          2020-06-30        1400
2222          2020-03-31       31000
2222          2020-04-30       32000
2222          2020-05-31       31100
2222          2020-06-30       31300
 .                .               .
 .                .               .
9999          2020-03-31        8000
9999          2020-04-30        8800
9999          2020-05-31        8100
9999          2020-06-30        8500

Assume there are 10 million accounts and 10 years of data. What I need to do is apply a function to each account.

For example, for each account I need to take the mean balance over the 10 years and calculate the difference between each date's balance and that mean. For account 1111 it will look like this (the mean is 1200):

Acct.Number   Date             Balance     Difference
1111          2020-03-31        1000       -200
1111          2020-04-30        1300        100
1111          2020-05-31        1100       -100
1111          2020-06-30        1400        200

This is my thinking: once I have the data in multiple Dask partitions, where each partition contains N accounts, I can use map_partitions to process the accounts in each partition in parallel.

This works only if each account, with all its dates, sits in a single partition. If I let Dask split the data into partitions arbitrarily, that is not guaranteed. Is there a way to instruct Dask what data should be included in each partition?
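
For example, something like this sketch of the per-partition function (the names are mine; ddf stands for the Dask dataframe):

    import pandas as pd

    def add_difference(part: pd.DataFrame) -> pd.DataFrame:
        # Per-account mean within this partition; only correct when every
        # account's rows live entirely in the same partition.
        means = part.groupby("Acct.Number")["Balance"].transform("mean")
        return part.assign(Difference=part["Balance"] - means)

    result = ddf.map_partitions(add_difference)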

As a side note, I create the Dask dataframe and its partitions with Dask's read_sql_query. The data is ordered by account/date.


1 Answer


Since the data is sorted by account, specifying index_col="Acct.Number" when loading the dataframe with the read_sql_query function will make sure that each partition covers a known range of account numbers. It is still possible for one account's details to span multiple partitions, but operations on these partitions will be faster, because Dask will be aware of their contents.
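
A rough sketch of what that could look like (the table definition, connection string, and npartitions are placeholders, not taken from the question):

    import dask.dataframe as dd
    import sqlalchemy as sa

    metadata = sa.MetaData()
    balances = sa.Table(                       # placeholder table definition
        "balances", metadata,
        sa.Column("Acct.Number", sa.String),
        sa.Column("Date", sa.Date),
        sa.Column("Balance", sa.Numeric),
    )

    ddf = dd.read_sql_query(
        sa.select(balances),
        con="postgresql://user:pass@host/db",  # placeholder connection string
        index_col="Acct.Number",
        npartitions=100,                       # placeholder
    )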


Comments

Sultan, the data is ordered by Account Number + Date. Can I set the index to multiple fields, something like index_col="Acct.Number,Date"? I also need the data ordered by date inside the map_partitions function.
Also, what if I create the dataframe with dd.from_pandas? I don't see an index_col parameter on that function.
Unfortunately, index_col accepts only a single column right now. dd.from_pandas will use the index information of the pandas dataframe, so if your pandas dataframe is indexed by Acct.Number, then the Dask dataframe should be as well (but I think it won't work for multi-indexed dataframes)...
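
For example (a sketch; pdf stands for your pandas dataframe):

    import dask.dataframe as dd

    # A sorted single-column index lets Dask compute the divisions.
    pdf = pdf.set_index("Acct.Number").sort_index()
    ddf = dd.from_pandas(pdf, npartitions=16)  # npartitions chosen arbitrarily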
I'm getting TypeError: Provided index column is of type "object". If divisions is not provided the index column type must be numeric or datetime. I don't want to provide the divisions; I want Dask to calculate them for me. I guess I will have to create the divisions manually, because the account number is a string, not a number.
Hmmm, it should be possible to convert the string to a number within SQL (I'm not familiar with the syntax).
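
For example, something along these lines might work (an untested sketch; balances and uri are the assumed SQLAlchemy table and connection string from the answer above):

    import sqlalchemy as sa

    # Cast the string account number to an integer inside the query so that
    # Dask can compute divisions on a numeric index column.
    query = sa.select(
        sa.cast(balances.c["Acct.Number"], sa.BigInteger).label("acct_num"),
        balances.c["Date"],
        balances.c["Balance"],
    )
    ddf = dd.read_sql_query(query, con=uri, index_col="acct_num")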
