
I have bank account records where each row is the monthly balance of an account:

Acct.Number   Date             Balance
1111          2020-03-31        1000
1111          2020-04-30        1300
1111          2020-05-31        1100
1111          2020-06-30        1400
2222          2020-03-31       31000
2222          2020-04-30       32000
2222          2020-05-31       31100
2222          2020-06-30       31300
 .                .               .
 .                .               .
9999          2020-03-31        8000
9999          2020-04-30        8800
9999          2020-05-31        8100
9999          2020-06-30        8500

Assume there are 10 million accounts and 10 years of data. What I need to do is apply a function to each account.

For example, for each account I need to take the mean balance over the 10 years and calculate the difference between each date's balance and that mean. For account 1111 it will look like this (the mean is 1200):

Acct.Number   Date             Balance     Difference
1111          2020-03-31        1000       -200
1111          2020-04-30        1300        100
1111          2020-05-31        1100       -100
1111          2020-06-30        1400        200

This is my thinking: once I have the data in multiple Dask partitions, where each partition contains N accounts, I can use map_partitions to process the accounts in each partition in parallel.

This works only if each account, with all its dates, sits in a single partition. If I let Dask split the data into partitions arbitrarily, that is not guaranteed. Is there a way to instruct Dask what data should be included in each partition?
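
For example, something like this sketch of the per-partition function (the names are mine; ddf stands for the Dask dataframe):

    import pandas as pd

    def add_difference(part: pd.DataFrame) -> pd.DataFrame:
        # Per-account mean within this partition; only correct when every
        # account's rows live entirely in the same partition.
        means = part.groupby("Acct.Number")["Balance"].transform("mean")
        return part.assign(Difference=part["Balance"] - means)

    result = ddf.map_partitions(add_difference)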

As a side note, I create the Dask dataframe and its partitions with Dask's read_sql_query. The data is ordered by account/date.


1 Answer


Since the data is sorted by account, specifying index_col="Acct.Number" when loading the dataframe with the read_sql_query function will make sure that each partition covers a known range of account numbers. It is still possible for one account's details to span multiple partitions, but operations on these partitions will be faster, because Dask will be aware of their contents.
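
A rough sketch of what that could look like (the table definition, connection string, and npartitions are placeholders, not taken from the question):

    import dask.dataframe as dd
    import sqlalchemy as sa

    metadata = sa.MetaData()
    balances = sa.Table(                       # placeholder table definition
        "balances", metadata,
        sa.Column("Acct.Number", sa.String),
        sa.Column("Date", sa.Date),
        sa.Column("Balance", sa.Numeric),
    )

    ddf = dd.read_sql_query(
        sa.select(balances),
        con="postgresql://user:pass@host/db",  # placeholder connection string
        index_col="Acct.Number",
        npartitions=100,                       # placeholder
    )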


Comments

Sultan, the data is ordered by Account Number + Date. Can I set the index to multiple fields, something like index_col="Acct.Number,Date"? I also need the data ordered by date inside the map_partitions function.
Also, what if I create the dataframe with dd.from_pandas? I don't see an index_col parameter on that function.
Unfortunately, index_col accepts only a single column right now. dd.from_pandas will use the index information of the pandas dataframe, so if your pandas dataframe is indexed by Acct.Number, then the Dask dataframe should be as well (but I think it won't work for multi-indexed dataframes)...
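
For example (a sketch; pdf stands for your pandas dataframe):

    import dask.dataframe as dd

    # A sorted single-column index lets Dask compute the divisions.
    pdf = pdf.set_index("Acct.Number").sort_index()
    ddf = dd.from_pandas(pdf, npartitions=16)  # npartitions chosen arbitrarily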
I'm getting TypeError: Provided index column is of type "object". If divisions is not provided the index column type must be numeric or datetime. I don't want to provide the divisions; I want Dask to calculate them for me. I guess I will have to create the divisions manually, because the account number is a string, not a number.
Hmmm, it should be possible to convert the string to a number within SQL (I'm not familiar with the syntax).
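
For example, something along these lines might work (an untested sketch; balances and uri are the assumed SQLAlchemy table and connection string from the answer above):

    import sqlalchemy as sa

    # Cast the string account number to an integer inside the query so that
    # Dask can compute divisions on a numeric index column.
    query = sa.select(
        sa.cast(balances.c["Acct.Number"], sa.BigInteger).label("acct_num"),
        balances.c["Date"],
        balances.c["Balance"],
    )
    ddf = dd.read_sql_query(query, con=uri, index_col="acct_num")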
