Calculate sum based on multiple rows from list column for each row in pandas dataframe

Question

I have a dataframe that looks something like this:

df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
    
    id  col_to_sum  list_col
0   0   0.557736    []
1   1   0.147333    [1]
2   2   0.538681    [1, 2, 3]
3   3   0.040329    [2]
4   4   0.984439    [3, 1]

In reality I have more columns and ~30000 rows but the extra columns are irrelevant for this. Note that all the list elements are from the id column and that the id column is not necessarily the same as the index.

I want to make a new column that for each row sums the values in col_to_sum corresponding to the ids in list_col. In this example that would be:

    id  col_to_sum  list_col    sum
0   0   0.557736    []          0.000000
1   1   0.147333    [1]         0.147333
2   2   0.538681    [1, 2, 3]   0.726343
3   3   0.040329    [2]         0.538681
4   4   0.984439    [3, 1]      0.187662

I have found a way to do this, but it loops over the entire dataframe and is quite slow on the larger df (~6 min for ~30000 rows). The way I found was:

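df['sum'] = 0.0  # float, so the float sums fit the column dtype

for i in range(len(df)):
    # rows whose id appears in the list of the i-th row
    mask = df['id'].isin(df['list_col'].iloc[i])
    # df.index[i] turns the position i into a label, so this also works with a non-default index
    df.loc[df.index[i], 'sum'] = df.loc[mask, 'col_to_sum'].sum()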

Ideally I would want a vectorized way to do this but I haven't been able to do it. Any help is greatly appreciated.

Cool question. I don't care for the randomness in the example. Next time please set a seed or, better yet, use non-random values for which the desired operation (here: sum) is easy to check in one's head.
– timgeb, Commented Feb 24, 2022 at 17:52

timgeb · Accepted Answer · 2022-02-24 18:00:42Z

3

I'm using non-random values in this demo because they're easier to reproduce and check.

I'm also using an id column ([0, 1, 3, 2, 4]) that is not identical to the index.

Setup:

>>> df = pd.DataFrame({'id': [0, 1, 3, 2, 4], 'col_to_sum': [1, 2, 3, 4, 5], 'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})
>>> df
   id  col_to_sum   list_col
0   0           1         []
1   1           2        [1]
2   3           3  [1, 2, 3]
3   2           4        [2]
4   4           5     [3, 1]

Solution:

df = df.set_index('id')
df['sum'] = df['list_col'].apply(lambda l: df.loc[l, 'col_to_sum'].sum())
df = df.reset_index()

Output:

>>> df
   id  col_to_sum   list_col  sum
0   0           1         []    0
1   1           2        [1]    2
2   3           3  [1, 2, 3]    9
3   2           4        [2]    4
4   4           5     [3, 1]    5
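
The apply call above still runs one Python-level lookup per row. As a possible alternative, here is a minimal sketch that pushes the work into pandas operations with explode, map, and groupby. It is not part of the original answer, and it assumes that the ids are unique and that the dataframe has a unique index (the default RangeIndex here). The names lookup, exploded, and sums are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Lookup table from id to value (assumption: ids are unique).
lookup = df.set_index('id')['col_to_sum']

# One entry per referenced id, labelled with the original row index.
# An empty list turns into a single NaN entry.
exploded = df['list_col'].explode()

# Replace each referenced id by its value, then add the values up per original row.
# NaN entries are skipped by sum, so rows with an empty list end up with 0.
sums = exploded.map(lookup).groupby(level=0).sum()

# Assignment aligns on the index; the result is a float column.
df['sum'] = sums
print(df)
```

Ids that appear in a list but not in the id column also map to NaN here and are therefore ignored in the sum. Whether this is faster than apply on a given dataset is worth measuring rather than assuming.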

edited Feb 24, 2022 at 18:00

answered Feb 24, 2022 at 17:46

timgeb


6 Comments

Anders G. Over a year ago

Thank you very much. This is working as I intended. I have since found out that there actually are elements in the lists that are not in the id col and so this throws an error: "Passing list-likes to .loc or [] with any missing labels is no longer supported". Any way to salvage?

timgeb Over a year ago

@AndersG. try:

df = df.set_index('id'); df['sum'] = df['list_col'].apply(lambda l: df.reindex(l)['col_to_sum'].sum()); df = df.reset_index()

Anders G. Over a year ago

I get "ValueError: cannot reindex from a duplicate axis"

timgeb Over a year ago

@AndersG. seems like you forgot to mention that the id column can contain duplicates.

Anders G. Over a year ago

Ahh, that makes sense. It shouldn't, though, so I have to look into that. Thank you!
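
For readers who run into the same two problems (list elements missing from the id column, and duplicated ids), here is a minimal sketch of one way to handle both. It is an assumption about what is wanted, not something confirmed in this thread: unknown ids are treated as contributing 0, and duplicated ids raise an error early. The id 99 and the names dupes and lookup are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    # 99 is a hypothetical id that does not occur in the id column
    'list_col': [[], [1, 99], [1, 2, 3], [2], [3, 1]],
})

# Fail early with a readable message if any id occurs more than once.
dupes = df.loc[df['id'].duplicated(keep=False), 'id'].unique()
if len(dupes) > 0:
    raise ValueError(f"duplicate ids found: {dupes.tolist()}")

# Plain dict from id to value.
lookup = dict(zip(df['id'], df['col_to_sum']))

# dict.get falls back to 0 for ids that are not in the id column.
df['sum'] = df['list_col'].apply(lambda ids: sum(lookup.get(i, 0) for i in ids))
print(df)
```

If duplicated ids are legitimate in the data, the check would have to be replaced by an explicit rule, for example summing col_to_sum per id before building the lookup.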


ArchAngelPwn · Answer · 2022-02-24 17:52:27Z

0

You can apply a lambda function to list_col that treats each list as row positions, selects those positions from col_to_sum with iloc, and sums them:

df['sum_col'] = df['list_col'].apply(lambda x : df['col_to_sum'].iloc[x].sum())

edited Feb 24, 2022 at 17:52

answered Feb 24, 2022 at 17:31

ArchAngelPwn


2 Comments

timgeb Over a year ago

Note OP's statement "the id column is not necessarily the same as the index", so simply using iloc won't work here.

ArchAngelPwn Over a year ago

@timgeb Oh, I see it now. You are correct! Thank you for the clarification!
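
To make the point in these comments concrete, here is a small sketch comparing the positional iloc lookup with a label-based lookup on the id column. It reuses the non-random demo frame in which id is [0, 1, 3, 2, 4], and the names by_position, values_by_id, and by_id are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 3, 2, 4],
    'col_to_sum': [1, 2, 3, 4, 5],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Positional lookup: list elements are read as row positions.
by_position = df['list_col'].apply(lambda x: df['col_to_sum'].iloc[x].sum())

# Label lookup: list elements are read as ids.
values_by_id = df.set_index('id')['col_to_sum']
by_id = df['list_col'].apply(lambda x: values_by_id.loc[x].sum())

print(pd.DataFrame({'by_position': by_position, 'by_id': by_id}))
```

The two columns agree only in the rows where the referenced ids happen to sit at matching positions; the rows with lists [2] and [3, 1] differ (3 vs 4 and 6 vs 5).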
