1

I have a dataframe that looks something like this:

df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
    
    id  col_to_sum  list_col
0   0   0.557736    []
1   1   0.147333    [1]
2   2   0.538681    [1, 2, 3]
3   3   0.040329    [2]
4   4   0.984439    [3, 1]

In reality I have more columns and ~30000 rows but the extra columns are irrelevant for this. Note that all the list elements are from the id column and that the id column is not necessarily the same as the index.

I want to make a new column that for each row sums the values in col_to_sum corresponding to the ids in list_col. In this example that would be:

    id  col_to_sum  list_col    sum
0   0   0.557736    []          0.000000
1   1   0.147333    [1]         0.147333
2   2   0.538681    [1, 2, 3]   0.726343
3   3   0.040329    [2]         0.538681
4   4   0.984439    [3, 1]      0.187662

I have found a way to do this but it requires looping through the entire dataframe and is quite slow on the larger df with ~30000 rows (~6 min). The way I found was

df['sum'] = 0

for i in range(len(df)):
    mask = df['id'].isin(df['list_col'].iloc[i])
    df.loc[i, 'sum'] = df.loc[mask, 'col_to_sum'].sum()

Ideally I would want a vectorized way to do this but I haven't been able to do it. Any help is greatly appreciated.

1
  • Cool question. I don't care for the randomness in the example. Next time please set a seed or better yet, use non-random values for which the desired operation (here: sum) is easy to check in ones head. Commented Feb 24, 2022 at 17:52

2 Answers 2

3

I'm using non-random values in this demo because they're easier to reproduce and check.

I'm also using an id-column ([0, 1, 3, 2, 4]) that is not identical to the index.

Setup:

>>> df = pd.DataFrame({'id': [0, 1, 3, 2, 4], 'col_to_sum': [1, 2, 3, 4, 5], 'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})
>>> df
   id  col_to_sum   list_col
0   0           1         []
1   1           2        [1]
2   3           3  [1, 2, 3]
3   2           4        [2]
4   4           5     [3, 1]

Solution:

df = df.set_index('id')
df['sum'] = df['list_col'].apply(lambda l: df.loc[l, 'col_to_sum'].sum())
df = df.reset_index()

Output:

>>> df
   id  col_to_sum   list_col  sum
0   0           1         []    0
1   1           2        [1]    2
2   3           3  [1, 2, 3]    9
3   2           4        [2]    4
4   4           5     [3, 1]    5
Sign up to request clarification or add additional context in comments.

6 Comments

Thank you very much. This is working as I intended. I have since found out that there actually are elements in the lists that are not in the id col and so this throws an error: "Passing list-likes to .loc or [] with any missing labels is no longer supported". Any way to salvage?
@AndersG. try df = df.set_index('id'); df['sum'] = df['list_col'].apply(lambda l: df.reindex(l)['col_to_sum'].sum()); df = df.reset_index().
I get "ValueError: cannot reindex from a duplicate axis"
@AndersG. seems like you forgot to mention that the id column can contain duplicates.
Ahh, that makes sense. It shouldn't, though, so I have to look into that. Thank you!
|
0

You can use a lambda function that will let you use the list_col and find the iloc of the corresponding list_col to summarize

df['sum_col'] = df['list_col'].apply(lambda x : df['col_to_sum'].iloc[x].sum())

2 Comments

Note OP's statement "the id column is not necessarily the same as the index", so simply using iloc won't work here.
@timgeb Oh I see it now you are correct! thank you for the clarification!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.