I have a dataframe that looks something like this:
df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
   id  col_to_sum   list_col
0   0    0.557736         []
1   1    0.147333        [1]
2   2    0.538681  [1, 2, 3]
3   3    0.040329        [2]
4   4    0.984439     [3, 1]
In reality I have more columns and ~30000 rows, but the extra columns are irrelevant here. Note that every list element is a value from the id column, and that the id column is not necessarily the same as the index.
I want to make a new column that, for each row, sums the values in col_to_sum for the rows whose id appears in that row's list_col. In this example that would be:
   id  col_to_sum   list_col       sum
0   0    0.557736         []  0.000000
1   1    0.147333        [1]  0.147333
2   2    0.538681  [1, 2, 3]  0.726343
3   3    0.040329        [2]  0.538681
4   4    0.984439     [3, 1]  0.187662
I have found a way to do this, but it loops over the entire dataframe and is quite slow on the full df with ~30000 rows (~6 min). The way I found was:
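df['sum'] = 0.0  # float column, so the sums are not assigned into an int column
for i in range(len(df)):
    # rows whose id appears in the list of the row at position i
    mask = df['id'].isin(df['list_col'].iloc[i])
    # df.index[i] is the label at position i, in case the index is not 0..n-1
    df.loc[df.index[i], 'sum'] = df.loc[mask, 'col_to_sum'].sum()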
Ideally I would like a vectorized way to do this, but I haven't been able to find one. Any help is greatly appreciated.
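For what it's worth, one possible loop-free route is sketched here. It is only a sketch under stated assumptions: the ids are unique, the index has no duplicate labels, and the fixed numbers are copied from the example table instead of drawn with np.random.rand so the result can be checked. The names lookup and exploded are just illustrative. The idea is to explode list_col into one row per listed id, map each id to its col_to_sum value, and sum back per original row.

```python
import pandas as pd

# Fixed values copied from the example table so the result is reproducible.
df = pd.DataFrame({
    'id': range(5),
    'col_to_sum': [0.557736, 0.147333, 0.538681, 0.040329, 0.984439],
    'list_col': [[], [1], [1, 2, 3], [2], [3, 1]],
})

# Lookup from id to its col_to_sum value (assumes ids are unique).
lookup = df.set_index('id')['col_to_sum']

# One entry per (original row, listed id); the original index label is repeated.
# An empty list becomes a single NaN entry.
exploded = df['list_col'].explode()

# Map each listed id to its value, then sum per original index label.
# NaN entries coming from empty lists are skipped by sum, which yields 0.0.
df['sum'] = exploded.map(lookup).groupby(level=0).sum()

print(df)
```

One behavioural difference from the isin loop: if a list repeats an id, this sketch counts that value once per occurrence, whereas isin counts it only once. If that matters, the exploded entries would need de-duplicating per row first.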