Creating a new column in Pandas based on another dataframe

Question

I need to add a column to an existing pandas dataframe based on an attribute from a second dataframe. I've made a minimal example to illustrate my exact requirements.

I've got two dataframes, one representing pairs of names, and the other representing an interaction between two individuals:

>>> names
    id_a   id_b
0    ben   jack
1   jack    ben
2   jill   amir
3  wilma   jill
4   amir  wilma

>>> interactions
  individual1 individual2
0        jill        jack
1        jack        jill
2       wilma        jill
3        amir        jill
4        amir        jack
5        jack        amir
6        jill        amir

What I need is essentially this: for each pair of names in names, I need a count of the interactions between those two people, i.e. the number of rows in interactions where one of interactions['individual1'] / interactions['individual2'] equals names['id_a'] and the other equals names['id_b'], in either order. This count should go in a column num_interactions for every row in names, even when a pair appears twice in reversed order: if one row has id_a ben and id_b jack and another row has id_a jack and id_b ben, num_interactions should be filled in for both of those rows.

The resulting dataframe would look like this:

>>> names
    id_a   id_b  num_interactions
0    ben   jack               0.0
1   jack    ben               0.0
2   jill   amir               2.0
3  wilma   jill               1.0
4   amir  wilma               0.0

What I've Done

This works just fine, but it's ugly, hard to read, inefficient, and I know there must be a better way! Maybe with some sort of merge, but I don't really know how that works with complicated criteria...

for i in range(len(names)):
    names.loc[i, 'num_interactions'] = len(
        interactions[((interactions['individual1'] == names.loc[i, 'id_a']) &
                      (interactions['individual2'] == names.loc[i, 'id_b'])) |
                     ((interactions['individual2'] == names.loc[i, 'id_a']) &
                      (interactions['individual1'] == names.loc[i, 'id_b']))
                     ])

To Reproduce my example dataframes

In case you want to play around with this, you can use this to reproduce my dummy dataframes above.

import pandas as pd
names = pd.DataFrame(data={'id_a': ['ben', 'jack', 'jill', 'wilma', 'amir'],
                           'id_b': ['jack', 'ben', 'amir', 'jill', 'wilma']})

interactions = pd.DataFrame(data={'individual1': ['jill', 'jack',
                                                  'wilma', 'amir',
                                                  'amir', 'jack', 'jill'],
                                  'individual2': ['jack', 'jill', 'jill',
                                                  'jill', 'jack', 'amir',
                                                  'amir']})

Thanks in advance!

something like names['num_interactions'] = interactions.groupby(['individual1','individual2']).transform('count') ?
– Lucas Dresl, Commented Dec 20, 2017 at 18:23
@LucasDresl That won't handle the varying order amongst pairs, unfortunately.
– cs95, Commented Dec 20, 2017 at 18:27
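
The ordering problem raised in the comment above can be seen on the example data. This is only a small illustration, assuming the interactions dataframe from the question; the names ordered and unordered are made up for the sketch.

```python
import pandas as pd

interactions = pd.DataFrame({
    'individual1': ['jill', 'jack', 'wilma', 'amir', 'amir', 'jack', 'jill'],
    'individual2': ['jack', 'jill', 'jill', 'jill', 'jack', 'amir', 'amir'],
})

# Grouping on the raw columns treats (jill, jack) and (jack, jill) as two different keys.
ordered = interactions.groupby(['individual1', 'individual2']).size()
print(ordered.loc[('jill', 'jack')], ordered.loc[('jack', 'jill')])

# An order-insensitive count needs a key that ignores direction, e.g. a frozenset.
pairs = [frozenset(p) for p in zip(interactions['individual1'], interactions['individual2'])]
unordered = pairs.count(frozenset({'jill', 'jack'}))
print(unordered)
```

Each direction is counted separately by the plain groupby, while the direction-free key counts both rows together.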

cs95 · Accepted Answer · 2017-12-20 18:42:55Z

1

Assuming order doesn't matter, you can sort the values within each row of both dataframes, so that every pair is stored in the same order. For the second dataframe, count each group of interactions with groupby + count, then perform a left outer merge of the first dataframe with the result.

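import numpy as np
import pandas as pd

# Sort the two names within each row so every pair has one canonical order.
i = pd.DataFrame(np.sort(names, axis=1))
j = pd.DataFrame(np.sort(interactions, axis=1))

# Count how often each sorted pair occurs in interactions.
k = j.groupby(j.columns.tolist())[0].count().reset_index(name='count')

# The left merge keeps every row of names; pairs without interactions get 0.
df = i.merge(k, on=[0, 1], how='left')\
      .fillna(0)\
      .rename(columns={0 : 'id_a', 1 : 'id_b'})
# Put the original, unsorted names back.
df.iloc[:, :2] = names.values

df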

    id_a   id_b  count
0    ben   jack    0.0
1   jack    ben    0.0
2   jill   amir    2.0
3  wilma   jill    1.0
4   amir  wilma    0.0
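
A variation on the same canonical-order idea, sketched with a plain Counter instead of a merge. This is not part of the answer above; it assumes the example dataframes from the question, and pair_counts is a name made up for the sketch. It leaves id_a and id_b untouched and produces integer counts.

```python
from collections import Counter

import pandas as pd

names = pd.DataFrame({'id_a': ['ben', 'jack', 'jill', 'wilma', 'amir'],
                      'id_b': ['jack', 'ben', 'amir', 'jill', 'wilma']})
interactions = pd.DataFrame({
    'individual1': ['jill', 'jack', 'wilma', 'amir', 'amir', 'jack', 'jill'],
    'individual2': ['jack', 'jill', 'jill', 'jill', 'jack', 'amir', 'amir'],
})

# Count interactions per pair, using the sorted pair as an order-free key.
pair_counts = Counter(
    tuple(sorted(pair))
    for pair in zip(interactions['individual1'], interactions['individual2'])
)

# Look up each row of names; a Counter returns 0 for pairs it has never seen.
names['num_interactions'] = [
    pair_counts[tuple(sorted(pair))]
    for pair in zip(names['id_a'], names['id_b'])
]
print(names)
```

For large inputs the merge-based version is likely to scale better, since this sketch loops over the rows in Python.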

edited Dec 20, 2017 at 18:42

answered Dec 20, 2017 at 18:27

cs95


4 Comments

sacuL Over a year ago

Thanks! This is close, but I should have been clearer: in the names dataframe, if there is a "jack-ben" and a "ben-jack" row, each of them needs to be included, so I can't drop the duplicates and sort. I'll clarify in my original post.

cs95 Over a year ago

@sacul See my edit? I've fixed the issue... it was pretty simple actually.

sacuL Over a year ago

Yeah, that works! Still trying to mess around with your strategy as a base, so that I can end up with a dataframe in which the id_a and id_b are not modified.

cs95 Over a year ago

@sacul If you settle on something you're satisfied with, feel free to make an edit to my answer, and smash that accept button when you feel your question has been answered. Good luck!

mm441 · Accepted Answer · 2017-12-21 08:49:06Z

1

Sorry, the way I add new columns is a bit ugly, but you can get the idea and improve on it. First, I assume that all pairs in names are unique, so I give each pair an ID:

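import numpy as np
import pandas as pd

# Build one label per pair in both orientations ("a-b" and "b-a").
names_ids = pd.DataFrame(pd.concat([names.iloc[:, 0] + '-' + names.iloc[:, 1],
                                    names.iloc[:, 1] + '-' + names.iloc[:, 0]],
                                   axis=0),
                         columns=['pair'])
# Both orientations keep the row number of the original pair as their id.
names_ids['id'] = names_ids.index
names_ids.index = names_ids.pair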

Then I join these IDs with the interactions, where I again flip each pair:

# Label every interaction in both orientations as well.
interactions_new = pd.DataFrame(pd.concat([interactions.iloc[:, 0] + '-' + interactions.iloc[:, 1],
                                           interactions.iloc[:, 1] + '-' + interactions.iloc[:, 0]],
                                          axis=0),
                                columns=['pair'])
interactions_new['count'] = np.ones(len(interactions_new))

# Attach the pair id to each interaction, then count the rows per id.
count_id = interactions_new.join(names_ids['id'], on='pair', how='left').groupby('id').count().loc[:, ['count']]
count_id['id'] = count_id.index

So in the end I just count each id in interactions:

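names_ids.index = names_ids.id
# Join on the index (the id) with the first half of names_ids, i.e. the un-flipped pairs.
# Joining on the index avoids the ambiguity of 'id' being both an index name and a column.
result = count_id.join(names_ids.pair.iloc[:len(names_ids) // 2], how='left')
# Every interaction was counted in both orientations, so halve the counts.
result['count'] //= 2
print(result)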

Ugly but no for loops and I get:

     count   id        pair
id                         
2.0      2  2.0   jill-amir
3.0      1  3.0  wilma-jill
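
The flip-both-orientations idea can also be sketched with a merge on the name columns directly, without string labels. This is an illustrative variant, not the answerer's code; the names flipped, both and counts are made up here. It assumes the example dataframes from the question and that nobody interacts with themselves (such a row would be counted twice).

```python
import pandas as pd

names = pd.DataFrame({'id_a': ['ben', 'jack', 'jill', 'wilma', 'amir'],
                      'id_b': ['jack', 'ben', 'amir', 'jill', 'wilma']})
interactions = pd.DataFrame({
    'individual1': ['jill', 'jack', 'wilma', 'amir', 'amir', 'jack', 'jill'],
    'individual2': ['jack', 'jill', 'jill', 'jill', 'jack', 'amir', 'amir'],
})

# Swap the two columns so each interaction also appears in the reverse direction.
flipped = interactions.rename(columns={'individual1': 'individual2',
                                       'individual2': 'individual1'})
both = pd.concat([interactions, flipped], ignore_index=True)

# Count rows per directed pair; each interaction now contributes once per direction.
counts = (both.groupby(['individual1', 'individual2']).size()
              .rename('num_interactions').reset_index())

# The left merge keeps every row of names, including reversed duplicates.
result = names.merge(counts, how='left',
                     left_on=['id_a', 'id_b'],
                     right_on=['individual1', 'individual2'])
result = result.drop(columns=['individual1', 'individual2'])
result['num_interactions'] = result['num_interactions'].fillna(0).astype(int)
print(result)
```

Because the lookup is per directed pair, no halving step is needed, and rows of names without any interactions stay in the result with a count of 0.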

edited Dec 21, 2017 at 8:49

answered Dec 20, 2017 at 18:55

mm441


2 Comments

sacuL Over a year ago

Cool, this is a different way to approach it than I had thought, but it's effective. Only thing to think about is that as you flip the pairs, the counts end up doubled (in my original example, the count for jill-amir should have been 2, and the count for wilma-jill should have ended up at 1, your values are double this)

mm441 Over a year ago

@sacul fixed :)
