2

I need to add a column to an existing pandas dataframe based on an attribute from a second dataframe. I've made a minimal example to illustrate my exact requirements.

I've got two dataframes, one representing pairs of names, and the other representing an interaction between two individuals:

    >>> names
    id_a   id_b
0    ben   jack
1   jack    ben
2   jill   amir
3  wilma   jill
4   amir  wilma

>>> interactions
  individual1 individual2
0        jill        jack
1        jack        jill
2       wilma        jill
3        amir        jill
4        amir        jack
5        jack        amir
6        jill        amir

What I need is essentially this: for each pair of names in names, I need a count of the number of interactions between those two names, so the number of rows in interactions in which names['id_a'] is either interactions['individual1'] or interactions['individual2'] AND names['id_b'] is either interactions['individual1'] or interactions['individual2']. This count needs to be included in a column num_interactions for all rows in names, even if the names are duplicate (i.e. if there is a row in which id_a is ben and id_b is jack AND a row in which those names are reversed (id_a is jack and id_b is ben), the num_interactions should be included for both of those rows)

The resulting dataframe would look like this:

>>> names
    id_a   id_b  num_interactions
0    ben   jack               0.0
1   jack    ben               0.0
2   jill   amir               2.0
3  wilma   jill               1.0
4   amir  wilma               0.0
    enter code here

What I've Done

This works just fine, but it's ugly, hard to read, inefficient, and I know there must be a better way! Maybe with some sort of merge, but I don't really know how that works with complicated criteria...

for i in range(len(names)):
    names.loc[i, 'num_interactions'] = len(
        interactions[((interactions['individual1'] == names.loc[i, 'id_a']) &
                      (interactions['individual2'] == names.loc[i, 'id_b'])) |
                     ((interactions['individual2'] == names.loc[i, 'id_a']) &
                      (interactions['individual1'] == names.loc[i, 'id_b']))
                     ])

To Reproduce my example dataframes

In case you want to play around with this, you can use this to reproduce my dummy dataframes above.

import pandas as pd
names = pd.DataFrame(data={'id_a': ['ben', 'jack', 'jill', 'wilma', 'amir'],
                           'id_b': ['jack', 'ben', 'amir', 'jill', 'wilma']})

interactions = pd.DataFrame(data={'individual1': ['jill', 'jack',
                                                  'wilma', 'amir',
                                                  'amir', 'jack', 'jill'],
                                  'individual2': ['jack', 'jill', 'jill',
                                                  'jill', 'jack', 'amir',
                                                  'amir']})

Thanks in advance!

2
  • something like names['num_interactions'] = interactions.groupby(['individual1 ','individual2']).transform('count') ? Commented Dec 20, 2017 at 18:23
  • @LucasDresl That won't handle the varying order amongst pairs, unfortunately. Commented Dec 20, 2017 at 18:27

2 Answers 2

1

Assuming order doesn't matter, you can sort each dataframe by their columns. For the second dataframe, count each group of interactions with groupby + count and then perform a left outer merge on the result and the first dataframe.

i = pd.DataFrame(np.sort(names, axis=1))
j = pd.DataFrame(np.sort(interactions, axis=1))

k = j.groupby(j.columns.tolist())[0].count().reset_index(name='count')
df = i.merge(k, on=[0, 1], how='left')\
      .fillna(0)\
      .rename(columns={0 : 'id_a', 1 : 'id_b'})
df.iloc[:, :2] = names.values

df

   id_a   id_b  count
0   ben   jack    0.0
1   ben   jack    0.0
2  amir   jill    2.0
3  jill  wilma    1.0
4  amir  wilma    0.0
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks! This is close, but I should have been more clear: in the names, dataframe, if there is a "jack-ben" and a "ben-jack" row, each of them needs to be included, so I can't drop the duplicates and sort. I'll clarify in my original post.
@sacul See my edit? I've fixed the issue... it was pretty simple actually.
Yeah, that works! Still trying to mess around with your strategy as a base, so that I can end up with a dataframe in which the id_a and id_b are not modified.
@sacul If you settle on something you're satisfied with, feel free to make an edit to my answer, and smash that accept button when you feel your question has been answered. Good luck!
1

Sorry it is a bit ugly how i add new columns etc but you can get the idea and improve it... First I assume that all pairs in names are unique. So I give each pair an ID

names_ids = pd.DataFrame(pd.concat([names.iloc[:, 0] + '-' + names.iloc[:, 1],
                         names.iloc[:, 1] + '-' + names.iloc[:, 0]], 
                                   axis=0), 
                         columns=['pair'])
names_ids['id'] = names_ids.index
names_ids.index = names_ids.pair

Then I join these ids with the interactions where I again flip each pair in interactions

interactions_new = pd.DataFrame(pd.concat([interactions.iloc[:, 0] + '-' + interactions.iloc[:, 1],
                               interactions.iloc[:, 1] + '-' + interactions.iloc[:, 0]],
                                          axis=0),
                                columns=['pair'])
interactions_new['count'] = np.ones(len(interactions_new))

count_id = interactions_new.join(names_ids['id'], on='pair', how='left').groupby('id').count().loc[:, ['count']]
count_id['id'] = count_id.index

So in the end I just count each id in interactions:

names_ids.index = names_ids.id
result = count_id.join(names_ids.pair.iloc[:len(names_ids)/2], on='id', how='left')
result['count'] /= 2
print result

Ugly but no for loops and I get:

     count   id        pair
id                         
2.0      2  2.0   jill-amir
3.0      1  3.0  wilma-jill

2 Comments

Cool, this is a different way to approach it than I had thought, but it's effective. Only thing to think about is that as you flip the pairs, the counts end up doubled (in my original example, the count for jill-amir should have been 2, and the count for wilma-jill should have ended up at 1, your values are double this)
@sacul fixed :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.