1

I have a spreadsheet with lists of names of people that a particular person reported working with on a number of projects. If I import it to pandas as a dataframe it will look like this:

       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']

I want to create an adjacency matrix that will look like this:

      Jane  Fred  Terry  Bob  Alex  Joe
Jane  0     2     0      1    0     2
Fred  1     0     0      0    1     0
Terry 0     0     0      1    0     0
Bob   1     0     1      0    0     1
Alex  0     1     0      0    0     0
Joe   2     0     0      0    0     0

This matrix, generally, will NOT be symmetric because of inconsistency with people's reports. I have been creating the adjacency matrix just by looping through the dataframe and incrementing the the matrix elements accordingly. Apparently, looping through dataframes is NOT recommended and inefficient, so does anyone have a suggestion on how his could be done more pythonically?

2
  • just to check, the content of the cells are list? can you verify with type(df.iloc[0,0]) Commented Aug 24, 2021 at 20:13
  • 1
    good question, when I import the data, it actually comes in as a string of the form '[Name1, Name2, ...]', but I apply a string2list() function to turn the contents of all the cells into lists. But, in short, the answer is yes, the type of each cell is list (unless it was np.NaN). BTW, instead of names, I actually have integer ID numbers (names have been anonymized) so the contents of each cell is actually a list of integers. Commented Aug 24, 2021 at 21:27

2 Answers 2

3

This is the sample of the data I used to work with.

df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1':[['Fred', 'Joe'], ['Alex'], np.nan,['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']]
})

df.head()
    Name            1                 2
0   Jane  [Fred, Joe]  [Joe, Fred, Bob]
1   Fred       [Alex]            [Jane]
2  Terry          NaN             [Bob]
3    Bob        [Joe]     [Jane, Terry]
4   Alex       [Fred]               NaN

I created the adjacency matrix using pandas in three simple steps.

First, I melted the data to have one column only for all the connections between the different names and dropped the variable column.

dff = df.melt(id_vars=['Name']).drop('variable', axis=1)
     Name             value
0    Jane       [Fred, Joe]
1    Fred            [Alex]
2   Terry               NaN
3     Bob             [Joe]
4    Alex            [Fred]
5     Joe            [Jane]
6    Jane  [Joe, Fred, Bob]
7    Fred            [Jane]
8   Terry             [Bob]
9     Bob     [Jane, Terry]
10   Alex               NaN
11    Joe            [Jane]

Secondly, I used the explode method to break down the rows with lists in separate rows.

dff = dff.explode('value')
     Name  value
0    Jane   Fred
0    Jane    Joe
1    Fred   Alex
2   Terry    NaN
3     Bob    Joe
4    Alex   Fred
5     Joe   Jane
6    Jane    Joe
6    Jane   Fred
6    Jane    Bob
7    Fred   Jane
8   Terry    Bob
9     Bob   Jane
9     Bob  Terry
10   Alex    NaN
11    Joe   Jane

Finally, to create the adjacency matrix I used crosstab within pandas which counts the occurrences in the two columns specified only.

pd.crosstab(dff['Name'], dff['value'])
value  Alex  Bob  Fred  Jane  Joe  Terry
Name                                    
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0
Sign up to request clarification or add additional context in comments.

1 Comment

This is brilliant, and exactly what I was looking for. Thank you!
1

Here is one approach:

import pandas as pd
import ast

data = '''       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']'''

df = pd.read_csv(io.StringIO(data), sep='\s\s+', engine='python').fillna('[]').applymap(ast.literal_eval) #if your columns are already lists rather than string representations, use .fillna([]) and skip the applymap
df['all'] = df['1']+df['2'] #merge lists of columns 1 and 2

df_edges = df[['all']].explode('all').reset_index() #create new df by exploding the combined list
df_edges = df_edges.groupby(['index', 'all'])['all'].count().reset_index(name="count") #groupby and count the pairs

df_edges.pivot(index='index', columns='all', values='count').fillna(0) #create adjacency matrix with pivot

Output:

index Alex Bob Fred Jane Joe Terry
Alex 0 0 1 0 0 0
Bob 0 0 0 1 1 1
Fred 1 0 0 1 0 0
Jane 0 1 2 0 2 0
Joe 0 0 0 2 0 0
Terry 0 1 0 0 0 0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.