
Given the following data:

Class       Name
=========   =============
Math        John Smith
Math        Jenny Simmons
English     Sarah Blume
English     John Smith
Chemistry   John Smith
Chemistry   Jenny Simmons
Physics     Sarah Blume
Physics     Jenny Simmons

I have a list of classes and names in each, like so:

[
    {'class': 'Math', 'student': 'John Smith'},
    {'class': 'Math', 'student': 'Jenny Simmons'},
    {'class': 'English', 'student': 'Sarah Blume'},
    {'class': 'English', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'Jenny Simmons'},
    {'class': 'Physics', 'student': 'Sarah Blume'},
    {'class': 'Physics', 'student': 'Jenny Simmons'},
]

I'd like to create an adjacency matrix, which would, as input, have the following structure, showing the number of students in common between each pair of classes:

(image: the desired adjacency matrix, with the classes as both row and column labels and each cell holding the number of students the two classes have in common)

How can I do this in Python/pandas in the most performant way? I have ~19M of these class/student pairs (~240MB) in my list.
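As a starting point for any pandas approach (a minimal sketch; the variable name `pairs` is assumed, and the keys are quoted so this is valid Python), the list of dicts loads directly into a DataFrame:

```python
import pandas as pd

# the class/student pairs from the question
pairs = [
    {'class': 'Math', 'student': 'John Smith'},
    {'class': 'Math', 'student': 'Jenny Simmons'},
    {'class': 'English', 'student': 'Sarah Blume'},
    {'class': 'English', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'Jenny Simmons'},
    {'class': 'Physics', 'student': 'Sarah Blume'},
    {'class': 'Physics', 'student': 'Jenny Simmons'},
]

# one row per class/student pair
df = pd.DataFrame(pairs)
```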

  • See the section on crosstab: stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe Commented Sep 3, 2019 at 22:10
  • @piRSquared: the linked article covers part of the question, but not all of it, since he doesn't have the data in the structure your method builds on, so I'm not sure marking it as a duplicate is justifiable. Commented Sep 3, 2019 at 23:27
  • @jottbe ok. I reopened it. Commented Sep 4, 2019 at 0:06

1 Answer


You can prepare the data for the adjacency matrix like this:

# create the "class tuples" by
# joining the dataframe with itself
df_cross = df.merge(df, on='student', suffixes=['_left', '_right'])
# remove the duplicate tuples
# --> this gives you an upper triangular
# matrix with an empty diagonal;
# if you would rather have a full matrix,
# just change the >= below to ==
del_indexer = (df_cross['class_left'] >= df_cross['class_right'])
df_cross.drop(df_cross[del_indexer].index, inplace=True)
# create the counts / lists
groupby_obj = df_cross.groupby(['class_left', 'class_right'])
result = groupby_obj.count()
result.columns = ['value']
# if you want lists of the student names
# each course combination has in common,
# add the following line; otherwise leave
# it out (with a dataset of the size you
# mentioned, it will consume a lot of memory)
result['students'] = groupby_obj['student'].agg(list)

The full output looks like this:

Out[133]: 
                        value                     students
class_left class_right                                    
Chemistry  English          1                 [John Smith]
           Math             2  [John Smith, Jenny Simmons]
           Physics          1              [Jenny Simmons]
English    Math             1                 [John Smith]
           Physics          1                [Sarah Blume]
Math       Physics          1              [Jenny Simmons]

You can then use @piRSquared's method to pivot it, or do it like this:

result['value'].unstack()

Out[137]: 
class_right  English  Math  Physics
class_left                         
Chemistry        1.0   2.0      1.0
English          NaN   1.0      1.0
Math             NaN   NaN      1.0
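The crosstab suggestion from the comments can also produce the full symmetric matrix directly: build a 0/1 class-by-student membership matrix and multiply it by its transpose. A sketch using the question's sample data (note the diagonal then holds each class's own enrollment):

```python
import pandas as pd

pairs = [
    {'class': 'Math', 'student': 'John Smith'},
    {'class': 'Math', 'student': 'Jenny Simmons'},
    {'class': 'English', 'student': 'Sarah Blume'},
    {'class': 'English', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'John Smith'},
    {'class': 'Chemistry', 'student': 'Jenny Simmons'},
    {'class': 'Physics', 'student': 'Sarah Blume'},
    {'class': 'Physics', 'student': 'Jenny Simmons'},
]
df = pd.DataFrame(pairs)

# class-by-student membership matrix (0/1)
m = pd.crosstab(df['class'], df['student'])

# class-by-class co-occurrence: m @ m.T counts
# students shared by each pair of classes
adj = m.dot(m.T)
```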

Or, if you also want the names:

result.unstack()
Out[138]: 
              value                   students                                              
class_right English Math Physics       English                         Math          Physics
class_left                                                                                  
Chemistry       1.0  2.0     1.0  [John Smith]  [John Smith, Jenny Simmons]  [Jenny Simmons]
English         NaN  1.0     1.0           NaN                 [John Smith]    [Sarah Blume]
Math            NaN  NaN     1.0           NaN                          NaN  [Jenny Simmons]
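If a full symmetric matrix is needed from the triangular counts, one option is to mirror the triangle with its transpose. A sketch; the hard-coded frame below just reproduces the `result['value'].unstack()` output from above:

```python
import numpy as np
import pandas as pd

# the upper-triangular table, as produced by result['value'].unstack()
tri = pd.DataFrame(
    {'English': [1.0, np.nan, np.nan],
     'Math': [2.0, 1.0, np.nan],
     'Physics': [1.0, 1.0, 1.0]},
    index=['Chemistry', 'English', 'Math'])

# expand to a square frame over the union of labels,
# then add the transpose to fill the lower triangle
labels = tri.index.union(tri.columns)
square = tri.reindex(index=labels, columns=labels).fillna(0)
full = square + square.T
```

The diagonal stays zero here because the answer's code drops same-class pairs before counting.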

3 Comments

Incredibly thorough — a terrific answer, and a very clear explanation!
Thank you for your feedback. I'm glad I could help. Have you already tried it on your full dataset? Does it perform?
Just over 2 minutes for the full dataset, it's great!
