Effectively create adjecency matrix from lists of neighbors from multiple CSVs

Question

I have a large collection of CSVs, each containing

an "id" column of values, and
a "neighbors" column with lists of strings.

Values from id do not occur in any neighbors lists. Values in id are unique per CSV, no CSV contains all the id values, and if two CSV share an id value, then the rows refer to the same object.

I would like to create something akin to a bipartite adjacency matrix from these CSVs, with a row for each each id value i, a column for each neighbors string j, and a 1 in cell (i,j) if there in any of the CSVs exists a row with id value i where string j occurs in the neighbors list.

The code below does what it's supposed to, but it takes very long.

Is there a more effective way of getting the job done?

import pandas as pd
adjacency_matrix = pd.DataFrame()
list_of_csv_files = [csv_file1, csv_file2, ...]

for file in list_of_csvs:
    df = pd.read_csv(file)
    df.set_index('id', inplace = True)

    for i in df.index:
        for j in df.at[i,'neighbors']:
            adjacency_matrix.at[i, j] = 1

Example:

Given that the csvs are loaded to the list list_of_dataframes = [df1,df2] with

df1 = pd.DataFrame(data={'id':['11','12'], 'N': [['a'], ['a', 'b']]})

    id  neighbors
0   11  [a]
1   12  [a, b]

df2 = pd.DataFrame(data={'id':['11','13'], 'N': [['c'], ['d']]})

    id  neighbors
0   11  [c]
1   13  [d]

I seek the dataframe

    a   b   c   d
11  1   NaN 1   NaN
12  1   1   NaN NaN
13  NaN NaN NaN 1

Rasmus · Accepted Answer · 2022-12-16 21:29:08Z

1

Yes there is a cleaner and efficient way using concat, explode and get_dummies. You can try this:

import pandas as pd
list_of_csv_files = [csv_file1, csv_file2, ...]
list_of_dfs = [pd.read_csv(file) for file in list_of_csv_files]
out = pd.concat(list_of_dfs)
out = out.explode('neighbors').drop_duplicates(ignore_index=True)
out.set_index('id', inplace=True)
out = pd.get_dummies(out, prefix='', prefix_sep='')
out = out.groupby('id').sum()

edited Dec 16, 2022 at 21:29

Rasmus

1521 silver badge11 bronze badges

answered Dec 16, 2022 at 14:33

SomeDude

14.3k5 gold badges26 silver badges49 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Rasmus Over a year ago

Thank you. I seem to loose the connection to the id values now, though, which I previously kept by writing them to the row index. Is it possible to do something similar here?

SomeDude Over a year ago

I updated the answer. please check.

Rasmus Over a year ago

Thank you. I've tested, and now added an example to be more clear (my apologies). When I try it on the dataframes of the example, out is returned correctly for 12 and 13, but incorrectly for 11, which misses an entry in column c.

SomeDude Over a year ago

Ok, that helped. Always give sample examples :) please try the code now.

Collectives™ on Stack Overflow

Effectively create adjecency matrix from lists of neighbors from multiple CSVs

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related