0

I have a large collection of CSVs, each containing

  • an "id" column of values, and
  • a "neighbors" column with lists of strings.

Values from id do not occur in any neighbors lists. Values in id are unique per CSV, no CSV contains all the id values, and if two CSV share an id value, then the rows refer to the same object.

I would like to create something akin to a bipartite adjacency matrix from these CSVs, with a row for each each id value i, a column for each neighbors string j, and a 1 in cell (i,j) if there in any of the CSVs exists a row with id value i where string j occurs in the neighbors list.

The code below does what it's supposed to, but it takes very long.

Is there a more effective way of getting the job done?

import pandas as pd
adjacency_matrix = pd.DataFrame()
list_of_csv_files = [csv_file1, csv_file2, ...]

for file in list_of_csvs:
    df = pd.read_csv(file)
    df.set_index('id', inplace = True)

    for i in df.index:
        for j in df.at[i,'neighbors']:
            adjacency_matrix.at[i, j] = 1

Example:

Given that the csvs are loaded to the list list_of_dataframes = [df1,df2] with

df1 = pd.DataFrame(data={'id':['11','12'], 'N': [['a'], ['a', 'b']]})

    id  neighbors
0   11  [a]
1   12  [a, b]

df2 = pd.DataFrame(data={'id':['11','13'], 'N': [['c'], ['d']]})

    id  neighbors
0   11  [c]
1   13  [d]

I seek the dataframe

    a   b   c   d
11  1   NaN 1   NaN
12  1   1   NaN NaN
13  NaN NaN NaN 1

1 Answer 1

1

Yes there is a cleaner and efficient way using concat, explode and get_dummies. You can try this:

import pandas as pd
list_of_csv_files = [csv_file1, csv_file2, ...]
list_of_dfs = [pd.read_csv(file) for file in list_of_csv_files]
out = pd.concat(list_of_dfs)
out = out.explode('neighbors').drop_duplicates(ignore_index=True)
out.set_index('id', inplace=True)
out = pd.get_dummies(out, prefix='', prefix_sep='')
out = out.groupby('id').sum()
Sign up to request clarification or add additional context in comments.

4 Comments

Thank you. I seem to loose the connection to the id values now, though, which I previously kept by writing them to the row index. Is it possible to do something similar here?
I updated the answer. please check.
Thank you. I've tested, and now added an example to be more clear (my apologies). When I try it on the dataframes of the example, out is returned correctly for 12 and 13, but incorrectly for 11, which misses an entry in column c.
Ok, that helped. Always give sample examples :) please try the code now.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.