Imagine I have this dataframe called temp:
import numpy as np
import pandas as pd
from numpy.random import default_rng

temp = pd.DataFrame(index=range(10), columns=list('abcd'))
for row in temp.index:
    temp.loc[row] = default_rng().choice(10, size=4, replace=False)
temp.loc[1, 'b'] = np.nan
temp.loc[3, 'd'] = np.nan
temp:
The values are drawn from the same range as the indices. My goal is to create an adjacency matrix whose rows and columns are both temp.index, with a 1 wherever that value appeared in the given index's row.
What I have done:
temp2 = pd.DataFrame(index=temp.index, columns=temp.index)
for index in temp.index:
    temp2.loc[index, temp.loc[index].dropna().values] = 1
temp2 = temp2.replace(np.nan, 0)
temp2:
This does the job: for example, temp2 shows that index 0 is adjacent to indices 4, 5, 7, and 8. In other words, the values that appeared in row 0 of temp get a 1 in row 0 of temp2, and every other entry is 0.
Problem: there are 132K indices in the real temp, and creating temp2 throws a memory error. What is the most efficient way of getting to temp2? FWIW, the indices are range(132000). Also, I'm later going to convert this matrix to a Torch tensor of shape (2, number of edges) that encodes the same adjacency info:
import torch

adj = torch.tensor(temp2.values)
edge_index = adj.nonzero().t().contiguous()
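For reference, since edge_index only needs the (row, value) pairs, it may be possible to skip the dense 132000x132000 matrix entirely and read the pairs straight out of temp. A minimal sketch of that idea, using a small hypothetical stand-in for temp (DataFrame.stack plus dropna yields one entry per non-NaN cell, with the source row in index level 0):

```python
import numpy as np
import pandas as pd

# Small stand-in for the real temp (hypothetical data).
temp = pd.DataFrame({'a': [4.0, 2.0], 'b': [5.0, np.nan],
                     'c': [7.0, 3.0], 'd': [8.0, 9.0]})

# One entry per non-NaN cell; level 0 of the MultiIndex is the source row.
# The explicit dropna() guards against pandas versions where stack() keeps NaNs.
pairs = temp.stack().dropna()
src = pairs.index.get_level_values(0).to_numpy()
dst = pairs.to_numpy().astype(np.int64)
edge_index = np.vstack([src, dst])  # shape (2, number of edges)
```

This never allocates anything larger than the number of non-NaN cells, and the result can be wrapped with torch.from_numpy if a tensor is needed.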

