
Imagine I have this dataframe called temp:

import numpy as np
import pandas as pd
from numpy.random import default_rng

temp = pd.DataFrame(index=range(10), columns=list('abcd'))
for row in temp.index:
    temp.loc[row] = default_rng().choice(10, size=4, replace=False)

temp.loc[1, 'b'] = np.nan
temp.loc[3, 'd'] = np.nan

temp:

[screenshot of temp: a 10 x 4 frame of random digits, with NaN at (1, 'b') and (3, 'd')]

The values are of the same nature as the indices. My goal is to create an adjacency matrix whose index and columns are both temp.index, and which marks which values appear in each row.

What I have done:

temp2 = pd.DataFrame(index = temp.index, columns = temp.index)
for index in temp.index:  
    temp2.loc[index, temp.loc[index].dropna().values] = 1

temp2 = temp2.replace(np.nan, 0)

temp2:

[screenshot of temp2: a 10 x 10 frame of 0/1 values]

This does the job: for example, temp2 shows that row index 0 is adjacent to indices 4, 5, 7, and 8. In other words, values that appear in row 0 of temp get a 1 in row 0 of temp2, and everything else gets a 0.

Problem: there are 132K indices in the real temp, and creating temp2 raises a memory error. What is the most efficient way of getting to temp2? FWIW, the indices are range(132000). I'm also going to convert this matrix later to a torch tensor of shape (2, number_of_edges) holding the same adjacency info:

adj = torch.tensor(temp2.values)
edge_index = adj.nonzero().t().contiguous()
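
For a rough sense of scale, a dense 132,000 x 132,000 matrix is enormous regardless of dtype:

n = 132_000
cells = n * n                                 # 17_424_000_000 cells

print(f"int64: ~{cells * 8 / 1e9:.0f} GB")    # ~139 GB for a full int64 matrix
print(f"int8:  ~{cells * 1 / 1e9:.0f} GB")    # ~17 GB even at one byte per cell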
  • Tell me - what's 132_000**2? (Reinderien)
  • @Reinderien I know. That's why I'm asking for a better approach.

1 Answer


First of all, the idiomatic pandas approach to creating this output is a crosstab:

s = temp.stack()  # long format: one row per non-NaN cell (NaNs are dropped)
out = (pd.crosstab(s.index.get_level_values(0), s.values)
         .rename_axis(index=None, columns=None)
      )

Output:

   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  1  1  0  1  1  0
1  0  0  0  0  1  0  0  0  1  1
2  0  1  0  0  1  0  0  1  0  1
3  1  1  0  0  0  0  0  1  0  0
4  0  0  1  0  1  0  0  1  0  1
5  0  0  1  0  0  0  1  0  1  1
6  0  1  0  1  1  0  0  1  0  0
7  1  0  0  1  0  1  1  0  0  0
8  1  1  0  0  0  0  0  1  1  0
9  0  1  0  0  0  1  1  0  0  1
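
If the full square matrix is actually needed at 132K scale, it has to be stored sparsely; as a sketch (using scipy.sparse, which is not part of this answer's approach), the same stacked Series s can feed a COO matrix directly:

import numpy as np
from scipy import sparse

rows = s.index.get_level_values(0).to_numpy()   # source index of each edge
cols = s.to_numpy().astype(np.int64)            # the stored value is the adjacent index
data = np.ones(len(s), dtype=np.int8)

n = len(temp.index)
adj_sparse = sparse.coo_matrix((data, (rows, cols)), shape=(n, n))
# adj_sparse.toarray() reproduces the crosstab above on the small example,
# but only the non-zero entries are stored, so it scales to 132K x 132K.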

However, if your goal is to create a tensor of shape (2, number_of_edges), why create an intermediate square DataFrame?

Directly create the desired tensor:

import torch

idx = s.index.get_level_values(0)                         # first row: the source index
coord = torch.tensor([idx, s.values], dtype=torch.int32)  # second row: the adjacent value

Output coord:

tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
         6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
        [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8, 9, 2, 6, 1, 3,
         7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]], dtype=torch.int32)

And if you want, you can create a sparse square tensor with sparse_coo_tensor:

out = torch.sparse_coo_tensor(coord, torch.ones(len(s)))

NB: if the same value can appear twice in a row of temp (duplicate coordinates), you additionally need to coalesce the tensor.

Output:

tensor(indices=tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5,
                        5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
                       [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8,
                        9, 2, 6, 1, 3, 7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]]),
       values=tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
       size=(10, 10), nnz=38, layout=torch.sparse_coo)
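
As the NB above says, duplicate coordinates (the same value appearing twice in one row of temp) are kept as separate entries; a minimal sketch of coalescing them, clamping the summed values back to 1 if a strict 0/1 adjacency is wanted:

out = out.coalesce()    # merges duplicate (row, col) entries by summing their values
out = torch.sparse_coo_tensor(out.indices(), out.values().clamp(max=1), out.size())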

2 Comments

  • This goes OOM for me on 135k x 135k.
  • The torch tensor code is precisely what I needed. Thanks for being so to-the-point.
