
Imagine I have this dataframe called temp:

import numpy as np
import pandas as pd
from numpy.random import default_rng

temp = pd.DataFrame(index=range(10), columns=list('abcd'))
for row in temp.index:
    temp.loc[row] = default_rng().choice(10, size=4, replace=False)

temp.loc[1, 'b'] = np.nan
temp.loc[3, 'd'] = np.nan

temp:

[screenshot of temp: a 10 x 4 frame of random digits, with NaN at (1, 'b') and (3, 'd')]

The values are of the same nature as the indices. My goal is to create an adjacency matrix whose index and columns are both temp.index, and which marks which values appear in each row.

What I have done:

temp2 = pd.DataFrame(index = temp.index, columns = temp.index)
for index in temp.index:  
    temp2.loc[index, temp.loc[index].dropna().values] = 1

temp2 = temp2.replace(np.nan, 0)

temp2:

[screenshot of temp2: a 10 x 10 frame of 0/1 values]

This does the job: for example, temp2 shows that row index 0 is adjacent to indices 4, 5, 7, and 8. In other words, values that appear in row 0 of temp get a 1 in row 0 of temp2, and everything else gets a 0.

Problem: there are 132K indices in the real temp, and creating temp2 raises a memory error. What is the most efficient way of getting to temp2? FWIW, the indices are range(132000). I'm also going to convert this matrix later to a torch tensor of shape (2, number_of_edges) holding the same adjacency info:

adj = torch.tensor(temp2.values)
edge_index = adj.nonzero().t().contiguous()
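
For a rough sense of scale, a dense 132,000 x 132,000 matrix is enormous regardless of dtype:

n = 132_000
cells = n * n                                 # 17_424_000_000 cells

print(f"int64: ~{cells * 8 / 1e9:.0f} GB")    # ~139 GB for a full int64 matrix
print(f"int8:  ~{cells * 1 / 1e9:.0f} GB")    # ~17 GB even at one byte per cell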
  • Tell me - what's 132_000**2? (Reinderien)
  • @Reinderien I know. That's why I'm asking for a better approach.

1 Answer


First of all, the idiomatic pandas approach to creating this output is a crosstab:

s = temp.stack()  # long format: one row per non-NaN cell (NaNs are dropped)
out = (pd.crosstab(s.index.get_level_values(0), s.values)
         .rename_axis(index=None, columns=None)
      )

Output:

   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  1  1  0  1  1  0
1  0  0  0  0  1  0  0  0  1  1
2  0  1  0  0  1  0  0  1  0  1
3  1  1  0  0  0  0  0  1  0  0
4  0  0  1  0  1  0  0  1  0  1
5  0  0  1  0  0  0  1  0  1  1
6  0  1  0  1  1  0  0  1  0  0
7  1  0  0  1  0  1  1  0  0  0
8  1  1  0  0  0  0  0  1  1  0
9  0  1  0  0  0  1  1  0  0  1
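
If the full square matrix is actually needed at 132K scale, it has to be stored sparsely; as a sketch (using scipy.sparse, which is not part of this answer's approach), the same stacked Series s can feed a COO matrix directly:

import numpy as np
from scipy import sparse

rows = s.index.get_level_values(0).to_numpy()   # source index of each edge
cols = s.to_numpy().astype(np.int64)            # the stored value is the adjacent index
data = np.ones(len(s), dtype=np.int8)

n = len(temp.index)
adj_sparse = sparse.coo_matrix((data, (rows, cols)), shape=(n, n))
# adj_sparse.toarray() reproduces the crosstab above on the small example,
# but only the non-zero entries are stored, so it scales to 132K x 132K.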

However, if your goal is to create a tensor of shape (2, number_of_edges), why create an intermediate square DataFrame?

Directly create the desired tensor:

import torch

idx = s.index.get_level_values(0)                         # first row: the source index
coord = torch.tensor([idx, s.values], dtype=torch.int32)  # second row: the adjacent value

Output coord:

tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6,
         6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
        [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8, 9, 2, 6, 1, 3,
         7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]], dtype=torch.int32)

And if you want, you can create a sparse square tensor with sparse_coo_tensor:

out = torch.sparse_coo_tensor(coord, torch.ones(len(s)))

NB: if the same value can appear twice in a row of temp (duplicate coordinates), you additionally need to coalesce the tensor.

Output:

tensor(indices=tensor([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5,
                        5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9],
                       [7, 4, 8, 5, 4, 8, 9, 1, 4, 9, 7, 0, 7, 1, 9, 2, 4, 7, 8,
                        9, 2, 6, 1, 3, 7, 4, 0, 5, 6, 3, 1, 0, 8, 7, 6, 5, 1, 9]]),
       values=tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                      1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
       size=(10, 10), nnz=38, layout=torch.sparse_coo)
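
As the NB above says, duplicate coordinates (the same value appearing twice in one row of temp) are kept as separate entries; a minimal sketch of coalescing them, clamping the summed values back to 1 if a strict 0/1 adjacency is wanted:

out = out.coalesce()    # merges duplicate (row, col) entries by summing their values
out = torch.sparse_coo_tensor(out.indices(), out.values().clamp(max=1), out.size())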

2 Comments

  • This goes OOM for me on 135k x 135k.
  • The torch tensor code is precisely what I needed. Thanks for being so to-the-point.
