17

I have a pandas DataFrame, df (think of it as a weighted adjacency matrix of the nodes in a network), of the form:

    A    B    C    D
A   0    0.5  0.5  0
B   1    0    0    0
C   0.8  0    0    0.2
D   0    0    1    0

I want to get a DataFrame that instead represents an edge list. For the above example, I would need something of the form, edge_list_df:

    Source    Target    Weight    
0   A           B        0.5 
1   A           C        0.5
2   A           D        0
3   B           A        1
4   B           C        0
5   B           D        0
6   C           A        0.8
7   C           B        0
8   C           D        0.2
9   D           A        0
10  D           B        0
11  D           C        1

What is the most efficient way to create this?

4 Answers

10

Using rename_axis + reset_index + melt:

df.rename_axis('Source')\
  .reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)

  Source Target  Weight
0       B      A     1.0
1       C      A     0.8
2       D      A     0.0
3       A      B     0.5
4       C      B     0.0
5       D      B     0.0
6       A      C     0.5
7       B      C     0.0
8       D      C     1.0
9       A      D     0.0
10      B      D     0.0
11      C      D     0.2

melt became a DataFrame method in pandas 0.20; on older versions, use pd.melt instead:

v = df.rename_axis('Source').reset_index()
df = pd.melt(
      v, 
      id_vars='Source', 
      value_name='Weight', 
      var_name='Target'
).query('Source != Target')\
 .reset_index(drop=True)
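The melt output above is ordered by Target, whereas the question lists edges by Source first. A minimal sketch of the same chain with a sort added, assuming the example frame from the question (the sort_values step is an addition, not part of the original answer):

```python
import pandas as pd

# Example adjacency matrix from the question
df = pd.DataFrame(
    [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 1.0, 0.0]],
    index=list("ABCD"), columns=list("ABCD"),
)

edge_list_df = (
    df.rename_axis("Source")
      .reset_index()
      .melt("Source", var_name="Target", value_name="Weight")
      .query("Source != Target")            # drop self-loops (the diagonal)
      .sort_values(["Source", "Target"])    # added step: order rows by Source, then Target
      .reset_index(drop=True)
)
print(edge_list_df)
```

The extra sort costs some time on large matrices, so it is only worth adding when the row order matters.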

Timings

import numpy as np
import pandas as pd

x = np.random.randn(1000, 1000)
np.fill_diagonal(x, 0)  # zero the diagonal (no self-loops)

df = pd.DataFrame(x)

%%timeit
df.index.name = 'Source'
df.reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)

1 loop, best of 3: 139 ms per loop

# Wen's solution

%%timeit
df.values[(np.arange(len(df)),) * 2] = np.nan
df.stack().reset_index()

10 loops, best of 3: 45 ms per loop

9

Mark the diagonal as NaN, then stack:

df.values[(np.arange(len(df)),) * 2] = np.nan  # tuple index selects the diagonal
df
Out[172]: 
     A    B    C    D
A  NaN  0.5  0.5  0.0
B  1.0  NaN  0.0  0.0
C  0.8  0.0  NaN  0.2
D  0.0  0.0  1.0  NaN
df.stack().reset_index()
Out[173]: 
   level_0 level_1    0
0        A       B  0.5
1        A       C  0.5
2        A       D  0.0
3        B       A  1.0
4        B       C  0.0
5        B       D  0.0
6        C       A  0.8
7        C       B  0.0
8        C       D  0.2
9        D       A  0.0
10       D       B  0.0
11       D       C  1.0
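Writing through df.values mutates the original frame, and depending on the pandas version and the dtypes involved the write may not reach the frame at all or may be rejected as read-only. A minimal sketch of a non-mutating variant of the same idea, assuming the example frame from the question; the where/rename_axis/dropna steps are additions for illustration, not part of the answer above. Newer pandas releases may keep NaN rows when stacking, so the sketch drops them explicitly:

```python
import numpy as np
import pandas as pd

# Example adjacency matrix from the question
df = pd.DataFrame(
    [[0.0, 0.5, 0.5, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0, 0.2],
     [0.0, 0.0, 1.0, 0.0]],
    index=list("ABCD"), columns=list("ABCD"),
)

# True everywhere except on the diagonal
off_diag = ~np.eye(len(df), dtype=bool)

edge_list_df = (
    df.where(off_diag)                                   # NaN on the diagonal, df untouched
      .rename_axis(index="Source", columns="Target")     # names become the output columns
      .stack()
      .dropna()                                          # remove diagonal entries if stack kept them
      .rename("Weight")
      .reset_index()
)
print(edge_list_df)
```

Note that dropna also removes any genuine NaN weights, so this assumes the matrix has no missing values off the diagonal.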

5 Comments

@cᴏʟᴅsᴘᴇᴇᴅ I got this one from Tai's question and your answer :-), this one: stackoverflow.com/questions/48201521/…
Nice. Added timings. This is also faster.
You could use np.fill_diagonal(df.values, np.nan) for diagonal setting.
Is it possible to change "level_0", "level_1", "0"? For example, to "From", "To", "Weight"?
@LarryCai you can rename them: df.stack().reset_index().rename(columns={'level_0': 'From', ...})

5

Two approaches using NumPy tools -

Approach #1

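def edgelist(df):
    a = df.values
    c = df.columns
    n = len(c)

    # Build an (n, n, 2) grid holding every (source, target) label pair
    c_ar = np.array(c)
    out = np.empty((n, n, 2), dtype=c_ar.dtype)

    out[...,0] = c_ar[:,None]   # row label -> Source
    out[...,1] = c_ar           # column label -> Target

    # Keep only off-diagonal entries (no self-loops)
    mask = ~np.eye(n,dtype=bool)
    df_out = pd.DataFrame(out[mask], columns=['Source','Target'])
    df_out['Weight'] = a[mask]
    return df_out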

Sample run -

In [155]: df
Out[155]: 
     A    B    C    D
A  0.0  0.5  0.5  0.0
B  1.0  0.0  0.0  0.0
C  0.8  0.0  0.0  0.2
D  0.0  0.0  1.0  0.0

In [156]: edgelist(df)
Out[156]: 
   Source Target  Weight
0       A      B     0.5
1       A      C     0.5
2       A      D     0.0
3       B      A     1.0
4       B      C     0.0
5       B      D     0.0
6       C      A     0.8
7       C      B     0.0
8       C      D     0.2
9       D      A     0.0
10      D      B     0.0
11      D      C     1.0

Approach #2

# https://stackoverflow.com/a/46736275/ @Divakar
def skip_diag_strided(A):
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0,s1 = A.strides
    return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1))

# https://stackoverflow.com/a/48234170/ @Divakar
def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n,n-1,2),dtype=a.dtype)
    out[:,:,0] = np.broadcast_to(a[:,None], (n, n-1))
    out.shape = (n-1,n,2)
    out[:,:,1] = onecold(a)
    out.shape = (-1,2)
    return out  

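# astype(str) gives a fixed-width string array; 'S1' would yield bytes on Python 3 and truncate longer labels
cols = df.columns.values.astype(str)
df_out = pd.DataFrame(combinations_without_repeat(cols), columns=['Source', 'Target'])
df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()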
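combinations_without_repeat calls onecold, which is not defined in this post (presumably it comes from the linked answer). A minimal sketch of a helper that is consistent with how it is used here; this is a reconstruction under that assumption, not necessarily the author's exact code:

```python
import numpy as np

def onecold(a):
    # Hypothetical reconstruction: row i holds `a` rotated left by i + 1 positions,
    # so when read row by row, each label is paired with every other label.
    n = len(a)
    b = np.concatenate((a, a[:-1]))   # a followed by its first n-1 items
    s = b.strides[0]                  # stride of the freshly built contiguous array
    strided = np.lib.stride_tricks.as_strided
    return strided(b[1:], shape=(n - 1, n), strides=(s, s))

print(onecold(np.array(list("ABCD"))))
```

For the four labels A to D this gives the rows B C D A, C D A B and D A B C. Read row by row, that is B C D, A C D, A B D, A B C, the targets for sources A, B, C and D in turn.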

Runtime test

Using @cᴏʟᴅsᴘᴇᴇᴅ's timing setup:

In [704]: x = np.random.randn(1000, 1000)
     ...: np.fill_diagonal(x, 0)
     ...: 
     ...: df = pd.DataFrame(x)

# @cᴏʟᴅsᴘᴇᴇᴅ's soln
In [705]: %%timeit
     ...: df.index.name = 'Source'
     ...: df.reset_index()\
     ...:   .melt('Source', value_name='Weight', var_name='Target')\
     ...:   .query('Source != Target')\
     ...:   .reset_index(drop=True)
10 loops, best of 3: 67.4 ms per loop

# @Wen's soln
In [706]: %%timeit
     ...: df.values[(np.arange(len(df)),) * 2] = np.nan
     ...: df.stack().reset_index()
100 loops, best of 3: 19.6 ms per loop

# Proposed in this post - Approach #1
In [707]: %timeit edgelist(df)
10 loops, best of 3: 24.8 ms per loop

# Proposed in this post - Approach #2
In [708]: %%timeit
     ...: cols = df.columns.values.astype('S1')
     ...: df_out = pd.DataFrame(combinations_without_repeat(cols))
     ...: df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
100 loops, best of 3: 17.4 ms per loop

5

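Using the NetworkX 2.x API (note that zero-weight entries are treated as missing edges, so they do not appear in the result):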

import networkx as nx

In [246]: G = nx.from_pandas_adjacency(df, create_using=nx.MultiDiGraph())

In [247]: G.edges(data=True)
Out[247]: OutMultiEdgeDataView([('A', 'B', {'weight': 0.5}), ('A', 'C', {'weight': 0.5}), ('B', 'A', {'weight': 1.0}),
                                ('C', 'A', {'weight': 0.8}), ('C', 'D', {'weight': 0.2}), ('D', 'C', {'weight': 1.0})])

In [248]: nx.to_pandas_edgelist(G)
Out[248]:
  source target  weight
0      A      B     0.5
1      A      C     0.5
2      B      A     1.0
3      C      A     0.8
4      C      D     0.2
5      D      C     1.0
