
I am using graphframes in pyspark for some graph analytics and am wondering about the best way to create the edge list data frame from a vertices data frame.

For example, below is my vertices data frame: a list of ids, each belonging to one or more groups.

+---+-----+
|id |group|
+---+-----+
|a  |1    |
|b  |2    |
|c  |1    |
|d  |2    |
|e  |3    |
|a  |3    |
|f  |1    |
+---+-----+

My objective is to create an edge list data frame indicating which ids appear in common groups. Note that one id can appear in multiple groups (e.g. id a above is in groups 1 and 3). Below is the edge list data frame that I'd like to get:

+---+-----+-----+
|src|dst  |group|
+---+-----+-----+
|a  |c    |1    |
|a  |f    |1    |
|c  |f    |1    |
|b  |d    |2    |
|a  |e    |3    |
+---+-----+-----+

Thanks in advance!

  • What if you add one more row (id='f', group=1)? How do we know which id is src and which is dst? Is there any other column to sort the ids in each group? Commented Dec 29, 2020 at 3:55
  • @jxc This is a good point. Please see above for new examples including id = 'f' and group = 1. The src and dst order does not have to be fixed in my case; as long as two ids in the same group are shown in the same row, that satisfies the need. Commented Dec 29, 2020 at 15:19
  • @jxc I am using Spark 2.3. Commented Dec 29, 2020 at 15:59
  • Just do a self-join: df.alias('d1').join(df.alias('d2'), ['group']).filter("d1.id < d2.id").toDF("group", "src", "dst") Commented Dec 29, 2020 at 16:02
  • @jxc I think you should post this as an answer. It is more straightforward than the other two answers. Your solution is only missing a distinct() at the end (if we have, for example, two instances of (1, a), it will give us duplicate rows). Commented Feb 8, 2021 at 14:20
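For illustration only (outside Spark), the logic being asked for is simply: every unordered pair of ids that share a group becomes an edge. A plain-Python sketch of that, using the sample data from the question:

```python
from itertools import combinations

# Vertices from the question: (id, group) pairs.
rows = [("a", 1), ("b", 2), ("c", 1), ("d", 2), ("e", 3), ("a", 3), ("f", 1)]

# Collect the distinct ids per group.
groups = {}
for vid, grp in rows:
    groups.setdefault(grp, set()).add(vid)

# Every unordered pair of ids sharing a group is one edge.
edges = sorted(
    (src, dst, grp)
    for grp, ids in groups.items()
    for src, dst in combinations(sorted(ids), 2)
)
print(edges)
# [('a', 'c', 1), ('a', 'e', 3), ('a', 'f', 1), ('b', 'd', 2), ('c', 'f', 1)]
```

The answers below express this same per-group pairing in Spark terms.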

2 Answers


Edit 1

Not sure if it's the best way to solve this, but here is a workaround:

import pyspark.sql.functions as f
from pyspark.sql import Window

# Collect all ids sharing each group into an array column.
df = df.withColumn('match', f.collect_set('id').over(Window.partitionBy('group')))

# Explode the array so every (id, other-id-in-same-group) pair becomes a row.
df = df.select(f.col('id').alias('src'),
               f.explode('match').alias('dst'),
               f.col('group'))

# Sort each pair so (a, c) and (c, a) map to the same key, then drop
# self-loops and duplicate edges. Deduplicate per group so a pair that
# appears in two different groups keeps both rows. Note: array_sort
# requires Spark 2.4+; on Spark 2.3 use f.sort_array instead.
df = df.withColumn('duplicate_edges', f.array_sort(f.array('src', 'dst')))
df = (df
      .where(f.col('src') != f.col('dst'))
      .drop_duplicates(subset=['group', 'duplicate_edges'])
      .drop('duplicate_edges'))

df.sort('group', 'src', 'dst').show()

Output

+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  e|  a|    3|
+---+---+-----+

Original answer

Try this:

import pyspark.sql.functions as f

# Pair the first and last id seen in each group.
# Note: this emits only one edge per group, so groups with more than
# two ids will miss pairs (see the comments below).
df = (df
      .groupby('group')
      .agg(f.first('id').alias('src'),
           f.last('id').alias('dst')))

df.show()

Output:

+-----+---+---+
|group|src|dst|
+-----+---+---+
|    1|  a|  c|
|    3|  e|  a|
|    2|  b|  d|
+-----+---+---+

2 Comments

What @Kafels proposes is absolutely right. However, do not forget to include the following at the start of your code: import pyspark.sql.functions as f
Thank you both for the answer; this is a great approach! The only thing missing is that when I have more than 2 ids in the same group, only the first and last ids show up as src and dst, and the others are missed. For example, as @jxc mentioned in the comments, if we have another record with id = 'f' and group = 1, I'd expect a, c, f in group 1 to appear in the result data frame. The src & dst order doesn't really matter. I have updated the example in my question; would you be able to think of a way to handle it? Thanks!
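The limitation described in the comment above can be seen without Spark: taking only the first and last element of a three-id group yields a single pair and drops the rest. A quick stdlib illustration, using the group-1 ids from the question:

```python
from itertools import combinations

group1 = ['a', 'c', 'f']  # the three ids sharing group 1

# A first/last aggregation produces exactly one edge per group...
first_last_edge = (group1[0], group1[-1])
print(first_last_edge)  # ('a', 'f')

# ...but the full edge list for the group needs every pair.
all_edges = list(combinations(group1, 2))
print(all_edges)  # [('a', 'c'), ('a', 'f'), ('c', 'f')]
```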

You can do a self-join:

# Two renamed copies of the vertices data frame.
df = df.toDF('src', 'group')
df2 = df.toDF('dst', 'group2')

# Join on group; src < dst keeps each unordered pair exactly once
# and drops self-loops, while distinct() absorbs duplicate vertex rows.
result = df.join(
    df2,
    (df.group == df2.group2) & (df.src < df2.dst)
).select('src', 'dst', 'group').distinct().orderBy('group', 'src', 'dst')

result.show()
+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  a|  e|    3|
+---+---+-----+
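For illustration only (outside Spark), the same join-and-filter logic can be sketched in plain Python, where pairing rows on an equal group with src < dst mirrors the join condition and building a set plays the role of distinct(). A duplicated ("a", 1) row is added below to show why the deduplication matters:

```python
# Vertex rows from the question, plus a duplicated ("a", 1) row.
rows = [("a", 1), ("b", 2), ("c", 1), ("d", 2),
        ("e", 3), ("a", 3), ("f", 1), ("a", 1)]

# Self-join on group with src < dst; the set removes the duplicate
# edges produced by the repeated ("a", 1) row.
edges = {(s, d, g)
         for s, g in rows
         for d, g2 in rows
         if g == g2 and s < d}

print(sorted(edges))
# [('a', 'c', 1), ('a', 'e', 3), ('a', 'f', 1), ('b', 'd', 2), ('c', 'f', 1)]
```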

