
I am using graphframes in pyspark for some graph analytics and am wondering about the best way to create the edge list data frame from a vertices data frame.

For example, below is my vertices data frame: a list of ids, each belonging to one or more groups.

+---+-----+
|id |group|
+---+-----+
|a  |1    |
|b  |2    |
|c  |1    |
|d  |2    |
|e  |3    |
|a  |3    |
|f  |1    |
+---+-----+

My objective is to create an edge list data frame indicating which ids appear in common groups. Note that one id can appear in multiple groups (e.g. id a above is in groups 1 and 3). Below is the edge list data frame that I'd like to get:

+---+-----+-----+
|src|dst  |group|
+---+-----+-----+
|a  |c    |1    |
|a  |f    |1    |
|c  |f    |1    |
|b  |d    |2    |
|a  |e    |3    |
+---+-----+-----+

Thanks in advance!

  • What if you add one more row (id='f', group=1)? How do we know which id is src and which is dst? Is there any other column to sort the ids in each group? Commented Dec 29, 2020 at 3:55
  • @jxc This is a good point. Please see above for new examples including id = 'f' and group = 1. The src and dst order does not have to be fixed in my case; as long as two ids in the same group are shown in the same row, that satisfies the need. Commented Dec 29, 2020 at 15:19
  • @jxc I am using Spark 2.3. Commented Dec 29, 2020 at 15:59
  • Just do a self-join: df.alias('d1').join(df.alias('d2'), ['group']).filter("d1.id < d2.id").toDF("group", "src", "dst") Commented Dec 29, 2020 at 16:02
  • @jxc I think you should post this as an answer. It is more straightforward than the other two answers. Your solution is only missing a distinct() at the end (if we have, for example, two instances of (1, a), it will give us duplicate rows). Commented Feb 8, 2021 at 14:20
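For illustration only (outside Spark), the logic being asked for is simply: every unordered pair of ids that share a group becomes an edge. A plain-Python sketch of that, using the sample data from the question:

```python
from itertools import combinations

# Vertices from the question: (id, group) pairs.
rows = [("a", 1), ("b", 2), ("c", 1), ("d", 2), ("e", 3), ("a", 3), ("f", 1)]

# Collect the distinct ids per group.
groups = {}
for vid, grp in rows:
    groups.setdefault(grp, set()).add(vid)

# Every unordered pair of ids sharing a group is one edge.
edges = sorted(
    (src, dst, grp)
    for grp, ids in groups.items()
    for src, dst in combinations(sorted(ids), 2)
)
print(edges)
# [('a', 'c', 1), ('a', 'e', 3), ('a', 'f', 1), ('b', 'd', 2), ('c', 'f', 1)]
```

The answers below express this same per-group pairing in Spark terms.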

2 Answers


Edit 1

Not sure if it's the best way to solve this, but here is a workaround:

import pyspark.sql.functions as f
from pyspark.sql import Window

# Collect all ids sharing each group into an array column.
df = df.withColumn('match', f.collect_set('id').over(Window.partitionBy('group')))

# Explode the array so every (id, other-id-in-same-group) pair becomes a row.
df = df.select(f.col('id').alias('src'),
               f.explode('match').alias('dst'),
               f.col('group'))

# Sort each pair so (a, c) and (c, a) map to the same key, then drop
# self-loops and duplicate edges. Deduplicate per group so a pair that
# appears in two different groups keeps both rows. Note: array_sort
# requires Spark 2.4+; on Spark 2.3 use f.sort_array instead.
df = df.withColumn('duplicate_edges', f.array_sort(f.array('src', 'dst')))
df = (df
      .where(f.col('src') != f.col('dst'))
      .drop_duplicates(subset=['group', 'duplicate_edges'])
      .drop('duplicate_edges'))

df.sort('group', 'src', 'dst').show()

Output

+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  e|  a|    3|
+---+---+-----+

Original answer

Try this:

import pyspark.sql.functions as f

# Pair the first and last id seen in each group.
# Note: this emits only one edge per group, so groups with more than
# two ids will miss pairs (see the comments below).
df = (df
      .groupby('group')
      .agg(f.first('id').alias('src'),
           f.last('id').alias('dst')))

df.show()

Output:

+-----+---+---+
|group|src|dst|
+-----+---+---+
|    1|  a|  c|
|    3|  e|  a|
|    2|  b|  d|
+-----+---+---+

2 Comments

What @Kafels proposes is absolutely right. However, do not forget to include the following at the start of your code: import pyspark.sql.functions as f
Thank you both for the answer; this is a great approach! The only thing missing is that when I have more than 2 ids in the same group, only the first and last ids show up as src and dst, and the others are missed. For example, as @jxc mentioned in the comments, if we have another record with id = 'f' and group = 1, I'd expect a, c, f in group 1 to appear in the result data frame. The src & dst order doesn't really matter. I have updated the example in my question; would you be able to think of a way to handle it? Thanks!
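The limitation described in the comment above can be seen without Spark: taking only the first and last element of a three-id group yields a single pair and drops the rest. A quick stdlib illustration, using the group-1 ids from the question:

```python
from itertools import combinations

group1 = ['a', 'c', 'f']  # the three ids sharing group 1

# A first/last aggregation produces exactly one edge per group...
first_last_edge = (group1[0], group1[-1])
print(first_last_edge)  # ('a', 'f')

# ...but the full edge list for the group needs every pair.
all_edges = list(combinations(group1, 2))
print(all_edges)  # [('a', 'c'), ('a', 'f'), ('c', 'f')]
```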

You can do a self-join:

# Two renamed copies of the vertices data frame.
df = df.toDF('src', 'group')
df2 = df.toDF('dst', 'group2')

# Join on group; src < dst keeps each unordered pair exactly once
# and drops self-loops, while distinct() absorbs duplicate vertex rows.
result = df.join(
    df2,
    (df.group == df2.group2) & (df.src < df2.dst)
).select('src', 'dst', 'group').distinct().orderBy('group', 'src', 'dst')

result.show()
+---+---+-----+
|src|dst|group|
+---+---+-----+
|  a|  c|    1|
|  a|  f|    1|
|  c|  f|    1|
|  b|  d|    2|
|  a|  e|    3|
+---+---+-----+
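For illustration only (outside Spark), the same join-and-filter logic can be sketched in plain Python, where pairing rows on an equal group with src < dst mirrors the join condition and building a set plays the role of distinct(). A duplicated ("a", 1) row is added below to show why the deduplication matters:

```python
# Vertex rows from the question, plus a duplicated ("a", 1) row.
rows = [("a", 1), ("b", 2), ("c", 1), ("d", 2),
        ("e", 3), ("a", 3), ("f", 1), ("a", 1)]

# Self-join on group with src < dst; the set removes the duplicate
# edges produced by the repeated ("a", 1) row.
edges = {(s, d, g)
         for s, g in rows
         for d, g2 in rows
         if g == g2 and s < d}

print(sorted(edges))
# [('a', 'c', 1), ('a', 'e', 3), ('a', 'f', 1), ('b', 'd', 2), ('c', 'f', 1)]
```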

