Spark: Flatten simple multi-column DataFrame

Question

How can I flatten a simple (i.e. no nested structures) DataFrame into a list? My problem is detecting all the node pairs that have been changed, added, or removed in a table of node pairs.

This means I have a "before" and an "after" table to compare. Combining the before and after DataFrames yields rows that describe where a pair appears in one DataFrame but not the other.

Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1  |after.id2  |
+-----------+-----------+-----------+-----------+
|       null|       null|         E2|         E3|
|         B3|         B1|       null|       null|
|         I1|         I2|       null|       null|
|         A2|         A3|       null|       null|
|       null|       null|         G3|         G4|
+-----------+-----------+-----------+-----------+
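
For readers who want to reproduce a table of this shape: the question does not say how the two tables were combined, so the following is only a sketch under the assumption that it was a full outer join on both ids, keeping the rows where one side is missing. The names `PairDiff`, `changedPairs`, and the underscore column names are hypothetical, and the ids in the source tables are assumed to be non-null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PairDiff {
  // Assumption: both inputs have string columns "id1" and "id2" with no nulls.
  // A full outer join keeps unmatched rows from both sides; the filter then
  // keeps only pairs that exist on exactly one side.
  def changedPairs(before: DataFrame, after: DataFrame): DataFrame =
    before.as("before")
      .join(
        after.as("after"),
        col("before.id1") === col("after.id1") && col("before.id2") === col("after.id2"),
        "full_outer")
      .where(col("before.id1").isNull || col("after.id1").isNull)
      .select(
        col("before.id1").as("before_id1"),
        col("before.id2").as("before_id2"),
        col("after.id1").as("after_id1"),
        col("after.id2").as("after_id2"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pair-diff").getOrCreate()
    import spark.implicits._
    val before = Seq(("A2", "A3"), ("X1", "X2")).toDF("id1", "id2")
    val after = Seq(("X1", "X2"), ("E2", "E3")).toDF("id1", "id2")
    changedPairs(before, after).show()
    spark.stop()
  }
}
```

The output columns use underscores rather than dots because a dot in a column name has to be quoted with backticks in later column expressions.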

The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:

{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}

Potential approaches:

Union each column separately, then apply distinct
flatMap and distinct
map and flatten

Since the structure is well known and simple, it seems like there should be an equally straightforward solution. Which of these approaches, or which other one, would be the simplest?

Other notes

Order within an id1-id2 pair is only important for change detection
Order in the resulting list is not important
DataFrame is between 10k and 100k rows
Distinct values in the resulting list are nice to have, but not required; I assume this is trivial with the distinct operation

So why is there no timestamp then? Order of tuples is thus significant? Many files to process in 1 run?
– Ged, Commented Nov 2, 2018 at 18:06
@thebluephantom timestamp is not needed since there is essentially a "before" and "after" table. Order of tuples is significant but somewhat out of scope since I have a dataframe of 4 columns with comparable values. No files to run, only this dataframe but it is considerable in size.
– joynoele, Commented Nov 2, 2018 at 18:16
Where I come from some form of timestamping is always req'd - many projects with that approach you just mentioned mean -> non-deterministic outcomes. Good luck though
– Ged, Commented Nov 2, 2018 at 18:35
Timestamping in similar situations may reduce the need for even solving this scenario in the first place - but all the same I'd like to know how to flatten a simple dataframe like this.
– joynoele, Commented Nov 2, 2018 at 19:18

Anurag Sharma · Accepted Answer · 2018-11-03 09:55:35Z

Try the following: convert each row to a Seq, collect all rows to the driver, flatten the result, and remove the null value:

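import spark.implicits._ // needed for .toDF; assumes a SparkSession named `spark`, as in spark-shell
val df = Seq(("A", "B"), (null, "A")).toDF
// Each Row becomes a Seq; collect to the driver, flatten, deduplicate via Set, then drop null
val result = df.rdd.map(_.toSeq)
  .collect().toList.flatten.toSet - null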
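
The snippet above collects every row to the driver before deduplicating, which is fine for 10k to 100k rows. If the distinct step should run on the cluster instead, the "flatMap and distinct" and "union the columns" ideas from the question might look roughly like the sketch below. The names `FlattenNodes`, `viaFlatMap`, and `viaUnion` are hypothetical, and the sketch assumes every column holds strings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FlattenNodes {
  // flatMap each Row into its cells, keep non-null strings, deduplicate on the cluster.
  def viaFlatMap(df: DataFrame): Set[String] =
    df.rdd
      .flatMap(_.toSeq)
      .collect { case s: String => s } // type pattern never matches null
      .distinct()
      .collect()
      .toSet

  // Select every column as a one-column DataFrame named "node", union them, then distinct.
  // Backticks let this work even when a column name contains a dot.
  def viaUnion(df: DataFrame): Set[String] =
    df.columns
      .map(c => df.select(col(s"`$c`").as("node")))
      .reduce(_ union _)
      .na.drop()
      .distinct()
      .collect()
      .map(_.getString(0))
      .toSet

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("flatten-nodes").getOrCreate()
    import spark.implicits._
    val df = Seq[(String, String, String, String)](
      (null, null, "E2", "E3"),
      ("B3", "B1", null, null)
    ).toDF("before_id1", "before_id2", "after_id1", "after_id2")
    println(viaFlatMap(df))
    println(viaUnion(df))
    spark.stop()
  }
}
```

Both functions return the same set; the RDD version is shorter, while the union version stays in the DataFrame API until the final collect.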

