How to flatten a simple (i.e. no nested structures) dataframe into a list?

My problem is detecting all the node pairs that have been changed, added, or removed in a table of node pairs.

This means I have a "before" and an "after" table to compare. Combining the before and after dataframes yields rows that describe where a pair appears in one dataframe but not the other.

Example:
+-----------+-----------+-----------+-----------+
|before.id1 |before.id2 |after.id1  |after.id2  |
+-----------+-----------+-----------+-----------+
|       null|       null|         E2|         E3|
|         B3|         B1|       null|       null|
|         I1|         I2|       null|       null|
|         A2|         A3|       null|       null|
|       null|       null|         G3|         G4|
+-----------+-----------+-----------+-----------+

The goal is to get a list of all the (distinct) nodes in the entire dataframe which would look like:

{A2,A3,B1,B3,E2,E3,G3,G4,I1,I2}

Potential approaches:

  • Union all the columns separately and distinct
  • flatMap and distinct
  • map and flatten

Since the structure is well known and simple, it seems like there should be an equally straightforward solution. Which of these approaches, or which other one, would be the simplest?
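For concreteness, the first approach in the list (union the columns, then distinct) might be sketched roughly as below. This is only a sketch under assumptions: the object name `UnionColumnsSketch` and the helper `distinctNodes` are made up for illustration, a local SparkSession is assumed, and every column is assumed to be castable to a string. The backticks are there because the column names in the example contain dots.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of any Spark API.
object UnionColumnsSketch {
  // Stack every column into a single "node" column, then drop nulls and duplicates.
  def distinctNodes(df: DataFrame): Set[String] =
    df.columns
      .map(c => df.select(col(s"`$c`").cast("string").as("node"))) // backticks: names may contain dots
      .reduce(_ union _)
      .na.drop()      // remove the null placeholders
      .distinct()     // dedupe on the cluster, before collecting
      .collect()
      .map(_.getString(0))
      .toSet

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder().master("local[*]").appName("union-columns").getOrCreate()
    import spark.implicits._

    // The example rows from the table above.
    val df = Seq[(String, String, String, String)](
      (null, null, "E2", "E3"),
      ("B3", "B1", null, null),
      ("I1", "I2", null, null),
      ("A2", "A3", null, null),
      (null, null, "G3", "G4")
    ).toDF("before.id1", "before.id2", "after.id1", "after.id2")

    println(distinctNodes(df).toSeq.sorted.mkString(","))
    spark.stop()
  }
}
```

Only the distinct values reach the driver here, which may matter toward the upper end of the row counts mentioned below.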

Other notes

  • Order of the id1-id2 pair is only important for change detection
  • Order in the resulting list is not important
  • DataFrame is between 10k and 100k rows
  • Distinct values in the resulting list are nice to have, but not required; I assume this is trivial with the distinct operation
4 Comments
  • So why is there no timestamp then? Order of tuples is thus significant? Many files to process in 1 run? Commented Nov 2, 2018 at 18:06
  • @thebluephantom timestamp is not needed since there is essentially a "before" and "after" table. Order of tuples is significant but somewhat out of scope since I have a dataframe of 4 columns with comparable values. No files to run, only this dataframe but it is considerable in size. Commented Nov 2, 2018 at 18:16
  • Where I come from some form of timestamping is always req'd - many projects with that approach you just mentioned mean -> non-deterministic outcomes. Good luck though Commented Nov 2, 2018 at 18:35
  • Timestamping in similar situations may reduce the need for even solving this scenario in the first place - but all the same I'd like to know how to flatten a simple dataframe like this. Commented Nov 2, 2018 at 19:18

1 Answer

Try the following: convert each row into a Seq, collect all rows to the driver, flatten the result, and remove the null value:

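// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import spark.implicits._
val df = Seq(("A", "B"), (null, "A")).toDF
// Row -> List, collect to the driver, flatten, dedupe via Set, then drop null
val result = df.rdd.map(_.toSeq.toList)
  .collect().toList.flatten.toSet - null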
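The snippet above collects every row to the driver before deduplicating. A possible variant, closer to the "flatMap and distinct" idea from the question, is sketched below; it drops nulls and deduplicates on the cluster first. The names `FlatMapSketch` and `distinctNodes` are hypothetical, and the sketch assumes all node columns are strings (any non-string value is silently skipped).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of any Spark API.
object FlatMapSketch {
  def distinctNodes(df: DataFrame): Set[String] = {
    val spark = df.sparkSession
    import spark.implicits._ // provides the Encoder[String] that flatMap needs

    df.flatMap(row => row.toSeq.collect { case s: String => s }) // a type pattern never matches null
      .distinct()  // dedupe before anything is sent to the driver
      .collect()
      .toSet
  }
}
```

With the two-row example from the answer, `FlatMapSketch.distinctNodes(df)` should give `Set("A", "B")`, typed as `Set[String]` rather than `Set[Any]`.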