Find max value from different columns in a single row in scala DataFrame

Question

I am trying to find the maximum value across several columns within a single row of a Scala (Spark) DataFrame.

The data in the DataFrame is as follows.

+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|    NUM|                                   SIG1|                                   SIG2|                                   SIG3|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531001,"VALUE":4.7825}]|[{"TIME":1569560531002,"VALUE":2.7825}]|
|XXXXX01|[{"TIME":1569560541001,"VALUE":1.7825}]|[{"TIME":1569560541000,"VALUE":8.7825}]|[{"TIME":1569560541003,"VALUE":5.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531009,"VALUE":3.7825}]|                                   null|
|XXXXX02|[{"TIME":1569560531000,"VALUE":5.7825}]|[{"TIME":1569560531007,"VALUE":8.7825}]|[{"TIME":1569560531006,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531009,"VALUE":1.7825}]|[{"TIME":1569560531010,"VALUE":3.7825}]|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+

The schema is:

scala> DF.printSchema
root
 |-- NUM: string (nullable = true)
 |-- SIG1: string (nullable = true)
 |-- SIG2: string (nullable = true)
 |-- SIG3: string (nullable = true)

The expected output is as below.


+-------+-------------+------+------+------+
|    NUM|         TIME|  SIG1|  SIG2|  SIG3|
+-------+-------------+------+------+------+
|XXXXX01|1569560531002|3.7825|4.7825|2.7825|
|XXXXX01|1569560541003|1.7825|8.7825|5.7825|
|XXXXX01|1569560531009|3.7825|3.7825|  null|
|XXXXX02|1569560531007|5.7825|8.7825|3.7825|
|XXXXX02|1569560531010|9.7825|1.7825|3.7825|
+-------+-------------+------+------+------+

I need to add a new TIME column holding the highest TIME found in each row, and reduce each SIG column to its VALUE only.

In other words, the per-column TIME values are replaced by the single highest TIME in that row, and the TIME and VALUE fields are extracted into plain columns.

Is there a UDF or built-in function to achieve this? Thanks in advance.

Possible duplicate of Iterate through a column in Dataset which have array of key value pairs and find out a pair with max value
– NIKHIL SUTHAR, Commented Oct 16, 2019 at 9:38
I had already provided a solution to the same issue at stackoverflow.com/questions/58128746/…
– NIKHIL SUTHAR, Commented Oct 16, 2019 at 9:39

Kombajn zbożowy · Accepted Answer · 2019-10-17 09:12:45Z

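Use the get_json_object function to extract values from JSON stored as a string.

Once the TIME fields are extracted, greatest picks the row-wise maximum. It skips null inputs, so the row with a null SIG3 still gets a TIME: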

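import org.apache.spark.sql.functions.{get_json_object, greatest}
import spark.implicits._  // enables the 'SIG1 column syntax (already in scope in spark-shell)

// get_json_object returns strings; cast TIME to long so greatest compares numerically
DF.withColumn("TIME", greatest(get_json_object('SIG1, "$[0].TIME").cast("long"),
                               get_json_object('SIG2, "$[0].TIME").cast("long"),
                               get_json_object('SIG3, "$[0].TIME").cast("long")))
  .withColumn("SIG1", get_json_object('SIG1, "$[0].VALUE"))
  .withColumn("SIG2", get_json_object('SIG2, "$[0].VALUE"))
  .withColumn("SIG3", get_json_object('SIG3, "$[0].VALUE"))
  .show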
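
For reference, here is a minimal self-contained sketch of the same get_json_object plus greatest approach, generalized over a list of signal columns. The object name MaxTimePerRow, the function withMaxTime and its sigCols parameter are made up for illustration, and the local SparkSession and two sample rows are assumptions so that it runs outside spark-shell. Because get_json_object returns strings, the sketch casts TIME to long before comparing. A string comparison would be lexicographic, which only gives the right answer when all timestamps have the same number of digits.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, get_json_object, greatest}

// Hypothetical helper object; the names are illustrative, not part of the answer.
object MaxTimePerRow {

  // Adds a numeric TIME column holding the row-wise maximum and replaces
  // each signal column by the VALUE of its first JSON array element.
  // greatest needs at least two columns, so sigCols must have two or more entries.
  def withMaxTime(df: DataFrame,
                  sigCols: Seq[String] = Seq("SIG1", "SIG2", "SIG3")): DataFrame = {
    // Extract TIME from each signal column and cast the string result to long.
    val times = sigCols.map(c => get_json_object(col(c), "$[0].TIME").cast("long"))
    // greatest skips nulls, so a missing signal does not null out TIME.
    val withTime = df.withColumn("TIME", greatest(times: _*))
    // Overwrite each signal column with its VALUE, cast to double.
    sigCols.foldLeft(withTime) { (acc, c) =>
      acc.withColumn(c, get_json_object(col(c), "$[0].VALUE").cast("double"))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("max-time-per-row")
      .getOrCreate()
    import spark.implicits._

    // Two rows taken from the question; Option models the nullable SIG3.
    val rows: Seq[(String, String, String, Option[String])] = Seq(
      ("XXXXX01",
        """[{"TIME":1569560531000,"VALUE":3.7825}]""",
        """[{"TIME":1569560531001,"VALUE":4.7825}]""",
        Some("""[{"TIME":1569560531002,"VALUE":2.7825}]""")),
      ("XXXXX01",
        """[{"TIME":1569560531000,"VALUE":3.7825}]""",
        """[{"TIME":1569560531009,"VALUE":3.7825}]""",
        None)
    )
    val df = rows.toDF("NUM", "SIG1", "SIG2", "SIG3")

    withMaxTime(df).select("NUM", "TIME", "SIG1", "SIG2", "SIG3").show(false)
    spark.stop()
  }
}
```

Casting VALUE to double is optional. Without the cast, the SIG columns stay strings, as in the snippet above.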

edited Oct 17, 2019 at 9:12

answered Oct 16, 2019 at 9:34
