I am trying to find the maximum value across several columns within a single row of a Spark DataFrame in Scala.

The DataFrame contains the following data:

+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|    NUM|                                   SIG1|                                   SIG2|                                   SIG3|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531001,"VALUE":4.7825}]|[{"TIME":1569560531002,"VALUE":2.7825}]|
|XXXXX01|[{"TIME":1569560541001,"VALUE":1.7825}]|[{"TIME":1569560541000,"VALUE":8.7825}]|[{"TIME":1569560541003,"VALUE":5.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531009,"VALUE":3.7825}]|                                   null|
|XXXXX02|[{"TIME":1569560531000,"VALUE":5.7825}]|[{"TIME":1569560531007,"VALUE":8.7825}]|[{"TIME":1569560531006,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531009,"VALUE":1.7825}]|[{"TIME":1569560531010,"VALUE":3.7825}]|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+

The schema is:

scala> DF.printSchema
root
 |-- NUM: string (nullable = true)
 |-- SIG1: string (nullable = true)
 |-- SIG2: string (nullable = true)
 |-- SIG3: string (nullable = true)

The expected output is:


+-------+-------------+------+------+------+
|    NUM|         TIME|  SIG1|  SIG2|  SIG3|
+-------+-------------+------+------+------+
|XXXXX01|1569560531002|3.7825|4.7825|2.7825|
|XXXXX01|1569560541003|1.7825|8.7825|5.7825|
|XXXXX01|1569560531009|3.7825|3.7825|  null|
|XXXXX02|1569560531007|5.7825|8.7825|3.7825|
|XXXXX02|1569560531010|9.7825|1.7825|3.7825|
+-------+-------------+------+------+------+

I need to add a new TIME column holding the highest TIME found in each row, and reduce each SIG column to its VALUE only.

In other words, the separate TIME entries inside SIG1, SIG2 and SIG3 should be replaced by a single column containing the row's maximum TIME, and each JSON string should be flattened to its VALUE.

Is there a built-in function or a UDF that can do this? Thanks in advance.

1 Answer

Use the get_json_object function to extract values from JSON that is stored as a string. The path "$[0].TIME" reads the TIME field of the first element of the array.

After that it is quite straightforward:

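import org.apache.spark.sql.functions.{col, get_json_object, greatest}

// get_json_object returns strings, so cast TIME to long to compare numerically.
def time(c: String) = get_json_object(col(c), "$[0].TIME").cast("long")
def value(c: String) = get_json_object(col(c), "$[0].VALUE").cast("double")

// greatest skips nulls, so the row with a null SIG3 still gets a TIME.
// TIME must be computed before the SIG columns are overwritten.
DF.withColumn("TIME", greatest(time("SIG1"), time("SIG2"), time("SIG3")))
  .withColumn("SIG1", value("SIG1"))
  .withColumn("SIG2", value("SIG2"))
  .withColumn("SIG3", value("SIG3"))
  .select("NUM", "TIME", "SIG1", "SIG2", "SIG3")
  .show(false)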
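As an alternative that is not part of the answer above, the JSON strings can be parsed once into typed columns with from_json, which avoids comparing TIME values as strings. The following is a minimal sketch under stated assumptions: every SIG string is a JSON array of objects with a numeric TIME and VALUE, only the first array element matters, and the names MaxTimeSketch, sigSchema and withMaxTime are hypothetical, introduced here for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, greatest}
import org.apache.spark.sql.types._

object MaxTimeSketch {
  // Assumed layout of every SIG string: a JSON array of {TIME, VALUE} objects.
  val sigSchema: ArrayType = ArrayType(StructType(Seq(
    StructField("TIME", LongType),
    StructField("VALUE", DoubleType))))

  // Hypothetical helper: adds TIME (row-wise maximum) and flattens each SIG column to its VALUE.
  // greatest needs at least two columns, so sigCols must have two or more entries.
  def withMaxTime(df: DataFrame, sigCols: Seq[String] = Seq("SIG1", "SIG2", "SIG3")): DataFrame = {
    // Parse each JSON string once and keep the first array element as a struct.
    val parsed = sigCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(c, from_json(col(c), sigSchema).getItem(0))
    }
    // greatest skips nulls, so a missing signal does not null out TIME.
    val maxTime = greatest(sigCols.map(c => col(c).getField("TIME")): _*)
    val values = sigCols.map(c => col(c).getField("VALUE").as(c))
    parsed.select(Seq(col("NUM"), maxTime.as("TIME")) ++ values: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("max-time-sketch").getOrCreate()
    import spark.implicits._

    // Two rows copied from the question; Option.empty stands in for the null SIG3.
    val df = Seq(
      ("XXXXX01", """[{"TIME":1569560531000,"VALUE":3.7825}]""",
        """[{"TIME":1569560531001,"VALUE":4.7825}]""",
        Option("""[{"TIME":1569560531002,"VALUE":2.7825}]""")),
      ("XXXXX01", """[{"TIME":1569560531000,"VALUE":3.7825}]""",
        """[{"TIME":1569560531009,"VALUE":3.7825}]""",
        Option.empty[String])
    ).toDF("NUM", "SIG1", "SIG2", "SIG3")

    withMaxTime(df).show(false)
    spark.stop()
  }
}
```

Because TIME is parsed as a long, the comparison is numeric rather than lexicographic, which matters as soon as the timestamps do not all have the same number of digits. If a SIG array could hold more than one element, this sketch would still look only at the first one.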