
I have a DataFrame with a column that holds an array of key-value pair strings, and I want to keep only the keys. The number of key-value pairs varies from row to row, and the key naming conventions differ.

Sample Input
+---+----+-------------------------------+
|ID |data|value                          |
+---+----+-------------------------------+
|e1 |D1  |["K1":"V1","K2":"V2","K3":"V3"]|
|e2 |D2  |["K1":"V1","K3":"V3"]          |
|e3 |D1  |["K1":"V1","K2":"V2"]          |
|e4 |D3  |["K2":"V2","K1":"V1","K3":"V3"]|
+---+----+-------------------------------+


Expected Result:

+---+----+----------+
|ID |data|value     |
+---+----+----------+
|e1 |D1  |[K1|K2|K3]|
|e2 |D2  |[K1|K3]   |
|e3 |D1  |[K1|K2]   |
|e4 |D3  |[K2|K1|K3]|
+---+----+----------+

2 Answers


For Spark 2.4+, use the transform higher-order function.

For each element of the array, extract the key with substring_index (everything before the first ':'), then strip the leading and trailing double quotes with trim.

df.show(truncate=False)
#+---+----+------------------------------------+
#|ID |data|value                               |
#+---+----+------------------------------------+
#|e1 |D1  |["K1":"V1", "K2": "V2", "K3": "V3"] |
#|e2 |D2  |["K1": "V1", "K3": "V3"]            |
#|e3 |D1  |["K1": "V1", "K2": "V2"]            |
#|e4 |D3  |["K2": "V2", "K1": "V1", "K3": "V3"]|
#+---+----+------------------------------------+    

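from pyspark.sql.functions import expr

# For each array element: keep the text before the first ':' and strip the surrounding double quotes
new_value = """ transform(value, x -> trim(BOTH '"' FROM substring_index(x, ':', 1))) """
df.withColumn("value", expr(new_value)).show()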

#+---+----+------------+
#|ID |data|value       |
#+---+----+------------+
#|e1 |D1  |[K1, K2, K3]|
#|e2 |D2  |[K1, K3]    |
#|e3 |D1  |[K1, K2]    |
#|e4 |D3  |[K2, K1, K3]|
#+---+----+------------+
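
To make the per-element logic easier to follow, here is a minimal plain-Python sketch that mirrors what the SQL expression does to one row. It is only an illustration, not Spark code: the helper names extract_key and extract_keys are hypothetical, and the sample list assumes each array element is a string such as "K1":"V1" with literal double quotes.

```python
def extract_key(pair: str) -> str:
    # Mirrors substring_index(x, ':', 1): everything before the first ':'
    before_colon = pair.split(":", 1)[0]
    # Mirrors trim(BOTH '"' FROM ...): strip leading and trailing double quotes
    return before_colon.strip('"')


def extract_keys(pairs):
    # Mirrors transform(value, x -> ...): apply to each element, keeping order
    return [extract_key(p) for p in pairs]


# Assumed contents of the value column for row e4
row_e4 = ['"K2":"V2"', '"K1":"V1"', '"K3":"V3"']
print(extract_keys(row_e4))  # ['K2', 'K1', 'K3']
```

Because each element is handled independently, rows with different numbers of pairs or different key names need no special treatment.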

If you want the result as a single string delimited by |, wrap the expression in array_join:

from pyspark.sql.functions import array_join, expr

df.withColumn("value", array_join(expr(new_value), "|")).show()
#+---+----+--------+
#|ID |data|value   |
#+---+----+--------+
#|e1 |D1  |K1|K2|K3|
#|e2 |D2  |K1|K3   |
#|e3 |D1  |K1|K2   |
#|e4 |D3  |K2|K1|K3|
#+---+----+--------+
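
As a rough plain-Python analogy for the joining step (the helper name join_keys is hypothetical, not a Spark function): array_join concatenates the elements with the delimiter and, when no null replacement is given, skips null elements.

```python
def join_keys(keys, delimiter="|"):
    # Mirrors array_join(keys, delimiter): null (None) elements are skipped
    # because no replacement value is supplied
    return delimiter.join(k for k in keys if k is not None)


print(join_keys(["K1", "K2", "K3"]))  # K1|K2|K3
```

The result is a plain string column rather than an array, so later array functions would no longer apply to it.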

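Alternatively, split each element on ":" into a two-item array and keep the first item, which is the key. Note that trim without extra arguments strips only whitespace, so if the elements contain literal double quotes, they remain in the key.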

df.withColumn("keys", expr('transform(value, keyValue -> trim(split(keyValue, ":")[0]))')).drop("value")
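
The sketch below, in plain Python rather than Spark, shows what this split-and-trim approach yields for one element. It assumes the element is the string "K1":"V1" including its double quotes, and the variable names are illustrative only.

```python
# Assumed shape of a single array element, quotes included
pair = '"K1":"V1"'

# Mirrors trim(split(keyValue, ":")[0]): a plain trim removes spaces only
key_space_trimmed = pair.split(":")[0].strip(" ")

# Stripping the double quotes as well gives the bare key
key_quote_trimmed = key_space_trimmed.strip('"')

print(key_space_trimmed)   # "K1"
print(key_quote_trimmed)   # K1
```

If the stored strings carry no quotes, the plain trim is already enough.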

1 Comment

Right, there is a limitation on iterating over an ArrayType column, so you can change it to: df.withColumn("keys", expr('transform(value, keyValue -> trim(split(keyValue, ":")[0]))')).drop("value").
