1

I have the below dataframe which I have read from a JSON file.

1 2 3 4
{"todo":["wakeup", "shower"]} {"todo":["brush", "eat"]} {"todo":["read", "write"]} {"todo":["sleep", "snooze"]}

I need my output to be as below Key and Value. How do I do this? Do I need to create a schema?

ID todo
1 wakeup, shower
2 brush, eat
3 read, write
4 sleep, snooze
0

2 Answers 2

1

The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.

What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to provide existent column names.

Input dataframe:

df = spark.createDataFrame(
    [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')

Script:

to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")

df.show()
# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
Sign up to request clarification or add additional context in comments.

2 Comments

used underscore instead of '.' dots in the column name due to an issue but this is leading to another error : extraneous input 'todo' expecting {')', ','}
. is not for name, it's for referencing struct fields. The dot doesn't go anywhere, after using it, the struct field is extracted. However, if you don't want to use it, there is one more option: instead of .todo you can use ['todo']
1

Use from_json to convert string to array. Explode to cascade each unique element to row.

data

df = spark.createDataFrame(
    [(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
    ('value1','values2','value3','value4'))

code

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns])))) #From string to array to indivicual row
   .withColumn('todo', translate('todo',"[]",'')#Remove corner brackets
              ) ).show(truncate=False)

outcome

+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1                     |values2                |value3                  |value4                    |todo          |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat    |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write   |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+

1 Comment

i changed my input quotes . Also, seeing an error data type mismatch in entries from argument 1 requires string type, however, '1' is of struct<todo:array<string>> type

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.