Extract key value from dataframe in PySpark

Question

I have the below dataframe which I have read from a JSON file.

1	2	3	4
{"todo":["wakeup", "shower"]}	{"todo":["brush", "eat"]}	{"todo":["read", "write"]}	{"todo":["sleep", "snooze"]}

I need my output in key/value form, as shown below. How do I do this? Do I need to create a schema?

ID	todo
1	wakeup, shower
2	brush, eat
3	read, write
4	sleep, snooze

ZygD · Accepted Answer · 2022-10-17 19:20:46Z

1

The key-value pair you refer to is a struct: the "keys" are struct field names, while the "values" are field values.

What you want to do is called unpivoting. One way to do it in PySpark is with stack. The following is a dynamic approach, where you don't need to list the existing column names by hand.

Input dataframe:

df = spark.createDataFrame(
    [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')

Script:

to_melt = [f"'{c}', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")

df.show()
# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
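The output requested in the question shows todo as a comma-separated string rather than an array. A minimal sketch of one way to get that, assuming the Spark SQL function array_join is available (Spark 2.4+): wrap each struct field in array_join inside the same stack expression. The helper names build_stack_expr and demo are hypothetical, and the Spark import is kept inside demo so the string-building helper can be checked on its own.

```python
def build_stack_expr(columns, field="todo", sep=", "):
    """Build a stack() expression that unpivots struct columns and
    joins each array field into a single delimited string."""
    pairs = [f"'{c}', array_join(`{c}`.{field}, '{sep}')" for c in columns]
    return f"stack({len(pairs)}, {', '.join(pairs)}) as (ID, {field})"


def demo():
    # Imported lazily so build_stack_expr works without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [((['wakeup', 'shower'],), (['brush', 'eat'],))],
        '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>')
    # Each input column becomes one row: ID = column name, todo = joined string.
    df.selectExpr(build_stack_expr(df.columns)).show()
```

Calling demo() with a working Spark setup should print one row per original column, with todo as plain text instead of an array.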

edited Oct 17, 2022 at 19:20

answered Oct 17, 2022 at 19:15


2 Comments

RData Over a year ago

I used an underscore instead of the '.' in the column name because of an issue, but this leads to another error: extraneous input 'todo' expecting {')', ','}

ZygD Over a year ago

The . is not part of the name; it references a struct field. The dot doesn't end up anywhere in the output: once it is used, the struct field is extracted. However, if you don't want to use it, there is one more option: instead of .todo you can use ['todo'].
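The bracket form mentioned in this comment can be sketched as follows. It is the same stack approach as in the answer, with `{c}`['todo'] in place of `{c}`.todo; the helper name build_stack_expr_brackets is hypothetical and only builds the expression string to pass to selectExpr.

```python
def build_stack_expr_brackets(columns, field="todo"):
    """Build a stack() expression that extracts the struct field with
    bracket syntax (`col`['field']) instead of dot syntax (`col`.field)."""
    pairs = [f"'{c}', `{c}`['{field}']" for c in columns]
    return f"stack({len(pairs)}, {', '.join(pairs)}) as (ID, {field})"


# Usage with an existing DataFrame df whose columns are structs:
#   df.selectExpr(build_stack_expr_brackets(df.columns)).show()
print(build_stack_expr_brackets(["1", "2"]))
```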

wwnde · Accepted Answer · 2022-10-17 23:37:33Z

1

Use from_json to parse each JSON string, collect the values into one array, then explode it so that each element gets its own row.

data

df = spark.createDataFrame(
    [(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
    ('value1','values2','value3','value4'))

code

from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from string to array to individual rows
         .withColumn('todo', translate('todo', "[]", '')))  # remove the square brackets
new.show(truncate=False)

outcome

+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1                     |values2                |value3                  |value4                    |todo          |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat    |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write   |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+

edited Oct 17, 2022 at 23:37

answered Oct 17, 2022 at 23:30


1 Comment

RData Over a year ago

I changed my input quotes. I am also seeing an error: data type mismatch in entries from argument 1 requires string type, however, '1' is of struct<todo:array<string>> type
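The error in this comment suggests the columns are already structs rather than JSON strings, in which case from_json does not apply. A hedged sketch of one way to check before choosing an approach is to inspect df.dtypes, which returns (column name, type string) pairs. The helper name struct_columns is hypothetical.

```python
def struct_columns(dtypes):
    """Return the names of columns whose Spark type string is a struct.

    `dtypes` is a list of (name, type_string) pairs, as returned by df.dtypes.
    """
    return [name for name, dtype in dtypes if dtype.startswith("struct")]


# Usage with an existing DataFrame df:
#   struct_columns(df.dtypes)
# Columns listed here can be unpivoted with field access (`col`.todo);
# plain string columns would need from_json first.
```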
