Explode a string column with dictionary structure in PySpark

Question

I have a pyspark dataframe which looks like this-

RowNumber	value
1	[{[email protected], Name=abc}, {[email protected], Name=mnc}]
2	[{[email protected], Name=klo}, {[email protected], Name=mmm}]

The column "value" is of string type.

root
  |--value: string (nullable = false)
  |--rowNumber: integer (nullable = false)

Step 1 I need to explode the dictionaries inside the list on each row under column "value" like this-

Step 2 And then to further explode the column so that the resultant table looks like :

Although When I try to get to Step1 using:

df.select(explode(col('value')).alias('value'))

it shows me error:

Analysis Exception: cannot resolve 'explode("value")' due to data type mismatch: input to function explode should be array or map type, not string

How do I convert this string under column 'value' to compatible data types so that I can proceed with exploding the dictionary elements as valid array/json (step1) and then into separate columns (step2) ?

please help

dsillman2000 · Accepted Answer · 2022-05-11 19:11:16Z

1

EDIT: There may be a simpler way to do this with the from_json function and StructTypes, for that you can check out this link.

The `ast` way

To parse a string as a dictionary or list in Python, you can use the ast library's literal_eval function. If we convert this function to a PySpark UDF, then the following code will suffice:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import MapType
from ast import literal_eval

literal_eval_udf = udf(literal_eval, ArrayType(MapType()))

table = table.withColumn("value", literal_eval_udf(col("value"))) # Make strings into ArrayTypes of MapTypes
table = table.withColumn("value", explode(col("value"))) # Explode ArrayTypes such that each row contains a MapType

After applying these functions to the table, what should remain is what you originally referred to as the start of "step 2." From here, we want to split each "value" column key into a column with entries from the corresponding value. This is accomplished with another function application which gives us the dict values:

table = table.withColumn("value", map_values(col("value")))

Now the values column contains an ArrayType of the values contained in each dictionary. To make a separate column for each of these, we simply add them in a loop:

keys = ['mail', 'Name']
for k in range(len(keys)):
  table = table.withColumn(keys[k], table.value[k])

Then you can drop the original value column because we wouldn't need it anymore, as you'll now have the columns mail and Name with the information from the corresponding maps.

edited May 11, 2022 at 19:11

answered May 11, 2022 at 18:14

dsillman2000

1,0471 gold badge9 silver badges20 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

ABusy_developerT Over a year ago

I am getting error while trying to call the literal_eval function: ``` TypeError: __init__() missing 2 required positional arguments: 'keyType' and 'valueType' ```

Collectives™ on Stack Overflow

Explode a string column with dictionary structure in PySpark

1 Answer 1

The `ast` way

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

The ast way

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related

The `ast` way