How to concatenate nested json in Apache Spark

Question

Can someone let me know where I'm going wrong with my attempt to concatenate a nested JSON field.

I'm using the following code:

df = (df
       .withColumn("ingestion_date", current_timestamp())
       .withColumn("name", concat(col("name.forename"), 
             lit(" "), col("name.surname"))))
     )

Schema:

root  
|-- driverRef: string (nullable = true)  
|-- number: integer (nullable = true)  
|-- code: string (nullable = true)  
|-- forename: string (nullable = true)  
|-- surname: string (nullable = true)  
|-- dob: date (nullable = true)

As you can see, I'm trying to concatenate forname & surname, so as to provide a full name in the name field. At the present the data looks like the following:

After concatenating the 'name' field there should be one single value e.g. the 'name' field would just show Lewis Hamilton, and like wise for the other values in the 'name' field.

My code produces the following error:

Can't extract value from name#6976: need struct type but got string

Can you include the scheme of the dataframe using df.printSchema()? — Nithish
– Nithish, Commented Nov 13, 2021 at 18:18
The full schema is as follows: root |-- driverRef: string (nullable = true) |-- number: integer (nullable = true) |-- code: string (nullable = true) |-- forename: string (nullable = true) |-- surname: string (nullable = true) |-- dob: date (nullable = true) — Patterson
– Patterson, Commented Nov 13, 2021 at 18:23
Sorry, I don't know how to edit the code to make it look neater — Patterson
– Patterson, Commented Nov 13, 2021 at 18:23
you can edit the question to update it with the schama, it's very to get the nesting right since answers would depend on it. — Nithish
– Nithish, Commented Nov 13, 2021 at 19:04
In the schema you've provided I can't see name column. From the error I suspect that name column is a string and you cannot extract values from it like it's a map. Update schema to have a MapType(StringType(), StringType()) and you should be able to access the values. — vladsiv
– vladsiv, Commented Nov 13, 2021 at 22:24

Oli · Accepted Answer · 2021-11-14 16:52:15Z

It would seem that you have a dataframe that contains a name column containing a json with two values: forename and surname, just like this {"forename": "Lewis", "surname" : "Hamilton"}.

That column, in spark, has a string type. That explains the error you obtain. You could only do name.forename if name were of type struct with a field called forename. That what spark means by need struct type but got string.

You just need to tell spark that this string column is a JSON and how to parse it.

from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f

# initializing data
df = spark.range(1).withColumn('name',
       f.lit('{"forename": "Lewis", "surname" : "Hamilton"}'))
df.show(truncate=False)

+---+---------------------------------------------+
|id |name                                         |
+---+---------------------------------------------+
|0  |{"forename": "Lewis", "surname" : "Hamilton"}|
+---+---------------------------------------------+

And parsing that JSON:

json_schema = StructType([
    StructField('forename', StringType()),
    StructField('surname', StringType())
])

df\
    .withColumn('s', f.from_json(f.col('name'), json_schema))\
    .withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))\
    .show()

+---+--------------+-----------------+
| id|          name|                s|
+---+--------------+-----------------+
|  0|Lewis Hamilton|{Lewis, Hamilton}|
+---+--------------+-----------------+

You may than get rid of s with drop, it contains the parsed struct.

Collectives™ on Stack Overflow

How to concatenate nested json in Apache Spark

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related