1

Can someone let me know where I'm going wrong with my attempt to concatenate a nested JSON field.

I'm using the following code:

df = (df
       .withColumn("ingestion_date", current_timestamp())
       .withColumn("name", concat(col("name.forename"), 
             lit(" "), col("name.surname"))))
     )

Schema:

root  
|-- driverRef: string (nullable = true)  
|-- number: integer (nullable = true)  
|-- code: string (nullable = true)  
|-- forename: string (nullable = true)  
|-- surname: string (nullable = true)  
|-- dob: date (nullable = true)

As you can see, I'm trying to concatenate forname & surname, so as to provide a full name in the name field. At the present the data looks like the following:

enter image description here

After concatenating the 'name' field there should be one single value e.g. the 'name' field would just show Lewis Hamilton, and like wise for the other values in the 'name' field.

My code produces the following error:

Can't extract value from name#6976: need struct type but got string
5
  • Can you include the scheme of the dataframe using df.printSchema()? Commented Nov 13, 2021 at 18:18
  • The full schema is as follows: root |-- driverRef: string (nullable = true) |-- number: integer (nullable = true) |-- code: string (nullable = true) |-- forename: string (nullable = true) |-- surname: string (nullable = true) |-- dob: date (nullable = true) Commented Nov 13, 2021 at 18:23
  • Sorry, I don't know how to edit the code to make it look neater Commented Nov 13, 2021 at 18:23
  • you can edit the question to update it with the schama, it's very to get the nesting right since answers would depend on it. Commented Nov 13, 2021 at 19:04
  • 2
    In the schema you've provided I can't see name column. From the error I suspect that name column is a string and you cannot extract values from it like it's a map. Update schema to have a MapType(StringType(), StringType()) and you should be able to access the values. Commented Nov 13, 2021 at 22:24

1 Answer 1

2

It would seem that you have a dataframe that contains a name column containing a json with two values: forename and surname, just like this {"forename": "Lewis", "surname" : "Hamilton"}.

That column, in spark, has a string type. That explains the error you obtain. You could only do name.forename if name were of type struct with a field called forename. That what spark means by need struct type but got string.

You just need to tell spark that this string column is a JSON and how to parse it.

from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f

# initializing data
df = spark.range(1).withColumn('name',
       f.lit('{"forename": "Lewis", "surname" : "Hamilton"}'))
df.show(truncate=False)
+---+---------------------------------------------+
|id |name                                         |
+---+---------------------------------------------+
|0  |{"forename": "Lewis", "surname" : "Hamilton"}|
+---+---------------------------------------------+

And parsing that JSON:

json_schema = StructType([
    StructField('forename', StringType()),
    StructField('surname', StringType())
])

df\
    .withColumn('s', f.from_json(f.col('name'), json_schema))\
    .withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))\
    .show()
+---+--------------+-----------------+
| id|          name|                s|
+---+--------------+-----------------+
|  0|Lewis Hamilton|{Lewis, Hamilton}|
+---+--------------+-----------------+

You may than get rid of s with drop, it contains the parsed struct.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.