
I am trying to parse a JSON file and selectively read only 50+ of its 800+ data elements into a DataFrame in PySpark. One of the data elements (issues.customfield_666) is a struct type with three fields under it (Id/Name/Tag). Sometimes the data in this struct field comes as null, and when that happens the Spark job fails with the error below. How can I ignore/suppress this error for null values?

The error happens only when parsing JSON file #1 (where customfield_666 is always null).

AnalysisException: Can't extract value from issues.customfield_666: need struct type but got string

JSON File 1 (where customfield_666 is only ever null)
{
    "startAt": 0,
    "total": 1,
    "issues": [
        {
            "id": "1",
            "key": "BSE-444",
            "issuetype": {
                "id": "30",
                "name": "Epic1"
            },
            "customfield_666": null
        }
    ]
}


JSON File 2 (where customfield_666 has both null and struct values)
{
    "startAt": 0,
    "total": 2,
    "issues": [
        {
            "id": "1",
            "key": "BSE-444",
            "issuetype": {
                "id": "30",
                "name": "Epic1"
            },
            "customfield_666": null
        },
        {
            "id": "2",
            "key": "BSE-555",
            "issuetype": {
                "id": "40",
                "name": "Epic2"
            },
            "customfield_666": {
                "id": "666-01",
                "tag": "Smoke Testing"
            }
        }
    ]
}

Below is the PySpark code used to parse the above JSON data.

from pyspark.sql.functions import *

rawDF = spark.read.json("abfss://[email protected]/raw/MyData.json", multiLine=True)

DF = rawDF.select(explode("issues").alias("issues")) \
          .select(
              col("issues.id").alias("IssueId"),
              col("issues.key").alias("IssueKey"),
              col("issues.fields").alias("IssueFields"),
              col("issues.issuetype.name").alias("IssueTypeName"),
              col("issues.customfield_666.tag").alias("IssueCust666Tag")
          )

1 Answer


You can check whether it is null first:

from pyspark.sql import functions as F

DF = rawDF.select(F.explode("issues").alias("issues")) \
          .select(
              F.col("issues.id").alias("IssueId"),
              F.col("issues.key").alias("IssueKey"),
              F.col("issues.fields").alias("IssueFields"),
              F.col("issues.issuetype.name").alias("IssueTypeName"),
              F.when(
                  F.col("issues.customfield_666").isNull()
                  | (F.trim(F.col("issues.customfield_666").cast("string")) == ""),
                  None
              ).otherwise(
                  F.col("issues.customfield_666.tag")
              ).alias("IssueCust666Tag")
          )

Let me know if this works for you.


5 Comments

Hi @ggordon, sorry, I was a little quick to accept your answer. Your solution works if the file has a mix of string type (null values) and struct type. But if it has all nulls, I still get the same error. Any suggestion?
@Kris I've updated the answer. Let me know if this works for you.
Thanks @ggordon! I tried your suggestion with trim/cast, but I'm still getting the same error.
@Kris Would you be able to share the output of rawDf.show() for some of these problematic records you've identified? I'm finding it hard to replicate on my end.
Hi @ggordon I added 2 source files to the question (in the code section). File 1 is giving the error. Please let me know if this helps or if you still prefer to have output from rawDf.show(). Thank you!!
