
I am using PySpark to transform JSON into a DataFrame, and the transformation itself works. The problem I am facing is that there is a key which is present in some JSON files but not in others. When I flatten the JSON with the PySpark SQL context and the key is not present in a JSON file, creating my PySpark DataFrame fails with an AnalysisException.

For example, my sample JSON:

{
    "_id" : ObjectId("5eba227a0bce34b401e7899a"),
    "origin" : "inbound",
    "converse" : "72412952",
    "Start" : "2020-04-20T06:12:20.89Z",
    "End" : "2020-04-20T06:12:53.919Z",
    "ConversationMos" : 4.88228940963745,
    "ConversationRFactor" : 92.4383773803711,
    "participantId" : "bbe4de4c-7b3e-49f1-8",
}

The participantId key above is present in some JSON files and absent from others.

My PySpark code snippet:

fetchFile = spark.read.format(file_type)\
                .option("inferSchema", "true")\
                .option("header", "true")\
                .load(generated_FileLocation)

fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")

When participantId is not present in a JSON file, an exception is raised. How can I handle this so that, if the key is missing, the column contains null? Or is there another way to handle it?

  • Why not programmatically check if the schema contains the column and add it if it's not present? Commented May 14, 2020 at 5:10
  • May be stackoverflow.com/questions/32166812/… can help. Commented May 14, 2020 at 5:19

2 Answers


You can simply check whether the column is missing and, if it is, add it with empty values. The code for that goes like:

from pyspark.sql import functions as f

df = spark.read.format(file_type)\
        .option("inferSchema", "true")\
        .option("header", "true")\
        .load(generated_FileLocation)

if 'participantId' not in df.columns:
    df = df.withColumn('participantId', f.lit(''))

df.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin,converse,start,end,participantId from CreateDataFrame")
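The same check generalizes to any list of required columns. Below is a minimal sketch (the helper name missing_columns and the column list are illustrative, not from the answer) that computes which expected columns a DataFrame lacks; it is plain Python, so it works on any df.columns list:

```python
def missing_columns(expected, actual):
    """Return the expected column names that are absent from `actual`."""
    present = set(actual)
    return [c for c in expected if c not in present]

# Columns the downstream SQL query relies on.
expected = ["origin", "converse", "Start", "End", "participantId"]

# What df.columns might look like for a file missing participantId.
actual = ["origin", "converse", "Start", "End"]

print(missing_columns(expected, actual))  # → ['participantId']
```

Each returned name could then be added with df.withColumn(name, f.lit(None).cast('string')), which gives a true null column rather than an empty string.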

2 Comments

Thanks, Shubham Jain, let me try this. This looks clean.
Accept as answer if it helps you...:)

I think you're calling Spark to read one file at a time and inferring the schema at the same time.

What Spark is telling you with the AnalysisException is that your file, and therefore your inferred schema, doesn't have the key you're looking for. What you have to do is build a good schema and apply it to all of the files you want to process, ideally processing all of your files at once.

There are three strategies:

  1. Infer your schema from lots of files. You should get the aggregate of all of the keys, but Spark will run two passes over the data.

df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)

  2. Create a schema object. I find this tedious to do, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType

  3. Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy and paste it back into your code for option #2.

schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')
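For option #2, a shortcut worth knowing: since Spark 2.3, DataFrameReader.schema also accepts a DDL-formatted string, which avoids constructing a StructType by hand. A minimal sketch, with field names taken from the sample document in the question (the types are guesses; adjust them to your data):

```python
# DDL-formatted schema derived from the sample document in the question.
# Pass it directly: df = spark.read.schema(schema_ddl).json('/path/to/files')
# Files missing participantId then load with that column as null.
schema_ddl = (
    "_id STRING, origin STRING, converse STRING, "
    "Start TIMESTAMP, End TIMESTAMP, "
    "ConversationMos DOUBLE, ConversationRFactor DOUBLE, "
    "participantId STRING"
)

# Field names can be recovered from the DDL string for a quick sanity check.
field_names = [col.strip().split()[0] for col in schema_ddl.split(",")]
print(field_names)  # → ['_id', 'origin', 'converse', 'Start', 'End', 'ConversationMos', 'ConversationRFactor', 'participantId']
```

The printed StructType from option #1 or #3 remains the more precise route when types matter; the DDL string is just less typing for a flat document like this one.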

2 Comments

Thanks for the reply, Douglas, but I am calling the files one by one because the scenario demands it: consider that we are doing this for different users, and after completing one user we move on to the next. Some users have the participant key and some don't, because of the use case, and querying participantId in files that don't have the key is exactly what I want to handle.
That's option #2: create a schema object schema, and then run my_df = spark.read.schema(schema).json('path/to/entire/folder/specific_file.json')
