I have the data in a JSON file (showing just the first row):

{
    "cd_created_date": "2021-10-05T21:33:39.480933",
    "cd_jurisdiction": "PROBATE",
    "cd_last_modified": "2021-10-05T21:35:04.061105",
    "cd_last_state_modified_date": "2021-10-05T21:35:04.060968",
    "cd_latest_state": "WillWithdrawn",
    "cd_reference": 1633469619443286,
    "cd_security_classification": "PUBLIC",
    "cd_version": 7,
    "ce_case_data_id": 3483511,
    "ce_case_type_id": "WillLodgement",
    "ce_case_type_version": 170,
    "ce_created_date": "2021-10-05T21:33:51.189872",
    "ce_data": "{\"willDate\": \"1950-01-01\", \"jointWill\": \"Yes\", \"lodgedDate\": \"1970-03-03\", \"codicilDate\": \"1962-02-02\", \"executorTitle\": \"Mr\", \"lodgementType\": \"safeCustody\", \"deceasedGender\": \"male\", \"applicationType\": \"Personal\", \"deceasedAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"deceasedSurname\": \"E2E_deceased_surname_1633469477956\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor1_surname\", \"numberOfCodicils\": \"3\", \"registryLocation\": \"Liverpool\", \"deceasedForenames\": \"E2E_deceased_forenames_1633469477956\", \"documentsUploaded\": [{\"id\": \"b1181bfb-d0a7-49d8-8301-b06e58eb42c1\", \"value\": {\"Comment\": \"test file to upload\", \"DocumentLink\": {\"document_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d\", \"document_filename\": \"test_file_for_document_upload.png\", \"document_binary_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d/binary\"}, \"DocumentType\": \"email\"}}], \"executorForenames\": \"executor1_forenames\", \"deceasedDateOfBirth\": \"1930-01-01\", \"deceasedDateOfDeath\": \"2017-01-01\", \"deceasedTypeOfDeath\": \"diedOnOrAbout\", \"deceasedEmailAddress\": \"[email protected]\", \"executorEmailAddress\": \"[email protected]\", \"deceasedAnyOtherNames\": \"Yes\", \"additionalExecutorList\": [{\"id\": \"bd8d7ca2-ed84-424c-a241-40fd99b15596\", \"value\": {\"executorTitle\": \"Dr\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor2_surname\", \"executorForenames\": \"executor2_forenames\", \"executorEmailAddress\": \"[email protected]\"}}], \"deceasedFullAliasNameList\": [{\"id\": \"1970ac9d-532e-48f2-8851-04ae6eec973f\", \"value\": {\"FullAliasName\": \"deceased_alias1_1633469477956\"}}, {\"id\": \"86c842aa-e10f-44d5-8c28-a7ce8a1cb0eb\", \"value\": {\"FullAliasName\": \"deceased_alias2\"}}]}",
    "ce_description": "upload_document_event_description_text",
    "ce_event_id": "uploadDocument",
    "ce_event_name": "Upload document",
    "ce_id": 30638630,
    "ce_security_classification": "PUBLIC",
    "ce_state_id": "WillLodgementCreated",
    "ce_state_name": "Will lodgement created",
    "ce_summary": "upload_document_event_summary_text",
    "ce_user_first_name": "Probate",
    "ce_user_id": "349978",
    "ce_user_last_name": "Backoffice",
    "extraction_date": "2021-10-06"
}

As you can see, the field ce_data contains arrays.

When I read the JSON into Apache Spark on Databricks, printSchema() gives the following:

root
 |-- cd_created_date: string (nullable = true)
 |-- cd_jurisdiction: string (nullable = true)
 |-- cd_last_modified: string (nullable = true)
 |-- cd_last_state_modified_date: string (nullable = true)
 |-- cd_latest_state: string (nullable = true)
 |-- cd_reference: long (nullable = true)
 |-- cd_security_classification: string (nullable = true)
 |-- cd_version: long (nullable = true)
 |-- ce_case_data_id: long (nullable = true)
 |-- ce_case_type_id: string (nullable = true)
 |-- ce_case_type_version: long (nullable = true)
 |-- ce_created_date: string (nullable = true)
 |-- ce_data: string (nullable = true)
 |-- ce_description: string (nullable = true)
 |-- ce_event_id: string (nullable = true)
 |-- ce_event_name: string (nullable = true)
 |-- ce_id: long (nullable = true)
 |-- ce_security_classification: string (nullable = true)
 |-- ce_state_id: string (nullable = true)
 |-- ce_state_name: string (nullable = true)
 |-- ce_summary: string (nullable = true)
 |-- ce_user_first_name: string (nullable = true)
 |-- ce_user_id: string (nullable = true)
 |-- ce_user_last_name: string (nullable = true)
 |-- extraction_date: string (nullable = true)

As you can see from the printSchema() output above, the field ce_data does not show up as an array; it is inferred as a string.

However, I would like to query the arrays in the ce_data field. For example, I would like to write a query that finds rows where lodgedDate = 1970-03-03.

My attempt would be something like

test = spark.sql("""select ce_data from testtable where ce_data.lodgedDate = '1970-03-03'""")

The error I'm getting when I enter the above code in Databricks is:

Can't extract value from ce_data#12747: need struct type but got string;

So, I would first like to understand why I'm not seeing the arrays in printSchema(); however, my main question is how to query arrays in JSON using Spark SQL.

I'm also wondering whether I need to import any libraries.

2 Answers

As you already mentioned, ce_data is a string with JSON content. Assuming the JSON is valid, you can use the get_json_object function to extract its properties, like so:

spark.sql("""
select ce_data
from testtable
where get_json_object(ce_data, "$.lodgedDate") = "1970-03-03"
""").show()

However, if you ask me, I much prefer the Python syntax to the SQL syntax. It's much cleaner this way:

from pyspark.sql import functions as F
(df
    .where(F.get_json_object('ce_data', '$.lodgedDate') == '1970-03-03')
    .show()
)

2 Comments

pltc, thanks for this. I haven't had time to test your answer yet, but if it works I will be running with spark.sql. Thanks again.
This worked perfectly. Thank you very much.

It looks as though your field is a JSON string rather than an array. You can convert it to a struct type with the from_json function; then you can query it with the code from your question. Note that you will need the schema of the JSON to use from_json. For example:

  from pyspark.sql import functions as f
  # get your json schema
  json_string = "{\"willDate\": \"1950-01-01\", \"jointWill\": \"Yes\", \"lodgedDate\": \"1970-03-03\", \"codicilDate\": \"1962-02-02\", \"executorTitle\": \"Mr\", \"lodgementType\": \"safeCustody\", \"deceasedGender\": \"male\", \"applicationType\": \"Personal\", \"deceasedAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"deceasedSurname\": \"E2E_deceased_surname_1633469477956\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor1_surname\", \"numberOfCodicils\": \"3\", \"registryLocation\": \"Liverpool\", \"deceasedForenames\": \"E2E_deceased_forenames_1633469477956\", \"documentsUploaded\": [{\"id\": \"b1181bfb-d0a7-49d8-8301-b06e58eb42c1\", \"value\": {\"Comment\": \"test file to upload\", \"DocumentLink\": {\"document_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d\", \"document_filename\": \"test_file_for_document_upload.png\", \"document_binary_url\": \"http://dm-store-aat.service.core-compute-aat.internal/documents/60cd0a78-648e-4af7-9f83-15a380f1786d/binary\"}, \"DocumentType\": \"email\"}}], \"executorForenames\": \"executor1_forenames\", \"deceasedDateOfBirth\": \"1930-01-01\", \"deceasedDateOfDeath\": \"2017-01-01\", \"deceasedTypeOfDeath\": \"diedOnOrAbout\", \"deceasedEmailAddress\": \"[email protected]\", \"executorEmailAddress\": \"[email protected]\", \"deceasedAnyOtherNames\": \"Yes\", \"additionalExecutorList\": [{\"id\": \"bd8d7ca2-ed84-424c-a241-40fd99b15596\", \"value\": {\"executorTitle\": \"Dr\", \"executorAddress\": {\"County\": \"London\", \"Country\": \"United Kingdom\", \"PostCode\": \"SW1A 1AA\", \"PostTown\": \"London\", \"AddressLine1\": \"1\", \"AddressLine2\": \"Buckingham Palace\", \"AddressLine3\": \"The place to be\"}, \"executorSurname\": \"executor2_surname\", \"executorForenames\": \"executor2_forenames\", \"executorEmailAddress\": \"[email protected]\"}}], \"deceasedFullAliasNameList\": [{\"id\": \"1970ac9d-532e-48f2-8851-04ae6eec973f\", \"value\": {\"FullAliasName\": \"deceased_alias1_1633469477956\"}}, {\"id\": \"86c842aa-e10f-44d5-8c28-a7ce8a1cb0eb\", \"value\": {\"FullAliasName\": \"deceased_alias2\"}}]}"
  your_json_schema = f.schema_of_json(json_string)
  # replace the JSON string column with the parsed struct
  df = df.withColumn("ce_data", f.from_json(df.ce_data, schema=your_json_schema))
  # now the struct fields can be queried directly
  df = df.filter("ce_data.lodgedDate = '1970-03-03'")
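
If you want to stay in Spark SQL, from_json is available there too; it accepts a DDL-formatted schema string, which can cover only the fields you need. A minimal sketch, assuming the same testtable view from the question:

spark.sql("""
select ce_data
from testtable
where from_json(ce_data, 'lodgedDate STRING').lodgedDate = '1970-03-03'
""").show()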

If there are arrays in the struct, you can create a new row per array element with the f.explode method. From there you can query them as you would any other column.
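
For example, a minimal sketch (assuming ce_data has already been converted to a struct as above, and using the deceasedFullAliasNameList field from the sample row) that creates one row per alias:

from pyspark.sql import functions as f

# one output row per element of the deceasedFullAliasNameList array
aliases = (df
    .select(f.explode("ce_data.deceasedFullAliasNameList").alias("alias_entry"))
    .select("alias_entry.value.FullAliasName")
)
aliases.show(truncate=False)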

2 Comments

Thanks for reaching out. TBH, I thought there might be an easier solution. I was also hoping to get an answer showing the results with Spark SQL.
It really depends on your use case. If you only want to make one or two queries on the JSON string field, then the solution provided by @pltc is probably your best bet. I was thinking that you wanted your data to be more structured so that it's easier to query, in which case converting the field to a struct type is the way to go. Once you have converted it to a struct type, you can use Spark SQL exactly as you wrote it in the question. But like pltc, I much prefer the pyspark.sql API, as it's easier to debug and catch errors.
