
I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.

I have the following table:

id array
1 [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]
2 [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}]

In a different SQL editor, I was able to achieve this by doing the following:

SELECT
id,
struct.firstName
FROM
table
CROSS JOIN UNNEST(array) as t(struct)

With a resulting table of:

id firstName
1 John
1 Jane
2 Bob
2 Betty

Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error:

[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.

I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.


1 Answer


1. SQL API

The first solution uses the SQL API. The first code snippet just sets up the test data, so you can skip it if you already have the table in place.

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField("people", ArrayType(StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('age', StringType(), True)
    ])), True)
])

sql_df = spark.createDataFrame([
    (1, [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]),
    (2, [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}])
], schema)
sql_df.createOrReplaceTempView("sql_df")

What you need is the LATERAL VIEW clause (docs), which lets you explode the nested structures, like this:

SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) t AS exploded;

+---+---------+
| id|firstName|
+---+---------+
|  1|     John|
|  1|     Jane|
|  2|      Bob|
|  2|    Betty|
+---+---------+

2. DataFrame API

The alternative approach is to use the explode function (docs), which gives you the same results:

from pyspark.sql.functions import explode, col

sql_df.select("id", explode(col("people.firstName")).alias("firstName")).show()

+---+---------+
| id|firstName|
+---+---------+
|  1|     John|
|  1|     Jane|
|  2|      Bob|
|  2|    Betty|
+---+---------+

1 Comment

The SQL API solution worked perfectly for me, thank you so much!
