
I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.

I have the following table:

id array
1 [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]
2 [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}]

In a different SQL editor, I was able to achieve this by doing the following:

SELECT
id,
struct.firstName
FROM
table
CROSS JOIN UNNEST(array) as t(struct)

With a resulting table of:

id firstName
1 John
1 Jane
2 Bob
2 Betty

Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error:

[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.

I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.


1 Answer


1. SQL API

The first solution uses the SQL API. The first code snippet just sets up the test data, so you can skip it if you already have the table in place.

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField("people", ArrayType(StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('age', StringType(), True)
    ])), True)
])

sql_df = spark.createDataFrame([
    (1, [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]),
    (2, [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}])
], schema)
sql_df.createOrReplaceTempView("sql_df")

What you need is the LATERAL VIEW clause (docs), which lets you explode the nested structures, like this:

SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) t AS exploded;

+---+---------+
| id|firstName|
+---+---------+
|  1|     John|
|  1|     Jane|
|  2|      Bob|
|  2|    Betty|
+---+---------+

2. DataFrame API

The alternative approach is to use the explode function (docs), which gives you the same results:

from pyspark.sql.functions import explode, col

sql_df.select("id", explode(col("people.firstName")).alias("firstName")).show()

+---+---------+
| id|firstName|
+---+---------+
|  1|     John|
|  1|     Jane|
|  2|      Bob|
|  2|    Betty|
+---+---------+

1 Comment

The SQL API solution worked perfectly for me, thank you so much!
