
I currently have a JSON file that I am trying to query with sqlContext.sql(). It looks something like this:

{
  "sample": {
    "persons": [
      {
        "id": "123"
      },
      {
        "id": "456"
      }
    ]
  }
}

If I just want the first value, I would type:

sqlContext.sql("SELECT sample.persons[0] FROM test")

but I want all the values of "persons" without having to write a loop. Given the size of these files, looping over every element on the driver would be impractical.

I thought I would be able to put a range inside the [] brackets, but I can't find any syntax for doing that.
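For reference, this plain-Python sketch (standard library only) shows the flat list of ids I'm after, using the sample document above with the trailing commas removed so it parses:

```python
import json

# The sample document from the question (trailing commas removed so it is valid JSON)
doc = json.loads('{"sample": {"persons": [{"id": "123"}, {"id": "456"}]}}')

# The desired result: every id under sample.persons
ids = [p["id"] for p in doc["sample"]["persons"]]
print(ids)  # ['123', '456']
```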


1 Answer


If your schema looks like this:

root
 |-- sample: struct (nullable = true)
 |    |-- persons: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)

and you want to access the individual structs in the persons array, all you have to do is explode it:

from pyspark.sql.functions import explode

df.select(explode("sample.persons").alias("person")).select("person.id")

See also: Querying Spark SQL DataFrame with complex types
