I am using Spark 2.3.2 to read a multiline JSON file. This is the output of df.printSchema():

root
 |-- data: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- context: struct (nullable = true)
 |    |    |    |    |-- environment: struct (nullable = true)
 |    |    |    |    |    |-- tag: struct (nullable = true)
 |    |    |    |    |    |    |-- weather: string (nullable = true)
 |    |    |    |    |    |-- weather: struct (nullable = true)
 |    |    |    |    |    |    |-- clouds: double (nullable = true)
 |    |    |    |    |    |    |-- rain: long (nullable = true)
 |    |    |    |    |    |    |-- temp: long (nullable = true)
 |    |    |    |    |-- personal: struct (nullable = true)
 |    |    |    |    |    |-- activity: struct (nullable = true)
 |    |    |    |    |    |    |-- conditions: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |    |-- kind: string (nullable = true)
 |    |    |    |    |    |-- status: struct (nullable = true)
 |    |    |    |    |    |    |-- speed: double (nullable = true)
 |    |    |    |    |-- timespace: struct (nullable = true)
 |    |    |    |    |    |-- geo: struct (nullable = true)
 |    |    |    |    |    |    |-- coordinates: array (nullable = true)
 |    |    |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- tag: struct (nullable = true)
 |    |    |    |    |    |    |-- season: string (nullable = true)
 |    |    |    |    |    |-- timestamp: string (nullable = true)
 |    |    |    |-- passport: struct (nullable = true)
 |    |    |    |    |-- pid: string (nullable = true)
 |    |    |    |    |-- uid: string (nullable = true)

The JSON is deeply nested, so retrieving particular nested fields such as season or speed is not trivial.

This is how I read data:

SparkSession spark = SparkSession.builder()
                                 .config("spark.rdd.compress", "true")
                                 .appName("Test")
                                 .master("local[*]")
                                 .getOrCreate();
Dataset<Row> df = spark
    .read()
    .option("multiLine", true).option("mode", "PERMISSIVE")
    .json(filePath);

How can I get the timestamp and the weather tag into a separate Dataset, like this?

timestamp  weather
...        ...
...        ...

I tried this, but it did not work:

df.registerTempTable("df");
Dataset result = spark.sql("SELECT data.items.element.passport.uid FROM df");

or

Dataset result = df.withColumn("items",
                org.apache.spark.sql.functions.explode(df.col("data.items")))
                .select(df.col("items.context.environment.weather"));
1 Answer

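You can read the multiline JSON file and select nested fields by their full dotted path, as shown below. Because items is an array, each selected column comes back as an array with one value per item.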

// Read the multiline JSON file
Dataset<Row> ds = spark.read().option("multiLine", true).option("mode", "PERMISSIVE")
        .json("c:\\temp\\test.json");
// Print the schema
ds.printSchema();
// Get weather
Dataset<Row> ds1 = ds.select("data.items.context.environment.weather");
ds1.show(false);
// Get timestamp (per the schema above, it is nested under context.timespace)
Dataset<Row> ds2 = ds.select("data.items.context.timespace.timestamp");
ds2.show(false);
// Get weather and timestamp
Dataset<Row> ds3 = ds.select("data.items.context.environment.weather",
        "data.items.context.timespace.timestamp");
ds3.show(false);

With Spark 2.4.0 or later, you can use the explode and arrays_zip functions to combine the two array columns and produce one row per item:

import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.arrays_zip;
import static org.apache.spark.sql.functions.col;

Dataset<Row> ds4 = ds3
        .withColumn("values", explode(arrays_zip(col("weather"), col("timestamp"))))
        .select(col("values.weather"), col("values.timestamp"));
ds4.show(false);
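
Since the question uses Spark 2.3.2, where arrays_zip is not available, here is a minimal sketch of an alternative that explodes data.items once and then selects the nested fields from each item. The class name NestedJsonExample and the helper timestampAndWeather are made up for illustration. The sketch also assumes that "weather tag" means context.environment.tag.weather; use item.context.environment.weather instead if the struct is wanted. Note that it resolves columns with functions.col rather than df.col, because the exploded column does not exist on the original df, which is a likely reason the second attempt in the question failed.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class NestedJsonExample {

    // Hypothetical helper: one output row per element of data.items,
    // with the nested timestamp and weather tag as top-level columns.
    static Dataset<Row> timestampAndWeather(Dataset<Row> df) {
        return df
                // Turn the array into one row per item
                .select(explode(col("data.items")).as("item"))
                // Paths follow the schema printed in the question
                .select(
                        col("item.context.timespace.timestamp").as("timestamp"),
                        col("item.context.environment.tag.weather").as("weather"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("NestedJsonExample")
                .master("local[*]")
                .getOrCreate();

        // args[0] is assumed to be the path to the multiline JSON file
        Dataset<Row> df = spark.read()
                .option("multiLine", true)
                .option("mode", "PERMISSIVE")
                .json(args[0]);

        timestampAndWeather(df).show(false);
        spark.stop();
    }
}
```

Exploding first keeps each item's timestamp and weather on the same row, so no zipping of parallel arrays is needed.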