
I want to convert my JSON data into a DataFrame to make it more manageable.

Code to read the data in Java:

Dataset<Row> df = spark.read()
   .option("multiline", true)
   .option("mode", "PERMISSIVE")
   .json("hdfs://hd-master:9820/houseInformation.txt");

Data

{
 "House1": {
   "House_Id": "1",
   "Cover": "1.000",
   "HouseType": "bungalow",
   "Facing": "South",
   "Region": "YVR",
   "Ru": "1",
   "HVAC": [
     "FAGF",
     "FPG",
     "HP"
   ]
 },
 "House2" : {...},
 "House3" : {...},
}

If possible, I would like to remove the key "House1" and then convert the rest of the data into a DataFrame. If not, that's fine too.

Ideally, something like this is my desired output:

HouseName  House_Id  Cover  HouseType  Facing  Region  Ru  HVAC
House1     1         1.000  bungalow   South   YVR     1   [FAGF, FPG, HP]

Schema

root
 |-- House1: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)

Simply printing

df.select(functions.col("House1")).show(false);

Returns this

+----------------------------------------------------+
|House1                                              |
+----------------------------------------------------+
|[1.000, South, [FAGF, FPG, HP], bungalow, 1, YVR, 1]|
+----------------------------------------------------+
Comment: the House1 column is of type struct; to see all the columns of the struct, try df.select(functions.col("House1.*")).show(false);

4 Answers


Your input JSON file should be in JSON Lines format (https://jsonlines.org/), whereas this is a single JSON document.

If all these "house" elements had the same key and the file were in JSON Lines format, they would automatically become separate rows in your DataFrame. But here the keys are different, i.e. House1, House2, etc., so the whole document is seen as a single record. There is an arbitrary number of "houses", each with a different key, so they are treated as different columns.

If you write a complex transformation to achieve your desired outcome, it will not be efficient. A single JSON record is not going to be parallelised by Spark: one executor will do all the work, and your DataFrame will have an arbitrarily large number of columns. You may as well not use Spark for this use case.

Therefore, the more sensible way IMO to fix this would be to fix your source document. This could mean preprocessing your source file to convert it to JSON Lines, as sketched below.
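For example, a minimal preprocessing sketch in Java, assuming Jackson (which Spark already ships with) is on the classpath; the input and output paths are placeholders:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.File;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.Map;

public class ToJsonLines {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Parse the single JSON document with the House1/House2/... keys.
        JsonNode root = mapper.readTree(new File("houseInformation.txt"));

        try (PrintWriter out = new PrintWriter("houseInformation.jsonl")) {
            Iterator<Map.Entry<String, JsonNode>> fields = root.fields();
            while (fields.hasNext()) {
                Map.Entry<String, JsonNode> entry = fields.next();
                // Copy the inner object and keep the outer key as a regular column.
                ObjectNode house = ((ObjectNode) entry.getValue()).deepCopy();
                house.put("HouseName", entry.getKey());
                // One compact JSON object per line = JSON Lines.
                out.println(mapper.writeValueAsString(house));
            }
        }
    }
}

Reading the resulting file with spark.read().json(...) (no multiline option needed) then yields one row per house, with HouseName already available as a plain column.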


2 Comments

Even though all the other answers gave me the answer I desired, I went with this one. It was the most efficient. I modified my file into a proper format and informed my professor of the changes.
One more tip that may help someone: sometimes you may get an input JSON file containing a JSON array. This can often be converted to JSON Lines by removing the [ at the start of the file and the ] at the end (plus the commas separating the objects, if each object sits on its own line); see the sketch below.
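If the array is pretty-printed across several lines, stripping just the brackets can leave stray commas behind; a more robust variant of the same idea, sketched with Jackson (file names are hypothetical):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.PrintWriter;

public class ArrayToJsonLines {
    public static void main(String[] args) throws Exception {
        // Parse the whole top-level array, then write one compact object per line.
        JsonNode arr = new ObjectMapper().readTree(new File("input.json"));
        try (PrintWriter out = new PrintWriter("output.jsonl")) {
            for (JsonNode element : arr) {
                out.println(element.toString());
            }
        }
    }
}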

As the House1 column is of type struct, use House1.* to extract all the columns from the struct.

Try below code.

df.select(functions.col("House1.*")).show(false);



Using Scala

1. Reading the house JSON data. Note that each JSON record is on a single line:

{ "House1": { "House_Id": "1", "Cover": "1.000", "HouseType": "bungalow", "Facing": "South", "Region": "YVR", "Ru": "1", "HVAC": [ "FAGF", "FPG", "HP" ] } }
{ "House2": { "House_Id": "2", "Cover": "1.000", "HouseType": "bungalow", "Facing": "North", "Region": "YVR", "Ru": "1", "HVAC": [ "FAGF", "FPG", "HP" ] } }

Code

val houseDS = spark.read.json("<JSON_FILE_PATH>")
houseDS.printSchema
root
 |-- House1: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)
 |-- House2: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)

houseDS.show(false)
+----------------------------------------------------+----------------------------------------------------+
|House1                                              |House2                                              |
+----------------------------------------------------+----------------------------------------------------+
|[1.000, South, [FAGF, FPG, HP], bungalow, 1, YVR, 1]|null                                                |
|null                                                |[1.000, North, [FAGF, FPG, HP], bungalow, 2, YVR, 1]|
+----------------------------------------------------+----------------------------------------------------+

2. We use the stack() function to separate multiple columns into rows. The syntax is stack(n, expr1, ..., exprk), which separates expr1, ..., exprk into n rows.

val houseDS2 = houseDS.select(expr("stack(2,House1, 'House1', House2, 'House2') as (house,HouseName)")).na.drop
houseDS2.printSchema
root
 |-- house: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)
 |-- HouseName: string (nullable = true)

3. Then select all the required columns from the houseDS2 Dataset above:

val finalHouseDS = houseDS2.select("HouseName","house.House_Id","house.Cover","house.HouseType","house.Facing","house.Region","house.Ru","house.HVAC")
finalHouseDS.show(false)

Your expected output

+---------+--------+-----+---------+------+------+---+---------------+
|HouseName|House_Id|Cover|HouseType|Facing|Region|Ru |HVAC           |
+---------+--------+-----+---------+------+------+---+---------------+
|House1   |1       |1.000|bungalow |South |YVR   |1  |[FAGF, FPG, HP]|
|House2   |2       |1.000|bungalow |North |YVR   |1  |[FAGF, FPG, HP]|
+---------+--------+-----+---------+------+------+---+---------------+

You can implement it similarly in Java. Please let me know if you face any performance issues on a larger Dataset.

Using Java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParseJson {
    public static void main(String[] args) {
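        // winutils workaround; only needed when running Spark locally on Windows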
        System.setProperty("hadoop.home.dir", "D:\\Software\\Hadoop");

        SparkSession spark = SparkSession
                .builder()
                .appName("Testing")
                .master("local[*]")
                .getOrCreate();
        // Read json data

        Dataset<Row> houseDS = spark.read().json("<JSON_FILE_PATH>");
        houseDS.printSchema();
        Dataset<Row> houseDS2 = houseDS.selectExpr("stack(2,House1, 'House1', House2, 'House2') as (house,HouseName)").na().drop();
        houseDS2.printSchema();
        Dataset<Row> finalHouseDS = houseDS2.select("HouseName","house.House_Id","house.Cover","house.HouseType","house.Facing","house.Region","house.Ru","house.HVAC");
        finalHouseDS.show(false);

    }
}
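If the number of house columns is not known in advance, the stack expression can be built from the DataFrame's columns instead of hardcoding House1 and House2. A minimal sketch, assuming every top-level column is a house struct:

// Build "stack(n, House1, 'House1', House2, 'House2', ...)" from the schema.
String[] cols = houseDS.columns();
StringBuilder stackExpr = new StringBuilder("stack(").append(cols.length);
for (String c : cols) {
    stackExpr.append(", ").append(c).append(", '").append(c).append("'");
}
stackExpr.append(") as (house, HouseName)");

Dataset<Row> stacked = houseDS.selectExpr(stackExpr.toString()).na().drop();

Column names containing special characters would need to be wrapped in backticks inside the expression.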



You can use the asterisk notation to select all the elements of a struct and expand them into columns:

Dataset<Row> df2 = df.select("House1.*");

df2.show(false);
+-----+------+---------------+---------+--------+------+---+
|Cover|Facing|HVAC           |HouseType|House_Id|Region|Ru |
+-----+------+---------------+---------+--------+------+---+
|1.000|South |[FAGF, FPG, HP]|bungalow |1       |YVR   |1  |
+-----+------+---------------+---------+--------+------+---+

