
I want to convert my JSON data into a DataFrame to make it more manageable.

Code to read the data in Java:

Dataset<Row> df = spark.read()
   .option("multiline", true)
   .option("mode", "PERMISSIVE")
   .json("hdfs://hd-master:9820/houseInformation.txt");

Data

{
 "House1": {
   "House_Id": "1",
   "Cover": "1.000",
   "HouseType": "bungalow",
   "Facing": "South",
   "Region": "YVR",
   "Ru": "1",
   "HVAC": [
     "FAGF",
     "FPG",
     "HP"
   ]
 },
 "House2" : {...},
 "House3" : {...},
}

If possible, I would like to remove the key "House1" and then convert the rest of the data into a DataFrame. If not, that's fine too.

Ideally, something like this is my desired output:

HouseName  House_Id  Cover  HouseType  Facing  Region  Ru  HVAC
House1     1         1.000  bungalow   South   YVR     1   [FAGF, FPG, HP]

Schema

root
 |-- House1: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)

Simply printing

df.select(functions.col("House1")).show(false);

Returns this

+----------------------------------------------------+
|House1                                              |
+----------------------------------------------------+
|[1.000, South, [FAGF, FPG, HP], bungalow, 1, YVR, 1]|
+----------------------------------------------------+
Comment: the House1 column is of type struct; to see all the columns of the struct, try df.select(functions.col("House1.*")).show(false);

4 Answers


Your input JSON file should be in JSON Lines format (https://jsonlines.org/), whereas this is a single JSON document.

If all these "house" elements had the same key and the file were in JSON Lines format, they would automatically become separate rows in your DataFrame. But here the keys are different, i.e. House1, House2, etc., so the whole document is seen as a single record. There is an arbitrary number of "houses", each with a different key, so they are treated as different columns.

If you write a complex transformation to achieve your desired outcome, it will not be efficient. A single JSON record is not going to be parallelised by Spark: one executor will do all the work, and your DataFrame will have an arbitrarily large number of columns. You may as well not use Spark for this use case.

Therefore, the more sensible way IMO to fix this would be to fix your source document. This could mean preprocessing your source file to convert it to JSON Lines, as sketched below.
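For example, a minimal preprocessing sketch in Java, assuming Jackson (which Spark already ships with) is on the classpath; the input and output paths are placeholders:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.File;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.Map;

public class ToJsonLines {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Parse the single JSON document with the House1/House2/... keys.
        JsonNode root = mapper.readTree(new File("houseInformation.txt"));

        try (PrintWriter out = new PrintWriter("houseInformation.jsonl")) {
            Iterator<Map.Entry<String, JsonNode>> fields = root.fields();
            while (fields.hasNext()) {
                Map.Entry<String, JsonNode> entry = fields.next();
                // Copy the inner object and keep the outer key as a regular column.
                ObjectNode house = ((ObjectNode) entry.getValue()).deepCopy();
                house.put("HouseName", entry.getKey());
                // One compact JSON object per line = JSON Lines.
                out.println(mapper.writeValueAsString(house));
            }
        }
    }
}

Reading the resulting file with spark.read().json(...) (no multiline option needed) then yields one row per house, with HouseName already available as a plain column.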


2 Comments

Even though all the other answers gave me the answer I desired, I went with this one. It was the most efficient. I modified my file into a proper format and informed my professor of the changes.
One more tip that may help someone: sometimes you may get an input JSON file containing a JSON array. This can often be converted to JSON Lines by removing the [ at the start of the file and the ] at the end (plus the commas separating the objects, if each object sits on its own line); see the sketch below.
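If the array is pretty-printed across several lines, stripping just the brackets can leave stray commas behind; a more robust variant of the same idea, sketched with Jackson (file names are hypothetical):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.PrintWriter;

public class ArrayToJsonLines {
    public static void main(String[] args) throws Exception {
        // Parse the whole top-level array, then write one compact object per line.
        JsonNode arr = new ObjectMapper().readTree(new File("input.json"));
        try (PrintWriter out = new PrintWriter("output.jsonl")) {
            for (JsonNode element : arr) {
                out.println(element.toString());
            }
        }
    }
}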

As the House1 column is of type struct, use House1.* to extract all the columns from the struct.

Try below code.

df.select(functions.col("House1.*")).show(false);



Using Scala

1. Reading the house JSON data. Note that each JSON record is on a single line:

{ "House1": { "House_Id": "1", "Cover": "1.000", "HouseType": "bungalow", "Facing": "South", "Region": "YVR", "Ru": "1", "HVAC": [ "FAGF", "FPG", "HP" ] } }
{ "House2": { "House_Id": "2", "Cover": "1.000", "HouseType": "bungalow", "Facing": "North", "Region": "YVR", "Ru": "1", "HVAC": [ "FAGF", "FPG", "HP" ] } }

Code

val houseDS = spark.read.json("<JSON_FILE_PATH>")
houseDS.printSchema
root
 |-- House1: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)
 |-- House2: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)

houseDS.show(false)
+----------------------------------------------------+----------------------------------------------------+
|House1                                              |House2                                              |
+----------------------------------------------------+----------------------------------------------------+
|[1.000, South, [FAGF, FPG, HP], bungalow, 1, YVR, 1]|null                                                |
|null                                                |[1.000, North, [FAGF, FPG, HP], bungalow, 2, YVR, 1]|
+----------------------------------------------------+----------------------------------------------------+

2. We use the stack() function to separate multiple columns into rows. The syntax is stack(n, expr1, ..., exprk), which separates expr1, ..., exprk into n rows.

val houseDS2 = houseDS.select(expr("stack(2,House1, 'House1', House2, 'House2') as (house,HouseName)")).na.drop
houseDS2.printSchema
root
 |-- house: struct (nullable = true)
 |    |-- Cover: string (nullable = true)
 |    |-- Facing: string (nullable = true)
 |    |-- HVAC: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- HouseType: string (nullable = true)
 |    |-- House_Id: string (nullable = true)
 |    |-- Region: string (nullable = true)
 |    |-- Ru: string (nullable = true)
 |-- HouseName: string (nullable = true)

3. Then select all the required columns from the houseDS2 Dataset above:

val finalHouseDS = houseDS2.select("HouseName","house.House_Id","house.Cover","house.HouseType","house.Facing","house.Region","house.Ru","house.HVAC")
finalHouseDS.show(false)

Your expected output

+---------+--------+-----+---------+------+------+---+---------------+
|HouseName|House_Id|Cover|HouseType|Facing|Region|Ru |HVAC           |
+---------+--------+-----+---------+------+------+---+---------------+
|House1   |1       |1.000|bungalow |South |YVR   |1  |[FAGF, FPG, HP]|
|House2   |2       |1.000|bungalow |North |YVR   |1  |[FAGF, FPG, HP]|
+---------+--------+-----+---------+------+------+---+---------------+

You can implement it similarly in Java. Please let me know if you face any performance issues on a larger Dataset.

Using Java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParseJson {
    public static void main(String[] args) {
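        // winutils workaround; only needed when running Spark locally on Windows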
        System.setProperty("hadoop.home.dir", "D:\\Software\\Hadoop");

        SparkSession spark = SparkSession
                .builder()
                .appName("Testing")
                .master("local[*]")
                .getOrCreate();
        // Read json data

        Dataset<Row> houseDS = spark.read().json("<JSON_FILE_PATH>");
        houseDS.printSchema();
        Dataset<Row> houseDS2 = houseDS.selectExpr("stack(2,House1, 'House1', House2, 'House2') as (house,HouseName)").na().drop();
        houseDS2.printSchema();
        Dataset<Row> finalHouseDS = houseDS2.select("HouseName","house.House_Id","house.Cover","house.HouseType","house.Facing","house.Region","house.Ru","house.HVAC");
        finalHouseDS.show(false);

    }
}
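If the number of house columns is not known in advance, the stack expression can be built from the DataFrame's columns instead of hardcoding House1 and House2. A minimal sketch, assuming every top-level column is a house struct:

// Build "stack(n, House1, 'House1', House2, 'House2', ...)" from the schema.
String[] cols = houseDS.columns();
StringBuilder stackExpr = new StringBuilder("stack(").append(cols.length);
for (String c : cols) {
    stackExpr.append(", ").append(c).append(", '").append(c).append("'");
}
stackExpr.append(") as (house, HouseName)");

Dataset<Row> stacked = houseDS.selectExpr(stackExpr.toString()).na().drop();

Column names containing special characters would need to be wrapped in backticks inside the expression.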



You can use the asterisk notation to select all the elements of a struct and expand them into columns:

Dataset<Row> df2 = df.select("House1.*");

df2.show(false);
+-----+------+---------------+---------+--------+------+---+
|Cover|Facing|HVAC           |HouseType|House_Id|Region|Ru |
+-----+------+---------------+---------+--------+------+---+
|1.000|South |[FAGF, FPG, HP]|bungalow |1       |YVR   |1  |
+-----+------+---------------+---------+--------+------+---+

