
I have test JSON data at the following link:

http://developer.trade.gov/api/market-research-library.json

When I try to read the schema directly from it in the following manner:

public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}

Spark does not give me the expected schema; it prints only a corrupt-record column:

root
 |-- _corrupt_record: string (nullable = true)

If the file is malformed, how can I convert it into the format Spark expects?

  • Each JSON object should be on a single line for Spark to create a DataFrame out of it. Commented Sep 13, 2017 at 9:29
  • When it is a big file in such a format, what are the options, @philantrovert? Commented Sep 13, 2017 at 9:45
  • The file that you have provided has only one JSON object. Will that always be the case? If yes, then you can just read it as an RDD and do a replaceAll for the newline character \n. Commented Sep 13, 2017 at 9:55

3 Answers


Convert your JSON to a single line.

Or set option("multiLine", true) (available since Spark 2.2) to allow multi-line JSON.
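As a minimal sketch of the first option, assuming the file holds a single JSON document that fits in memory, the standard-library json module can re-serialize it onto one line. The helper name to_single_line and the sample document are made up for illustration; neither comes from Spark or from the question's file.

```python
import json


def to_single_line(text: str) -> str:
    """Re-serialize a pretty-printed JSON document onto one line."""
    # json.loads validates the document; json.dumps without `indent`
    # emits no newlines and leaves string values untouched.
    return json.dumps(json.loads(text), separators=(",", ":"))


# Made-up sample, not the real contents of market-research-library.json.
pretty = """{
  "swagger": "2.0",
  "info": {
    "title": "Market Research Library"
  }
}"""

single = to_single_line(pretty)
print(single)
```

Writing the returned string to a file, followed by a newline, should give Spark's default line-based JSON reader one complete record per line.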


2 Comments

Dataset<Row> people = spark.read().option("multiLine", true).json("market-research-library.json"); It still gives the error.
To add, this will work for only one record per file.

If this is the only JSON file you want to convert to a DataFrame, I suggest using the wholeTextFiles API. Since the JSON is not in a format that Spark can read line by line, you can convert it only once the whole file has been read as a single string, and wholeTextFiles does exactly that.

Then you can remove the line feeds and spaces from the JSON string, which should give you the required DataFrame.

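// wholeTextFiles yields (path, content) pairs; keep the content on one line
val json = sc.wholeTextFiles("path to market-research-library.json file")
  .map(_._2.replace("\n", "").replace(" ", ""))
val df = sqlContext.read.json(json)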

You should get the required DataFrame with the following schema:

root
 |-- basePath: string (nullable = true)
 |-- definitions: struct (nullable = true)
 |    |-- Report: struct (nullable = true)
 |    |    |-- properties: struct (nullable = true)
 |    |    |    |-- click_url: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- country: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- description: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- expiration_date: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- id: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- industry: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- report_type: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- source_industry: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- title: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- url: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |-- host: string (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- version: string (nullable = true)
 |-- paths: struct (nullable = true)
 |    |-- /market_research_library/search: struct (nullable = true)
 |    |    |-- get: struct (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- parameters: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |    |-- format: string (nullable = true)
 |    |    |    |    |    |-- in: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |-- required: boolean (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- responses: struct (nullable = true)
 |    |    |    |    |-- 200: struct (nullable = true)
 |    |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |    |-- schema: struct (nullable = true)
 |    |    |    |    |    |    |-- items: struct (nullable = true)
 |    |    |    |    |    |    |    |-- $ref: string (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- tags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |-- produces: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- schemes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- swagger: string (nullable = true)
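One caveat worth checking with this approach: replace(" ", "") also removes spaces inside string values, not only the indentation. That does not change the inferred schema, but it would alter the data itself. The sketch below illustrates the difference on a made-up sample (not the real file), comparing the replace chain with parsing and re-serializing the document. It is written in Python purely as an illustration, and the variable names are hypothetical.

```python
import json

# Made-up sample resembling the question's file; not the real data.
raw = '{\n  "info": {\n    "title": "Market Research Library"\n  }\n}'

# Mirrors the replace("\n", "").replace(" ", "") chain from the answer.
naive = raw.replace("\n", "").replace(" ", "")

# Alternative: parse and re-serialize, which drops only formatting whitespace.
compact = json.dumps(json.loads(raw), separators=(",", ":"))

naive_title = json.loads(naive)["info"]["title"]
compact_title = json.loads(compact)["info"]["title"]
print(naive_title)    # MarketResearchLibrary
print(compact_title)  # Market Research Library
```

If the string values matter, removing only the newline characters is probably enough, since a valid JSON string cannot contain a raw newline.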


The format Spark expects by default is JSONL (JSON Lines), which is not standard JSON: every line must be a complete JSON value. I learned this from here. Here is a small Python script to convert your JSON to the expected format:

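import jsonlines
import json


# Load the whole JSON document into memory.
with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)

# write_all expects an iterable of records, so json_data should be a list here.
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(json_data)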

Then you can read the converted file in your program exactly as in your original code.
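The script above assumes the top-level JSON value is a list of records. The file in the question appears to hold a single top-level object, in which case iterating over it would presumably yield its keys rather than the object itself. A minimal standard-library sketch that handles both cases follows; write_json_lines, the sample data, and the temporary file names are all hypothetical.

```python
import json
import os
import tempfile


def write_json_lines(data, out_path):
    """Write one JSON value per line; wrap a single object in a list."""
    records = data if isinstance(data, list) else [data]
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    return len(records)


# Made-up sample data written to a temporary directory.
out_dir = tempfile.mkdtemp()
list_path = os.path.join(out_dir, "list_lines.json")
dict_path = os.path.join(out_dir, "dict_lines.json")

list_count = write_json_lines([{"id": 1}, {"id": 2}], list_path)
dict_count = write_json_lines({"swagger": "2.0"}, dict_path)

# Read the results back to inspect them.
with open(list_path, encoding="utf-8") as f:
    list_lines = f.read().splitlines()
with open(dict_path, encoding="utf-8") as f:
    dict_lines = f.read().splitlines()
```

A list produces one line per element, while a single object produces exactly one line, so either output should be readable with the original spark.read().json(...) call.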
