
I am trying to convert my pyspark sql dataframe to json and then save as a file.

df_final = df_final.union(join_df)

df_final contains the value as such:

(screenshot of df_final: a dataframe with columns Variable, Min and Max)

I tried something like this, but it created invalid JSON.

df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25","Max":"40"}

My expected file should have data as below:

[
{"Variable":"Col1",
"Min":"20",
"Max":"30"},
{"Variable":"Col2",
"Min":"25",
"Max":"40"}]
  • try df.toJSON() (Commented Nov 22, 2018 at 10:09)

4 Answers


In PySpark you can store your dataframe directly as a JSON file; there is no need to convert the dataframe to JSON first.

df_final.coalesce(1).write.format('json').save('/path/file_name.json')

If you still want to convert your dataframe to JSON, you can use df_final.toJSON().


1 Comment

Yeah, but it stores the data line by line: {"Variable":"Col1","Min":"20","Max":"30"} {"Variable":"Col2","Min":"25","Max":"40"}; instead it should be comma-separated and enclosed in square brackets

Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.

import json

# One JSON string per row -> list of dicts -> a single JSON array string
df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")

Note this will result in the whole dataframe being loaded into driver memory, so it is only recommended for small dataframes.
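Without a Spark cluster at hand, the collect-parse-dump steps above can be sketched in plain Python; the hardcoded list below is a stand-in for what df.toJSON().collect() would return:

```python
import json

# Stand-in for df.toJSON().collect(): Spark yields one JSON string per row
df_list_of_jsons = [
    '{"Variable":"Col1","Min":"20","Max":"30"}',
    '{"Variable":"Col2","Min":"25","Max":"40"}',
]

# Parse each line into a dict, then re-serialize the list as one JSON array
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)

print(df_json)  # one array, comma-separated and enclosed in square brackets
```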

1 Comment

I tried so many different methods, and this was what finally worked for me. Thank you!!

One solution is to collect the dataframe and then use json.dump:

import json

# Row objects are not JSON-serializable; convert each one to a dict first
collected_rows = df_final.collect()
data = [row.asDict() for row in collected_rows]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)

2 Comments

Actually this is correct, but it does not create the file directly in HDFS; it creates it on the container where the script runs.
It uses driver memory, so it's not recommended.

If you want to use Spark to process the result as JSON files, I think your output schema in HDFS is already right.

And I assume you ran into the issue that you cannot smoothly read the data from a normal Python script using:

import json

with open('data.json') as f:
    data = json.load(f)  # raises JSONDecodeError: the file is JSON Lines, not one JSON document

You should read the data line by line instead:

import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))

and then you can use pandas to create a dataframe:

import pandas as pd

df = pd.DataFrame(data)
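The read-line-by-line pattern above can be exercised without a real file by using io.StringIO as a stand-in for the Spark output (a sketch; the two hardcoded rows mirror the question's data):

```python
import io
import json

# Stand-in for the file Spark wrote: JSON Lines, one object per line
datafile = io.StringIO(
    '{"Variable":"Col1","Min":"20","Max":"30"}\n'
    '{"Variable":"Col2","Min":"25","Max":"40"}\n'
)

data = []
for line in datafile:
    data.append(json.loads(line))

# `data` is now a list of dicts, ready for pd.DataFrame(data)
```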

2 Comments

I was trying to understand why there was an answer related to reading the JSON file rather than writing it. I understand now: the JSON format Spark writes out is not comma-delimited, so it must be read back in a little differently. Thank you so much for this
@FahadAshraf Glad that helped. And yes, the JSON format that Spark writes out is not comma-delimited. It's very confusing the first time you read a JSON file created by Spark (or another HDFS writer).
