I am working with a non-nested JSON file containing data from Reddit, and I am trying to convert it to a CSV file using Python. The rows do not all have the same fields, and I keep getting this error:

JSONDecodeError: Extra data: line 2 column 1

Here is the code:

import csv
import json
import os

os.chdir('c:\\Users\\Desktop')
infile = open("data.json", "r")
outfile = open("outputfile.csv", "w")

writer = csv.writer(outfile)

for row in json.loads(infile.read()):
    writer.writerow(row)

Here are a few lines from the data:

{"author":"i_had_an_apostrophe","body":"\"It's not your fault.\"","author_flair_css_class":null,"link_id":"t3_5c0rn0","subreddit":"AskReddit","created_utc":1478736000,"subreddit_id":"t5_2qh1i","parent_id":"t1_d9t3q4d","author_flair_text":null,"id":"d9tlp0j"}
{"id":"d9tlp0k","author_flair_text":null,"parent_id":"t1_d9tame6","link_id":"t3_5c1efx","subreddit":"technology","created_utc":1478736000,"subreddit_id":"t5_2qh16","author":"willliam971","body":"9/11 inside job??","author_flair_css_class":null}
{"created_utc":1478736000,"subreddit_id":"t5_2qur2","link_id":"t3_5c44bz","subreddit":"excel","author":"excelevator","author_flair_css_class":"points","body":"Have you tried stepping through the code to analyse the values at each step?\n\n","author_flair_text":"442","id":"d9tlp0l","parent_id":"t3_5c44bz"}
{"created_utc":1478736000,"subreddit_id":"t5_2tycb","link_id":"t3_5c384j","subreddit":"OldSchoolCool","author":"10minutes_late","author_flair_css_class":null,"body":"**Thanks Hillary**","author_flair_text":null,"id":"d9tlp0m","parent_id":"t3_5c384j"}

My plan is to collect every field that appears anywhere in the data and use them as the CSV header; where a row has no data for a field, it should just be filled with NA.

  • What is your question? Commented Jan 27, 2017 at 0:14
  • Where do you find which columns to use in your csv file? Commented Jan 27, 2017 at 1:07
  • @DYZ My question is how to write the Python code so that it takes all the available fields from all rows and makes a csv that has nulls wherever data is not available for a field. Commented Jan 27, 2017 at 1:09
  • @RoryDaulton I am not sure of that, so I was thinking of taking all the available fields from all rows, creating the csv headers from them, and putting nulls wherever data is not available for a field in a given row. Commented Jan 27, 2017 at 1:11
  • Can you post your actual JSON data in a gist? The lines you quoted are not valid JSON (they're just four JSON objects, each on their own line). From the error it looks like the problem is in the read step, not the write step. Commented Jan 27, 2017 at 6:29
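To illustrate the point about the read step: a minimal sketch, assuming the file really is laid out as one JSON object per line, as in the quoted sample. The two-line sample string here is made up for the demonstration; it reproduces the "Extra data" error and shows that parsing line by line avoids it.

```python
import json

# Two JSON objects, one per line (a made-up sample in the question's layout).
sample = '{"id": "a", "body": "x"}\n{"id": "b", "author": "y"}\n'

try:
    json.loads(sample)  # expects exactly one JSON document
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg  # the parser stops after the first object

# Parsing each non-empty line separately succeeds.
rows = [json.loads(line) for line in sample.splitlines() if line.strip()]
```

The first object parses fine; the error is raised because more text follows it, which is why it points at line 2 column 1.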

4 Answers

Your question is missing some information about what you're trying to accomplish, so I'm guessing at the details. Note that csv files don't use "nulls" to represent missing fields; they just have delimiters with nothing between them, like 1,2,,4,5, which has no third field value.

Also, how you open csv files varies depending on whether you're using Python 2 or 3. The code below is for Python 3.

#!/usr/bin/env python3
import csv
import json
import os

os.chdir('c:\\Users\\Desktop')
# json.loads() expects the file to contain a single JSON document,
# e.g. one array of objects
with open('sampledata.json', 'r') as infile:
    data = json.loads(infile.read())

# determine all the keys present, which will each become csv fields
# (sorted so the column order is the same on every run)
fields = sorted(set(key for row in data for key in row))

with open('outputfile.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fields)
    writer.writeheader()
    writer.writerows(data)

2 Comments

It is still not able to figure out all the fields, got the error: JSONDecodeError("Extra data", s, end)
That may be because the JSON data shown in your question isn't valid. JSON objects can't appear one-right-after-the-other like that, so the JSONDecoder is complaining. For testing purposes, I enclosed the group of them all in [] bracket characters and added a comma between each. If your data is actually in exactly the format you describe, one-object-per-line, you can work around the issue by calling json.loads() for each row of the input file and creating the data list that way.
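The per-line workaround described in the comment above could look roughly like this. It is a sketch, assuming one JSON object per non-blank line; the helper names load_json_lines and write_csv are made up for illustration, and the demo uses in-memory text rather than the real files.

```python
import csv
import io
import json

def load_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]

def write_csv(rows, outfile):
    """Write dict rows to outfile, using the sorted union of all keys as header."""
    fields = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty fields
    return fields

# Demo with in-memory data standing in for data.json and outputfile.csv.
demo_input = io.StringIO('{"id": "a", "body": "hi"}\n{"id": "b", "author": "me"}\n')
demo_output = io.StringIO()
demo_fields = write_csv(load_json_lines(demo_input), demo_output)
```

With real files, an open file object can be passed straight to load_json_lines, and the output file should be opened with newline='' as in the answer's code.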

I suggest you use the csv.DictWriter class. That class needs a file to write to and a list of fieldnames (which I've worked out from your data sample).

import csv
import json
import os

fieldnames = [
    "author", "author_flair_css_class", "author_flair_text", "body",
    "created_utc", "id", "link_id", "parent_id", "subreddit",
    "subreddit_id"
]

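os.chdir('c:\\Users\\Desktop')
# newline="" lets the csv module control line endings
# (otherwise blank rows can appear on Windows)
with open("data.json", "r") as infile, \
        open("outputfile.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    # the input holds one JSON object per line, so parse it line by line
    for row in infile:
        row_dict = json.loads(row)
        writer.writerow(row_dict)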

2 Comments

This works fine for the above four lines of data, but when I run it on the whole data file, I get UnicodeEncodeError: 'charmap' codec can't encode character '\u03a9' in position 46: character maps to <undefined>
That error is caused by the file encoding. Because it is an encode error, it is raised while writing the output, so specifying the encoding there, e.g. open('outputfile.csv', 'w', encoding='utf-8'), should fix it; opening the input the same way, with open('data.json', 'r', encoding='utf-8') as infile:, is a good idea too. (The encoding keyword is available in py3k.) docs.python.org/3/library/functions.html#open
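A sketch of how the encoding and the "fill with NA" request could fit into the DictWriter approach. The shortened field list, the two sample lines and the temporary output path are all made up for the demonstration; restval and extrasaction are standard csv.DictWriter parameters.

```python
import csv
import json
import os
import tempfile

fieldnames = ["author", "body", "id"]  # shortened, hypothetical field list
lines = ['{"id": "a", "body": "\\u03a9"}',
         '{"id": "b", "author": "me", "extra": 1}']

out_path = os.path.join(tempfile.mkdtemp(), "outputfile.csv")
# encoding="utf-8" avoids falling back to the platform default (e.g. cp1252)
with open(out_path, "w", newline="", encoding="utf-8") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,
                            restval="NA",           # written when a key is missing
                            extrasaction="ignore")  # skip keys not in fieldnames
    writer.writeheader()
    for line in lines:
        writer.writerow(json.loads(line))

# Read the file back to inspect what was written.
with open(out_path, newline="", encoding="utf-8") as check:
    written = check.read()
```

Without extrasaction="ignore", DictWriter raises ValueError for a row containing a key that is not in fieldnames, which matters if the full data has fields the four sample lines do not.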

You can write a little function to build the rows for you, extracting data only where it is available and inserting None if it is not. What you called header, I called schema. Get all the fields, remove duplicates and sort, then build records based on the full set of fields and insert those records into the csv.

import csv
import json

def build_record(row, schema):
    values = []
    for field in schema:
        if field in row:
            values.append(row[field])
        else:
            values.append(None)
    return tuple(values)

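infile = open("data.json", "r").readlines()
# "wb" is the Python 2 way to open a csv file for writing;
# on Python 3 use open("outputfile.csv", "w", newline="") instead
outfile = open("outputfile.csv", "wb")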
writer = csv.writer(outfile)

rows = [json.loads(row.strip()) for row in infile]
schema = tuple(sorted(list(set([k for r in rows for k in r.keys()]))))
records = [build_record(r, schema) for r in rows]

writer.writerow(schema)

for rec in records:
    writer.writerow(rec)
outfile.close()

1 Comment

I got the TypeError: a bytes-like object is required, not 'str'
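That TypeError is what Python 3 raises when csv.writer is handed a file opened in binary mode. Below is a sketch of the same schema-and-records idea for Python 3, assuming one JSON object per line. The convert helper and the missing="NA" default are additions for illustration, and the demo writes to an in-memory buffer instead of outputfile.csv.

```python
import csv
import io
import json

def build_record(row, schema, missing="NA"):
    """Return the row's values in schema order, using `missing` for absent keys."""
    return [row.get(field, missing) for field in schema]

def convert(lines, outfile):
    """Write the JSON objects in `lines` to `outfile` (a text-mode file) as csv."""
    rows = [json.loads(line) for line in lines if line.strip()]
    schema = sorted({key for row in rows for key in row})
    writer = csv.writer(outfile)
    writer.writerow(schema)
    writer.writerows(build_record(row, schema) for row in rows)
    return schema

# Demo: a key that is present but null is written as an empty field,
# while a key that is absent from the row is written as "NA".
buffer = io.StringIO()
schema = convert(['{"id": "a", "flair": null}', '{"id": "b", "body": "hi"}'], buffer)
```

For real files, the output would be opened in text mode, e.g. open("outputfile.csv", "w", newline="", encoding="utf-8").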

You can use Pandas to fill in the blanks for you (you may need to pip install pandas first):

import pandas as pd
import os

# load json
os.chdir('c:\\Users\\Desktop')
with open("data.json", "r") as infile:

    # read data into a Pandas DataFrame
    # (by default read_json expects the file to hold a single JSON document)
    df = pd.read_json(infile)

# use Pandas to write to CSV
df.to_csv("myfile.csv")

3 Comments

Getting ValueError: Trailing data
Must be the form of the JSON. You can also just parse it separately and then read the dictionary: df = pd.DataFrame.from_dict(json.load(infile))
As I said in my comment above, we'd really need to see the actual JSON to help you fix JSON read errors.
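If the file really is one JSON object per line, as the quoted sample suggests, pandas can read that layout directly with the lines=True option of read_json. A sketch with a made-up two-line sample held in memory:

```python
import io

import pandas as pd

# Hypothetical two-line sample in the same one-object-per-line layout.
sample = '{"id": "a", "body": "hi"}\n{"id": "b", "author": "me"}\n'

# lines=True tells read_json to parse each line as a separate JSON object.
df = pd.read_json(io.StringIO(sample), lines=True)

# na_rep sets the text written for missing values; index=False drops the row index.
csv_text = df.to_csv(index=False, na_rep="NA")
```

With the real file this would be pd.read_json("data.json", lines=True) followed by df.to_csv("myfile.csv", index=False, na_rep="NA"). Fields that are missing from a row become NaN in the DataFrame and are written as NA.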
