
I wrote some code to extract information from a website. The output is JSON, and I want to export it to CSV, so I tried converting it to a pandas DataFrame and then exporting that to CSV with pandas. I can print the results, but the JSON still doesn't get converted to a DataFrame. Do you know what the problem with my code is?

# -*- coding: utf-8 -*-
# To create http request/session 
import requests
import re, urllib
import pandas as pd
from BeautifulSoup import BeautifulSoup

url = "https://www.indeed.com/jobs? 
q=construction%20manager&l=Houston&start=10"

# create session
s = requests.session()
html = s.get(url).text

# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json 
ajax_content = s.get(ajax_url).json()
print(ajax_content)
#Convert to pandas dataframe
df = pd.read_json(ajax_content)
#Export to CSV
df.to_csv("c:\\users\\Name\desktop\\newcsv.csv")

The error message is:

Traceback (most recent call last):
  File "C:\Users\Mehrdad\Desktop\Indeed 06.py", line 21, in <module>
    df = pd.read_json(ajax_content)
  File "c:\python27\lib\site-packages\pandas\io\json\json.py", line 408, in read_json
    path_or_buf, encoding=encoding, compression=compression,
  File "c:\python27\lib\site-packages\pandas\io\common.py", line 218, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type:
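
As a point of reference, pd.read_json() expects a file path, buffer, or JSON string, while s.get(ajax_url).json() has already decoded the response into a dict, which is what triggers this ValueError. A minimal sketch with a hypothetical stand-in for the real response:

import json
import pandas as pd

# hypothetical stand-in for the decoded Ajax response
data = {"0079ccae458b4dcf": "<p>Company Environment...</p>"}

# pd.read_json(data)  # raises ValueError: Invalid file path or buffer object type
jobs = pd.read_json(json.dumps(data), typ='series')  # works: re-serialise to a JSON string first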

  • What is your error? Commented Mar 3, 2019 at 3:36
  • I just added the error message. Thanks. @AminGhaderi Commented Mar 3, 2019 at 3:57

1 Answer


The problem was that nothing was going into the DataFrame when you called read_json(), because what you passed it was a nested JSON dict rather than a path or JSON string:

import requests
import re, urllib
import pandas as pd
from pandas.io.json import json_normalize

url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston&start=10"

# fetch the search results page
s = requests.session()
html = s.get(url).text

# extract the job IDs and build the Ajax URL for the job descriptions
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)

# flatten the nested JSON and put the job IDs on the rows
ajax_content = s.get(ajax_url).json()
df = json_normalize(ajax_content).transpose()
df.to_csv('your_output_file.csv')

Note that I called json_normalize() to collapse the nested columns from the JSON, and transpose() so that the rows, rather than the columns, are labelled with the job IDs. This gives you a DataFrame that looks like this:

0079ccae458b4dcf    <p><b>Company Environment: </b></p><p>Planet F...
0c1ab61fe31a5c62    <p><b>Commercial Construction Project Manager<...
0feac44386ddcf99    <div><div>Trendmaker Homes is currently seekin...
...

It's not really clear what your expected output is, though: what are you expecting the DataFrame/CSV file to look like? If you were actually looking for just a single row/Series with the job IDs as column labels, just remove the call to transpose(), as in the sketch below.
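
For instance, a minimal sketch of that single-row variant (the output file name is just a placeholder):

# one row; the job IDs become the column labels
df = json_normalize(ajax_content)
df.to_csv('jobs_single_row.csv', index=False)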


3 Comments

Thanks for your code. Later, I want to use NLTK (Natural Language Toolkit) to segment the text into individual sentences, but I'm still not sure what kind of output I need for that purpose. The issue is that now I've lost most of the contents of the texts; if you print the output of the original code you'll see them all. Is there any way to recover them? Do you have any suggestions for the output?
I'm not sure what you mean by you "lost most of the contents of the texts". If you are talking about some of the text being missing when you print() the dataframe, that is just because, for long text fields, pandas only shows the first part as a summary/excerpt. All of the text from the original JSON is stored internally (nothing is missing); it just isn't shown in the printed representation of the dataframe (see the display-width sketch after these comments).
As far as how to segment the text into sentences, and other language processing techniques, you should start a separate question for those. But in general, I would recommend checking out the documentation for nltk.sent_tokenize() if you want to segment into sentences. You would want to use an HTML parsing library like BeautifulSoup or lxml to extract plaintext first, though, because the sentence tokenizers generally can't work with markup. A rough sketch follows below.
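
On the truncated display, a minimal sketch of widening pandas' printed output so the full text shows (-1 means "no limit" in the 0.x pandas versions shown in the traceback above; newer versions use None):

import pandas as pd

# show full cell contents instead of a truncated excerpt
pd.set_option('display.max_colwidth', -1)
print(df)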
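
And on sentence segmentation, a rough sketch, assuming ajax_content maps job IDs to HTML description strings (as the sample output above suggests) and that the bs4 and nltk packages are installed:

import nltk
from bs4 import BeautifulSoup

nltk.download('punkt')  # one-time download of the sentence tokenizer models

for job_id, description_html in ajax_content.items():
    # strip the HTML markup first; the sentence tokenizer can't handle tags
    text = BeautifulSoup(description_html, 'html.parser').get_text()
    sentences = nltk.sent_tokenize(text)
    print(job_id, len(sentences))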
