
I wrote some code to extract information from a website. The output is JSON, and I want to export it to CSV, so I tried converting it to a pandas DataFrame and then exporting that to CSV with pandas. I can print the results, but the JSON still doesn't get converted to a DataFrame. Do you know what the problem with my code is?

# -*- coding: utf-8 -*-
# To create http request/session 
import requests
import re, urllib
import pandas as pd
from BeautifulSoup import BeautifulSoup

url = "https://www.indeed.com/jobs? 
q=construction%20manager&l=Houston&start=10"

# create session
s = requests.session()
html = s.get(url).text

# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json 
ajax_content = s.get(ajax_url).json()
print(ajax_content)
#Convert to pandas dataframe
df = pd.read_json(ajax_content)
#Export to CSV
df.to_csv("c:\\users\\Name\desktop\\newcsv.csv")

The error message is:

Traceback (most recent call last):
  File "C:\Users\Mehrdad\Desktop\Indeed 06.py", line 21, in <module>
    df = pd.read_json(ajax_content)
  File "c:\python27\lib\site-packages\pandas\io\json\json.py", line 408, in read_json
    path_or_buf, encoding=encoding, compression=compression,
  File "c:\python27\lib\site-packages\pandas\io\common.py", line 218, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type:
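
As a point of reference, pd.read_json() expects a file path, buffer, or JSON string, while s.get(ajax_url).json() has already decoded the response into a dict, which is what triggers this ValueError. A minimal sketch with a hypothetical stand-in for the real response:

import json
import pandas as pd

# hypothetical stand-in for the decoded Ajax response
data = {"0079ccae458b4dcf": "<p>Company Environment...</p>"}

# pd.read_json(data)  # raises ValueError: Invalid file path or buffer object type
jobs = pd.read_json(json.dumps(data), typ='series')  # works: re-serialise to a JSON string first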

  • What is your error? Commented Mar 3, 2019 at 3:36
  • I just added the error message. Thanks. @AminGhaderi Commented Mar 3, 2019 at 3:57

1 Answer


The problem was that nothing was going into the DataFrame when you called read_json(), because what you passed it was a nested JSON dict rather than a path or JSON string:

import requests
import re, urllib
import pandas as pd
from pandas.io.json import json_normalize

url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston&start=10"

# fetch the search results page
s = requests.session()
html = s.get(url).text

# extract the job IDs and build the Ajax URL for the job descriptions
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)

# flatten the nested JSON and put the job IDs on the rows
ajax_content = s.get(ajax_url).json()
df = json_normalize(ajax_content).transpose()
df.to_csv('your_output_file.csv')

Note that I called json_normalize() to collapse the nested columns from the JSON, and transpose() so that the rows, rather than the columns, are labelled with the job IDs. This gives you a DataFrame that looks like this:

0079ccae458b4dcf    <p><b>Company Environment: </b></p><p>Planet F...
0c1ab61fe31a5c62    <p><b>Commercial Construction Project Manager<...
0feac44386ddcf99    <div><div>Trendmaker Homes is currently seekin...
...

It's not really clear what your expected output is, though: what are you expecting the DataFrame/CSV file to look like? If you were actually looking for just a single row/Series with the job IDs as column labels, just remove the call to transpose(), as in the sketch below.
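
For instance, a minimal sketch of that single-row variant (the output file name is just a placeholder):

# one row; the job IDs become the column labels
df = json_normalize(ajax_content)
df.to_csv('jobs_single_row.csv', index=False)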


3 Comments

Thanks for your code. Later, I want to use NLTK (Natural Language Toolkit) to segment the text into individual sentences, but I'm still not sure what kind of output I need for that purpose. The issue is that now I've lost most of the contents of the texts; if you print the output of the original code you'll see them all. Is there any way to recover them? Do you have any suggestions for the output?
I'm not sure what you mean by you "lost most of the contents of the texts". If you are talking about some of the text being missing when you print() the dataframe, that is just because, for long text fields, pandas only shows the first part as a summary/excerpt. All of the text from the original JSON is stored internally (nothing is missing); it just isn't shown in the printed representation of the dataframe (see the display-width sketch after these comments).
As far as how to segment the text into sentences, and other language processing techniques, you should start a separate question for those. But in general, I would recommend checking out the documentation for nltk.sent_tokenize() if you want to segment into sentences. You would want to use an HTML parsing library like BeautifulSoup or lxml to extract plaintext first, though, because the sentence tokenizers generally can't work with markup. A rough sketch follows below.
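
On the truncated display, a minimal sketch of widening pandas' printed output so the full text shows (-1 means "no limit" in the 0.x pandas versions shown in the traceback above; newer versions use None):

import pandas as pd

# show full cell contents instead of a truncated excerpt
pd.set_option('display.max_colwidth', -1)
print(df)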
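
And on sentence segmentation, a rough sketch, assuming ajax_content maps job IDs to HTML description strings (as the sample output above suggests) and that the bs4 and nltk packages are installed:

import nltk
from bs4 import BeautifulSoup

nltk.download('punkt')  # one-time download of the sentence tokenizer models

for job_id, description_html in ajax_content.items():
    # strip the HTML markup first; the sentence tokenizer can't handle tags
    text = BeautifulSoup(description_html, 'html.parser').get_text()
    sentences = nltk.sent_tokenize(text)
    print(job_id, len(sentences))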
