
I've been trying to parse a Wikipedia page in Python and have been quite successful using the API.

But the API documentation seems a bit too skeletal for me to get all the data. As of now, I'm doing a requests.get() call to

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=China&format=json&exintro=1

But this only returns the first paragraph, not the entire page. I've tried using allpages and search, but to no avail. A better explanation of how to get all the data from a wiki page, and not just the introduction returned by the query above, would be a real help.
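For reference, this is roughly what that call looks like with requests (a sketch; as far as I can tell, the legacy JSON format puts the text under query -> pages -> <pageid> -> extract):

import requests

# Sketch of the call above; the parameters mirror the URL.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "China",
    "format": "json",
    "exintro": 1,  # this is what limits the result to the introduction
}
resp = requests.get("http://en.wikipedia.org/w/api.php", params=params)
data = resp.json()

# The extract sits under query -> pages -> <pageid> -> extract.
for page in data["query"]["pages"].values():
    print(page.get("extract", ""))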

3 Answers


You seem to be using the query action to get the content of the page. According to its API specs, it returns only a part of the data. The proper action seems to be parse.

Here is a sample

# Python 2: urllib2 was merged into urllib.request in Python 3 (see the Python 3 answer below).
import urllib2
req = urllib2.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
content = req.read()
# content is JSON - use json or simplejson to get the relevant sections.
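To pull the page content out of that response, a minimal follow-up sketch (using the standard json module; in the legacy JSON format the rendered HTML sits under parse -> text -> "*"):

import json

data = json.loads(content)
# Rendered page HTML and title in the legacy response shape.
html = data["parse"]["text"]["*"]
print(data["parse"]["title"])
print(html[:200])  # first part of the full page HTML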

2 Comments

I noticed a spelling mistake: it's urllib, not urlib. I fixed it in my edit.
Thanks @JakobBowyer, did not realize it.

Have you considered using Beautiful Soup to extract the content from the page?

While I haven't used it for Wikipedia, others have, and having used it to scrape other pages I can say it is an excellent tool.
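A rough sketch of that approach, assuming the bs4 and requests packages (the paragraph-tag selection is just illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch the rendered article and pull the body text out of the HTML.
resp = requests.get("http://en.wikipedia.org/wiki/China")
soup = BeautifulSoup(resp.text, "html.parser")

# Illustrative: join the text of every paragraph on the page.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print("\n".join(paragraphs))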

1 Comment

Won't scraping take more time than using the API?

If someone is looking for a Python 3 answer, here you go:

import urllib.request

req = urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
print(req.read())

I'm using Python version 3.7.0b4.
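If you want the decoded JSON rather than raw bytes, a small sketch building on the above (assuming the standard json module; req.read() returns bytes, which json.loads accepts on Python 3.6+):

import json
import urllib.request

with urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text") as req:
    data = json.loads(req.read())  # decode the JSON response

# In the legacy JSON format the page HTML is under parse -> text -> "*".
print(data["parse"]["title"])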

