
I've been trying to parse a Wikipedia page in Python and have been quite successful using the API.

But the API documentation seems a bit too skeletal for me to get all the data. As of now, I'm doing a requests.get() call to

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=China&format=json&exintro=1

But this only returns the first paragraph, not the entire page. I've tried using allpages and search, but to no avail. A better explanation of how to get all the data from a wiki page, and not just the introduction returned by the query above, would be a real help.
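For reference, this is roughly what that call looks like with requests (a sketch; as far as I can tell, the legacy JSON format puts the text under query -> pages -> <pageid> -> extract):

import requests

# Sketch of the call above; the parameters mirror the URL.
params = {
    "action": "query",
    "prop": "extracts",
    "titles": "China",
    "format": "json",
    "exintro": 1,  # this is what limits the result to the introduction
}
resp = requests.get("http://en.wikipedia.org/w/api.php", params=params)
data = resp.json()

# The extract sits under query -> pages -> <pageid> -> extract.
for page in data["query"]["pages"].values():
    print(page.get("extract", ""))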

3 Answers


You seem to be using the query action to get the content of the page. According to its API specs, it returns only a part of the data. The proper action seems to be parse.

Here is a sample

# Python 2: urllib2 was merged into urllib.request in Python 3 (see the Python 3 answer below).
import urllib2
req = urllib2.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
content = req.read()
# content is JSON - use json or simplejson to get the relevant sections.
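To pull the page content out of that response, a minimal follow-up sketch (using the standard json module; in the legacy JSON format the rendered HTML sits under parse -> text -> "*"):

import json

data = json.loads(content)
# Rendered page HTML and title in the legacy response shape.
html = data["parse"]["text"]["*"]
print(data["parse"]["title"])
print(html[:200])  # first part of the full page HTML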

2 Comments

I noticed a spelling mistake: it's urllib, not urlib. I fixed it in my edit.
Thanks @JakobBowyer, did not realize it.

Have you considered using Beautiful Soup to extract the content from the page?

While I haven't used it for Wikipedia, others have, and having used it to scrape other pages I can say it is an excellent tool.
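A rough sketch of that approach, assuming the bs4 and requests packages (the paragraph-tag selection is just illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch the rendered article and pull the body text out of the HTML.
resp = requests.get("http://en.wikipedia.org/wiki/China")
soup = BeautifulSoup(resp.text, "html.parser")

# Illustrative: join the text of every paragraph on the page.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print("\n".join(paragraphs))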

1 Comment

Won't scraping take more time than using the API?

If someone is looking for a Python 3 answer, here you go:

import urllib.request

req = urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
print(req.read())

I'm using Python version 3.7.0b4.
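If you want the decoded JSON rather than raw bytes, a small sketch building on the above (assuming the standard json module; req.read() returns bytes, which json.loads accepts on Python 3.6+):

import json
import urllib.request

with urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text") as req:
    data = json.loads(req.read())  # decode the JSON response

# In the legacy JSON format the page HTML is under parse -> text -> "*".
print(data["parse"]["title"])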

