1

I'm looking for a way to extract the main text of a Wikipedia article using Python. I'm aware of the "wikipedia" library, but in my case I have already downloaded the HTML page and only need to extract the text from it. I can't use that library because I need to work with Wikipedia page HTML that was downloaded some years ago, so I can't download it again.

Is there an "off the shelf" solution that I can use for this purpose?

  • As @CodeNinja says, BeautifulSoup is a great tool; you can follow the tutorial "Easy Web Scraping with Python" to learn more about it. Commented Oct 9, 2014 at 18:14
  • I've answered something similar here: stackoverflow.com/questions/23671560/… Commented Oct 9, 2014 at 18:38
  • I know about BeautifulSoup and have used it before. What I'm looking for is something that doesn't require me to work out which tags to consider, and that ideally also strips the wiki formatting, such as the reference markers ([1], ...). Commented Oct 9, 2014 at 22:11

2 Answers

2

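Try BeautifulSoup:

from bs4 import BeautifulSoup
import requests

# Fetch the page; for an already downloaded page, read the HTML from the file instead.
response = requests.get("http://pl.wikipedia.org/wiki/StackOverflow")
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(response.text, "html.parser")
paragraphs = soup.find_all("p")
print(paragraphs[0].text)  # text of the first paragraph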
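
Since the question is about a page that is already saved on disk, and the follow-up comment asks for the reference markers ([1], ...) to be removed, here is a minimal sketch of how that could look with BeautifulSoup. It rests on assumptions about Wikipedia's markup that are not stated in the question: that the article body sits in an element with id "mw-content-text" and that footnote markers are rendered as <sup class="reference">. Older downloads may use different markup, so check a sample page first. The function name extract_main_text and the sample HTML are made up for illustration.

```python
import re

from bs4 import BeautifulSoup


def extract_main_text(html):
    """Return the paragraph text of a saved Wikipedia page as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the article body lives in the element with
    # id="mw-content-text"; fall back to the whole document otherwise.
    body = soup.find(id="mw-content-text") or soup
    # Assumption: footnote markers such as [1] are <sup class="reference">.
    for sup in body.find_all("sup", class_="reference"):
        sup.decompose()
    paragraphs = [p.get_text() for p in body.find_all("p")]
    # Collapse runs of whitespace and skip empty paragraphs.
    cleaned = [re.sub(r"\s+", " ", text).strip() for text in paragraphs]
    return "\n\n".join(text for text in cleaned if text)


# Tiny hand-written stand-in for a downloaded page.
sample = """
<html><body><div id="mw-content-text">
<p>Stack Overflow is a question-and-answer site.<sup class="reference"><a href="#cite_note-1">[1]</a></sup></p>
<p>It was created in 2008.</p>
</div></body></html>
"""

print(extract_main_text(sample))
```

For a real page, read the saved file into a string (for example with open(path, encoding="utf-8"), where path points to your download) and pass that string to the function. Only <p> elements are kept, so tables, infoboxes, and lists are dropped; whether that counts as the "main text" depends on what you need.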

0

You can use this Python module:

pip install wikipedia
