Extract text from Wikipedia html using Python

Question

I am looking for a way to extract the main text of a Wikipedia article using Python. I am aware of the "wikipedia" library, but in my case I have already downloaded the HTML page and just need to extract the text from it. I can't use that library because I need to work with Wikipedia page HTML that was downloaded some years ago, so I can't fetch it again from scratch.

Is there an "off the shelf" solution that I can use for this purpose?

As @CodeNinja says, BeautifulSoup is a great tool; you can follow the tutorial "Easy Web Scraping with Python" to learn more about it.
– Crisoforo Gaspar, Commented Oct 9, 2014 at 18:14
I've answered something similar here: stackoverflow.com/questions/23671560/…
– Vipul, Commented Oct 9, 2014 at 18:38
I know about BeautifulSoup and have used it in the past. What I was looking for is something that doesn't require me to work out which tags to consider, and that ideally also removes the wiki formatting, such as the reference markers ([1], ...).
– papafe, Commented Oct 9, 2014 at 22:11

CodeNinja · Accepted Answer · 2014-10-09 17:59:08Z

2

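Try BeautifulSoup:

from bs4 import BeautifulSoup
import requests

# Fetch the page; for a page already saved to disk, pass the file's contents to BeautifulSoup instead
response = requests.get("http://pl.wikipedia.org/wiki/StackOverflow")
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, "html.parser")
# Collect every <p> element, which is where the article's body text appears
paragraphs = soup.find_all("p")
print(paragraphs[0].text)  # text of the first paragraph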
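Since the question is about HTML that is already on disk, and the asker's follow-up comment asks for the reference markers ([1], ...) to be dropped, here is a rough sketch of the same "take the text of the <p> tags" idea. To keep it runnable without extra installs it uses the standard library's html.parser instead of BeautifulSoup. The names ParagraphTextExtractor and extract_paragraphs are made up for this example, and the assumption that footnote markers sit in <sup class="reference"> elements should be checked against your own saved pages.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect the text of <p> elements, skipping footnote markers.

    Assumption: footnote markers such as [1] are wrapped in
    <sup class="reference">; verify this against your saved HTML.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # text pieces of the current <p>, or None outside one
        self._skip_depth = 0   # > 0 while inside a reference <sup>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []
            self._skip_depth = 0
        elif tag == "sup" and self._buffer is not None:
            classes = (dict(attrs).get("class") or "").split()
            # Start skipping at a reference marker; count nested <sup> too
            if self._skip_depth > 0 or "reference" in classes:
                self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "sup":
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    # Small inline sample standing in for a saved Wikipedia page
    sample = (
        '<div id="toc">Contents</div>'
        '<p><b>Stack Overflow</b> is a question-and-answer site'
        '<sup class="reference"><a href="#cite_note-1">[1]</a></sup>'
        " for programmers.</p>"
    )
    for paragraph in extract_paragraphs(sample):
        print(paragraph)
```

For a real page you would read the saved file into a string (for example with open(path, encoding="utf-8")) and pass it to extract_paragraphs. This is still tag-based, so it will not catch everything: infoboxes, tables and image captions are simply ignored, and any paragraph-like text outside <p> elements is lost.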

Cătălin George Feștilă · Accepted Answer · 2019-06-03 18:43:04Z

0

You can use this Python module:

pip install wikipedia
