84

I am using Python 2.7 + BeautifulSoup 4.3.2.

I am trying to use Python and BeautifulSoup to extract information from a webpage. Because the page is on the company website and requires login and redirection, I copied the target page's source into a file and saved it as “example.html” in C:\ so I can practice on it.

This is part of the page source:

<tr class="ghj">
    <td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&amp;u=12563">port_new_cape</a></td>
    <td class="position"><a href="./search.php?id=12563&amp;sr=positions" title="Search positions">452</a></td>
    <td class="details"><div>South</div></td>
    <td>May 09, 1997</td>
    <td>Jan 23, 2009 12:05 pm&nbsp;</td>
</tr>

The code I worked out so far is:

from bs4 import BeautifulSoup
import re
import urllib2

url = "C:\example.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

cities = soup.find_all('span', {'class' : 'city-sh'})

for city in cities:
    print city

This is just the first stage of testing, so it's somewhat incomplete.

However, when I run it, I get the error below. It seems urllib2.urlopen cannot open a local file path this way.

 Traceback (most recent call last):
   File "C:\Python27\Testing.py", line 8, in <module>
     page = urllib2.urlopen(url)
   File "C:\Python27\lib\urllib2.py", line 127, in urlopen
     return _opener.open(url, data, timeout)
   File "C:\Python27\lib\urllib2.py", line 404, in open
     response = self._open(req, data)
   File "C:\Python27\lib\urllib2.py", line 427, in _open
     'unknown_open', req)
   File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
     result = func(*args)
   File "C:\Python27\lib\urllib2.py", line 1247, in unknown_open
     raise URLError('unknown url type: %s' % type)
 URLError: <urlopen error unknown url type: c>

How can I practice using a local file?

2
  • 10
    Please try soup = BeautifulSoup(open(url).read()). Note that url should be url = r"C:\example.html"; otherwise the `\` in url acts as an escape character. Commented Feb 5, 2014 at 7:17
  • 2
    Thank you, Chandan. I changed it to url = r"C:\example.html", page = open(url), soup = BeautifulSoup(page.read()), and it works. urllib2.urlopen is not needed in my case. Commented Feb 5, 2014 at 7:29
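
The escape-character point in the first comment can be checked without touching any files. Below is a minimal sketch in Python 3 syntax; the path `C:\temp\new.html` is a made-up example, chosen only because it contains `\t` and `\n`. As far as I can tell, `"C:\example.html"` happens to survive because `\e` is not a recognised escape sequence, but a raw string avoids relying on that.

```python
# Hypothetical Windows paths, used only to illustrate escaping.
plain = "C:\\temp\\new.html"   # backslashes escaped by hand
raw = r"C:\temp\new.html"      # raw string: backslashes are kept literally
broken = "C:\temp\new.html"    # "\t" becomes a tab and "\n" a newline

print(plain == raw)            # True
print(broken == raw)           # False
print(len(raw), len(broken))   # 16 14
```

Either the raw string or the doubled backslashes gives the path that open() expects.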

3 Answers

143

The best way to open a local file with BeautifulSoup is to pass it an open file handle directly. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

from bs4 import BeautifulSoup

with open("C:\\example.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

for city in soup.find_all('span', {'class' : 'city-sh'}):
    print(city)

5 Comments

It shows the warning. The answer is here
On Macs, soup = BeautifulSoup(open("/path/to/your/file.html"), "html.parser")
The best way? It shows ResourceWarning: unclosed file
@MatejJ Thanks for the heads-up. It looks like the docs (or the behavior) changed, and BeautifulSoup no longer closes the file for you. I updated the answer to use a context manager, matching the new documentation.
For Unicode html files: with open(filename, encoding='utf-8') as fp:
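
The two points raised in these comments (closing the file, and reading non-ASCII pages) can be sketched with the standard library alone. This is an illustration in Python 3, not part of the answer above: the temporary file and the sample markup are made up to stand in for example.html. Note that the built-in open() in Python 2.7 has no encoding argument; io.open accepts one there.

```python
import os
import tempfile

# Write a small UTF-8 sample page to a temporary file (stand-in for example.html).
html = '<span class="city-sh">Z\u00fcrich</span>'
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write(html)

# The with block closes the file on exit, even if reading raises an exception.
with open(path, encoding="utf-8") as fp:
    markup = fp.read()

print(fp.closed)        # True
print(markup == html)   # True
os.remove(path)         # clean up the temporary file
```

The resulting `markup` string (or `fp` itself, while still inside the with block) is what would be handed to BeautifulSoup along with a parser name such as 'html.parser'.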
45

With Chandan's help, the problem has been solved. All credit goes to him. :)

urllib2.urlopen is not needed here; the built-in open() reads a local file.

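from bs4 import BeautifulSoup
# import urllib2  # not needed for a local file

# Raw string, so the backslash is not treated as an escape character.
url = r"C:\example.html"

# Open the local file directly and close it when done.
with open(url) as page:
    soup = BeautifulSoup(page.read())

cities = soup.find_all('span', {'class' : 'city-sh'})

for city in cities:
    print city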
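
For context on the original error: a bare Windows path is parsed as a URL whose scheme is the drive letter, which is where "unknown url type: c" comes from. If urlopen is still wanted, a file:// URL should work. The sketch below uses Python 3, where urllib.request takes the place of urllib2; the temporary file and its contents are assumptions made so the example runs anywhere.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# A bare Windows path is parsed as if "C" were the URL scheme.
scheme = urlparse(r"C:\example.html").scheme
print(scheme)  # c

# Temporary stand-in for the saved page.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write('<span class="city-sh">demo</span>')

# Path.as_uri() builds a file:// URL from an absolute path.
file_url = Path(path).as_uri()
with urlopen(file_url) as page:
    body = page.read().decode("utf-8")

print(body)
os.remove(path)  # clean up the temporary file
```

For a file that is already on disk, plain open() as shown above remains the simpler choice.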

3 Comments

If urllib2.url is useless, then do you still need the import urllib2?
I would replace soup = BeautifulSoup(page.read()) with soup = BeautifulSoup(page.read(), features="lxml") in order to be able to navigate the DOM properly.
@Haddock-san, I have recent findings at stackoverflow.com/questions/58300101/…, you may want to have a look.
6

You can also try the lxml parser. Here is an example for your HTML data.

import lxml.html

# Read the saved page and close the file when done.
with open('example.html') as f:
    data = f.read()
root = lxml.html.fromstring(data)

# iter("td") visits only <td> elements; text_content() returns their text.
for ele in root.iter("td"):
    print ele.text_content()

Output: port_new_cape 452 South May 09, 1997 Jan 23, 2009 12:05 pm 
