I am trying to extract data from a table on a web page with Beautiful Soup. I want to get the data inside the cells of each row.

I am new to Python and have tried the following snippet, but it's not working:

import urllib.request
fname = r"C:\Python34\page.htm"
HtmlFile = open(fname, 'r', encoding='utf-8')
source_code = HtmlFile.read()
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code, 'html.parser')
table = soup.find( "table", {"title":"geoip-demo-results-tbody"} )
rows=list()
for row in table.findAll("tr"):
   rows.append(row)
for tr in rows:
    cols = tr.findAll('td')
    p = col[0].string.strip()
    d = col[1].string.strip()
    print(p)
    print(d)

EDIT: I'm getting this error:

Traceback (most recent call last):
  File "C:\Python34\scrip.py", line 14, in <module>
    d = cols[1].text.strip()
IndexError: list index out of range

It occurs for the row 84.78.229.78 | ES | Santander, Cantabria, Cantabria, Spain, Europe | 39001 | 43.4647, -3.8044 | Orange Espana | Orange Espana. This is the HTML file that generated the above error: www.pastebin.com/tQ3Cp5Wj. Thanks.

  • There are a couple of minor mistakes in the code. Apart from that, the major problem is that the data in the table is not in the HTML source of the page; it is populated by separate AJAX calls. Commented Sep 13, 2015 at 14:10
  • But the data is present in the source of the page. I'm saving the page to a folder and then running Python on it to get the values for individual rows, so AJAX should not be a problem. Commented Sep 13, 2015 at 14:21
  • Could you please check again that the data is available in the saved source? The target table appears in the source as <tbody id="geoip-demo-results-tbody" > </tbody>. Commented Sep 13, 2015 at 14:25
  • Yes, please check this; the data is present in the table. I'm saving the web page after it is populated: pastebin.com/FAFijXc0 Commented Sep 13, 2015 at 14:29

1 Answer

from bs4 import BeautifulSoup

fname = r"F:\Vikas\jobs\temp\page.htm"
# Read the saved page; the context manager closes the file afterwards.
with open(fname, 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()

soup = BeautifulSoup(source_code, 'html.parser')

# The results are in a <tbody> identified by its id attribute.
table = soup.find('tbody', id='geoip-demo-results-tbody')
rows = table.find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    # First two cells of the row.
    p = cols[0].text.strip()
    d = cols[1].text.strip()
    print(p)
    print(d)
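
The main change from the question's code is the lookup. The saved page marks the results with id="geoip-demo-results-tbody" on a <tbody>, not with a title attribute on a <table>, so the question's soup.find call returns None and calling .findAll on that result raises AttributeError. Here is a minimal sketch of the difference. The HTML fragment is a made-up, trimmed-down stand-in for the real page, assumed only for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page (not the real markup).
sample_html = """
<table>
  <tbody id="geoip-demo-results-tbody">
    <tr><td>84.78.229.78</td><td>ES</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Lookup from the question: no <table> carries this title attribute.
by_title = soup.find('table', {'title': 'geoip-demo-results-tbody'})

# Lookup from the answer: match the <tbody> by its id.
by_id = soup.find('tbody', id='geoip-demo-results-tbody')

print(by_title)            # None, because nothing matched
print(by_id is not None)   # True, because the <tbody> was found

# Collect the stripped text of every cell under the matched <tbody>.
first_cells = [td.text.strip() for td in by_id.find_all('td')]
print(first_cells)
```

Checking the result of soup.find for None before using it makes this kind of selector mistake easier to spot.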

5 Comments

I'm getting an IndexError: Traceback (most recent call last): File "C:\Python34\scrip.py", line 14, in <module> d = cols[1].text.strip() IndexError: list index out of range, for the row <td>84.78.229.78</td><td>ES</td><td>Santander,<br>Cantabria,<br>Cantabria,<br>Spain,<br>Europe</td><td>39001</td><td>43.4647,<br>-3.8044</td><td>Orange Espana</td><td>Orange Espana</td><td></td><td></td>. Check this paste for the HTML: pastebin.com/tQ3Cp5Wj. Thanks.
Please look at the HTML carefully. Some rows do not have more than one <td> cell, hence the error for those rows. You will have to handle such rows specifically, for example with try/except blocks.
Oh, thanks. I thought it was a parsing problem. I used try/except for the first time and it worked, thanks :-)
A suggestion: catch only the exception you expect rather than using a generic except clause. In your scenario, you should handle just IndexError (except IndexError).
Yeah, I have used the IndexError exception.
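
The approach settled on in these comments could look roughly like the sketch below: catch only IndexError and skip rows that have fewer than two cells. The sample markup is invented for illustration, and the short rows in the real page may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one full row and one row with a single cell.
sample_html = """
<table>
  <tbody id="geoip-demo-results-tbody">
    <tr><td>84.78.229.78</td><td>ES</td><td>39001</td></tr>
    <tr><td>no second cell here</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('tbody', id='geoip-demo-results-tbody')

results = []
for tr in table.find_all('tr'):
    cols = tr.find_all('td')
    try:
        p = cols[0].text.strip()
        d = cols[1].text.strip()
    except IndexError:
        # Row has fewer than two <td> cells; skip it.
        continue
    results.append((p, d))

print(results)
```

An explicit check such as `if len(cols) < 2: continue` would work as well and avoids using an exception for control flow.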
