Extracting data from table on web page

Question

I am trying to extract data from a table on a web page with Beautiful Soup. I want to get the data inside the cells of each row.

I am new to Python and have tried the following snippet, but it's not working:

import urllib.request
fname = r"C:\Python34\page.htm"
HtmlFile = open(fname, 'r', encoding='utf-8')
source_code = HtmlFile.read()
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code, 'html.parser')
table = soup.find( "table", {"title":"geoip-demo-results-tbody"} )
rows=list()
for row in table.findAll("tr"):
   rows.append(row)
for tr in rows:
    cols = tr.findAll('td')
    p = col[0].string.strip()
    d = col[1].string.strip()
    print(p)
    print(d)

EDIT: I'm getting this error:

Traceback (most recent call last):
  File "C:\Python34\scrip.py", line 14, in <module>
    d = cols[1].text.strip()
IndexError: list index out of range

for the row 84.78.229.78 | ES | Santander, Cantabria, Cantabria, Spain, Europe | 39001 | 43.4647, -3.8044 | Orange Espana | Orange Espana. This is the HTML file which generated the above error: www.pastebin.com/tQ3Cp5Wj. Thanks.

There are a couple of minor mistakes in the code. Apart from that, the major problem is that the data in the table is not in the HTML source of the page. It is populated by separate AJAX calls.
– Vikas Ojha, Commented Sep 13, 2015 at 14:10
But the data is present in the source of the page. I'm saving the page to a folder and then planning to run Python on it so that I can get the values for individual rows, so AJAX might not be a problem.
– BLACKMAMBA, Commented Sep 13, 2015 at 14:21
Could you please check again that the data is available in the saved source? The target table is displayed in the source as <tbody id="geoip-demo-results-tbody" > </tbody>
– Vikas Ojha, Commented Sep 13, 2015 at 14:25
Yes, please check this; it is present in the table. I'm saving the web page after it is populated: pastebin.com/FAFijXc0
– BLACKMAMBA, Commented Sep 13, 2015 at 14:29
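
One way to settle the point debated in these comments is to count the rows inside the target tbody of the saved file. The sketch below is only an illustration: the two inline HTML strings are made-up stand-ins for a page saved before and after the table was populated, and `count_result_rows` is a hypothetical helper name, not something from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the saved page: one saved before the AJAX
# calls filled the table, one saved after.
EMPTY_PAGE = '<table><tbody id="geoip-demo-results-tbody"> </tbody></table>'
FILLED_PAGE = (
    '<table><tbody id="geoip-demo-results-tbody">'
    '<tr><td>84.78.229.78</td><td>ES</td></tr>'
    '</tbody></table>'
)


def count_result_rows(html):
    """Return the number of <tr> rows in the results tbody (0 if it is missing)."""
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody', id='geoip-demo-results-tbody')
    if tbody is None:
        return 0
    return len(tbody.find_all('tr'))


print(count_result_rows(EMPTY_PAGE))   # 0: the data is not in this source
print(count_result_rows(FILLED_PAGE))  # 1: the data was saved with the page
```

If the count for the real saved file is zero, the rows were filled in by a script after the page loaded and are not available to the parser.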

Vikas Ojha · Accepted Answer · 2015-09-13 14:34:41Z

1

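from bs4 import BeautifulSoup

fname = r"F:\Vikas\jobs\temp\page.htm"
# Read the saved page; the with block closes the file afterwards
with open(fname, 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
soup = BeautifulSoup(source_code, 'html.parser')

# The rows live in a <tbody> identified by its id, not by a title attribute
table = soup.find('tbody', id='geoip-demo-results-tbody')
rows = table.find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    # Note: a row with fewer than two cells raises IndexError here
    p = cols[0].text.strip()
    d = cols[1].text.strip()
    print(p)
    print(d)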


5 Comments

BLACKMAMBA Over a year ago

I'm getting an IndexError: Traceback (most recent call last): File "C:\Python34\scrip.py", line 14, in <module> d = cols[1].text.strip() IndexError: list index out of range, for the row <td>84.78.229.78</td><td>ES</td><td>Santander,<br>Cantabria,<br>Cantabria,<br>Spain,<br>Europe</td><td>39001</td><td>43.4647,<br>-3.8044</td><td>Orange Espana</td><td>Orange Espana</td><td></td><td></td> Check this paste for the HTML: pastebin.com/tQ3Cp5Wj. Thanks.

Vikas Ojha Over a year ago

Please look at the HTML carefully. Some rows do not have more than one cell, hence the error for those rows. You will have to handle such rows explicitly, using try/except blocks.
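
The try/except handling suggested here could look roughly like the sketch below. The sample HTML is made up for illustration (the second row has a single cell, like the rows that triggered the error), and skipping short rows is only one possible choice; the original thread does not say what the asker did with them.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second row has only one cell.
SAMPLE = (
    '<table><tbody id="geoip-demo-results-tbody">'
    '<tr><td>84.78.229.78</td><td>ES</td></tr>'
    '<tr><td>no second cell</td></tr>'
    '</tbody></table>'
)

soup = BeautifulSoup(SAMPLE, 'html.parser')
tbody = soup.find('tbody', id='geoip-demo-results-tbody')

pairs = []
for tr in tbody.find_all('tr'):
    cols = tr.find_all('td')
    try:
        pairs.append((cols[0].text.strip(), cols[1].text.strip()))
    except IndexError:
        # The row has fewer than two cells, so skip it.
        continue

print(pairs)
```

Only the complete row ends up in `pairs`; the one-cell row is skipped instead of stopping the loop.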

BLACKMAMBA Over a year ago

Oh, thanks. I thought it was a parsing problem. I used try/except for the first time and it worked, thanks :-)

Vikas Ojha Over a year ago

A suggestion: catch only the exception you expect rather than using a generic except clause. In your scenario, you should handle just except IndexError.
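
A small sketch of why the narrow handler matters, using plain lists instead of parsed cells to keep it short. `second_cell` is a hypothetical helper written for this illustration: it treats a short row as an expected case, while any other mistake still raises.

```python
def second_cell(cells):
    """Return the second item stripped, or None when the row is too short."""
    try:
        return cells[1].strip()
    except IndexError:
        # Expected case: the row has fewer than two items.
        return None


print(second_cell(['84.78.229.78', ' ES ']))
print(second_cell(['only one']))

# A different problem (None where a string was expected) is not an
# IndexError, so it is not swallowed and can be noticed and fixed.
try:
    second_cell(['a', None])
except AttributeError as exc:
    print('unexpected problem surfaced:', exc)
```

With a bare except, the last call would quietly return None and hide the bad data.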

BLACKMAMBA Over a year ago

Yes, I have used the IndexError exception.
