Search for a string inside HTML source with Python (3.3.1)

Question

I am working on a project to get information from a web page. In the HTML source I have the following:

Resultado de Busca: Foram encontrados 264 casais

I need to get the number between "encontrados" and "casais".

Is there any way to do that in Python? What string function should I use? I want to avoid using regular expressions in this case.

import urllib.request
f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
s = f.read()

print(s.split())

I got this so far, but now I am having trouble finding the number I need.

import urllib.request
f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
s = f.read()

num = int(s[s.index("encontrados")+len("encontrados"):s.index("casais")])

This gives me the error below:

TypeError: Type str doesn't support the buffer API

Ben · Accepted Answer · 2013-07-06 19:49:24Z

5

I'd recommend using a library such as Beautiful Soup if it's HTML you want to parse. No need for regex.

EDIT

Using the URL you just added, this is the sample code to get the HTML object out:

# Note: this is Python 2 code using BeautifulSoup 3
import BeautifulSoup
import re
import urllib

data = urllib.urlopen('http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07').read()
soup = BeautifulSoup.BeautifulSoup(data)
element = soup.find('span', attrs={'class': re.compile(r".*\btxt_resultad_busca_casamento\b.*")})
print element.text

This will find the HTML span element on the page that has the class txt_resultad_busca_casamento, which I believe is the data you're trying to extract. From there you can just parse the .text attribute to get the exact data you're interested in.
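As a rough sketch of that last step, assuming the span's text looks like the string in the question, str.partition can isolate the number without a regex. The helper name count_from_text is hypothetical, not part of any library.

```python
def count_from_text(text):
    """Return the integer between 'encontrados' and 'casais', or None."""
    # partition() splits on the first occurrence: (before, separator, after)
    _, found, rest = text.partition("encontrados")
    if not found:
        return None
    middle, found, _ = rest.partition("casais")
    if not found:
        return None
    middle = middle.strip()
    # isdigit() rejects the case where nothing numeric sits between the markers
    return int(middle) if middle.isdigit() else None


print(count_from_text("Resultado de Busca: Foram encontrados 264 casais"))
print(count_from_text("Não foram encontrados casais"))
```

Returning None instead of raising keeps the caller in control when the page reports no results.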

EDIT 2

Oops, just realised that uses regular expressions... it seems class matching in BeautifulSoup isn't perfect! This line should work instead, at least until the site changes their HTML:

element = soup.find('div', attrs={'id': 'ctl00_body_uppBusca'}).find('span')
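For readers on Python 3 who would rather not install a third-party package, the same lookup (first span inside the div with that id) can be approximated with the standard library's html.parser. This swaps in a different technique from Beautiful Soup; the class name SpanInDiv and the sample HTML are hypothetical, and the real page's markup may differ.

```python
from html.parser import HTMLParser


class SpanInDiv(HTMLParser):
    """Collect the text of the first <span> inside the <div> with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.div_depth = 0    # greater than 0 while inside the target div
        self.in_span = False
        self.done = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "div":
            # Start counting at the target div, then track nested divs
            if self.div_depth or dict(attrs).get("id") == self.div_id:
                self.div_depth += 1
        elif tag == "span" and self.div_depth and not self.in_span:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "span" and self.in_span:
            self.in_span = False
            self.done = True

    def handle_data(self, data):
        if self.in_span:
            self.text += data


# Hypothetical markup standing in for the downloaded page
sample = (
    '<span>ignored</span>'
    '<div id="ctl00_body_uppBusca"><p>x</p>'
    '<span class="txt_resultad_busca_casamento">'
    'Resultado de Busca: Foram encontrados 264 casais</span></div>'
)
parser = SpanInDiv("ctl00_body_uppBusca")
parser.feed(sample)
print(parser.text)
```

This is much more verbose than the Beautiful Soup one-liner and does not handle nested spans, so it is only worth it when adding a dependency is not an option.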

edited Jul 6, 2013 at 19:49

answered Jul 6, 2013 at 19:38

Ben


1 Comment

Ben Over a year ago

Not a problem. I would have helped with parsing the actual data, but that URL seems to return "Não foram encontrados casais" rather than "Resultado de Busca: Foram encontrados 264 casais".

DRC · Accepted Answer · 2013-07-06 19:33:45Z

1

Given that you can't reliably parse HTML with regular expressions: if you treat the page as plain text, you have to use either a regex or string methods, something like:

a = 'Resultado de Busca: Foram encontrados 264 casais' #your page text
num = int(a[a.index("encontrados")+len("encontrados"):a.index("casais")])
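One caveat for Python 3: urlopen(...).read() returns bytes, not str, and calling .index("encontrados") on a bytes object with a str argument is what raises the TypeError shown in the question. A minimal sketch of the fix follows, assuming the page is UTF-8 encoded (an assumption; check the charset the server actually reports). The raw variable is a stand-in for f.read().

```python
# Stand-in for f.read(); in Python 3 this is a bytes object.
raw = "Resultado de Busca: Foram encontrados 264 casais".encode("utf-8")

# Decode to str first; "utf-8" is an assumption about the page's encoding.
page = raw.decode("utf-8", errors="replace")

start = page.index("encontrados") + len("encontrados")
end = page.index("casais", start)  # search only after the first marker
num = int(page[start:end])         # int() ignores surrounding whitespace
print(num)
```

Passing start as the second argument to index() avoids matching a "casais" that appears earlier in the page than "encontrados".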

answered Jul 6, 2013 at 19:33

DRC


sedavidw · Accepted Answer · 2013-07-06 19:32:28Z

0

Are you sure about the format of that string? If you have a string like that (and always will), you can use:

s = "Resultado de Busca: Foram encontrados 264 casais"
items = s.split()

Your number would be at index 5 in items. Note that items[5] is the string '264', so wrap it in int() if you need an integer.
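If the position of the number might shift, a slightly more defensive sketch looks for the token right after "encontrados" instead of hard-coding index 5. The function name number_after is hypothetical.

```python
def number_after(text, marker="encontrados"):
    """Return the integer token that follows `marker`, or None if absent."""
    items = text.split()
    # Stop one short of the end so items[i + 1] always exists
    for i, word in enumerate(items[:-1]):
        if word == marker and items[i + 1].isdigit():
            return int(items[i + 1])
    return None


s = "Resultado de Busca: Foram encontrados 264 casais"
print(s.split()[5])     # fixed-index lookup; the result is still a str
print(number_after(s))  # marker-based lookup; the result is an int
```

The marker-based version keeps working if extra words appear before "encontrados", at the cost of a short loop.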

answered Jul 6, 2013 at 19:32

sedavidw
