I am working on a project to extract information from a web page. In the HTML source I have the following:

Resultado de Busca: Foram encontrados 264 casais

I need to get the number between "encontrados" and "casais".

Is there any way to do that in Python? What string function should I use? I want to avoid using regular expressions in this case.

import urllib.request
f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
s = f.read()

print(s.split())

This is what I have so far, but I am having trouble finding the number I need.

import urllib.request
f = urllib.request.urlopen("http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07")
s = f.read()

num = int(s[s.index("encontrados")+len("encontrados"):s.index("casais")])

This gives me the error below:

TypeError: Type str doesn't support the buffer API

3 Answers

5

I'd recommend using a library such as Beautiful Soup if it's HTML you want to parse. No need for regex.

EDIT

Using the URL you just added, this is the sample code to get the HTML object out:

import re
import urllib.request

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

url = 'http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07'
data = urllib.request.urlopen(url).read()
soup = BeautifulSoup(data, 'html.parser')
# Match any span whose class attribute contains txt_resultad_busca_casamento
element = soup.find('span', attrs={'class': re.compile(r".*\btxt_resultad_busca_casamento\b.*")})
print(element.text)

This will find the HTML span element on the page that has the class txt_resultad_busca_casamento, which I believe is the data you're trying to extract. From there you can just parse the .text attribute to get the exact data you're interested in.

EDIT 2

Oops, just realised that uses regular expressions... it seems class matching in BeautifulSoup isn't perfect! This line should work instead, at least until the site changes their HTML:

element = soup.find('div', attrs={'id': 'ctl00_body_uppBusca'}).find('span')
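
To go from the element's text to the number itself without a regular expression, one option is to scan the whitespace-separated tokens for the first one made only of digits. The sketch below is a minimal illustration, not the site's actual format: `first_integer` is a hypothetical helper name, and the sample strings stand in for `element.text`.

```python
def first_integer(text):
    """Return the first all-digit, whitespace-separated token as an int, or None."""
    for token in text.split():
        # isdecimal() accepts only characters that int() can convert
        if token.isdecimal():
            return int(token)
    return None


# Sample strings standing in for element.text (assumed, not fetched live)
print(first_integer("Resultado de Busca: Foram encontrados 264 casais"))
print(first_integer("Não foram encontrados casais"))
```

The helper returns None when the text contains no number, so the caller can tell a "no results" page apart from a real count.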

1 Comment

Not a problem. I would have helped with parsing the actual data, but that URL seems to return "Não foram encontrados casais" ("No couples were found") rather than "Resultado de Busca: Foram encontrados 264 casais" ("Search result: 264 couples were found").

1

Given that you can't parse HTML with regular expressions, if you treat the page as plain text you have to use a regex or something like:

a = 'Resultado de Busca: Foram encontrados 264 casais'  # your page text
num = int(a[a.index("encontrados")+len("encontrados"):a.index("casais")])
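
Applied directly to the result of `urlopen(...).read()` in Python 3, this slicing fails with a TypeError, because `read()` returns bytes while "encontrados" is a str. A minimal sketch of one way around that is to decode first. It assumes the page is UTF-8 encoded (an assumption; the real encoding should be checked in the response headers), and `count_between` is a hypothetical helper name.

```python
def count_between(page_bytes, before="encontrados", after="casais", encoding="utf-8"):
    """Decode the page, then return the integer found between two marker words."""
    # Decode bytes to str so that str.index can be used with str markers
    text = page_bytes.decode(encoding, errors="replace")
    start = text.index(before) + len(before)
    # Search for the closing marker only after the opening one
    end = text.index(after, start)
    # int() ignores the surrounding whitespace in " 264 "
    return int(text[start:end])


# Bytes literal standing in for f.read() (assumed sample, not fetched live)
sample = "Resultado de Busca: Foram encontrados 264 casais".encode("utf-8")
print(count_between(sample))
```

`str.index` raises ValueError when a marker is missing, and `int` raises ValueError when anything other than a number sits between the markers, so both cases are worth catching in real use.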

0

Are you sure about the format of that string? If the string always has that form, you can use:

s = "Resultado de Busca: Foram encontrados 264 casais"
items = s.split()

Your number is then items[5]. It is still a string, so wrap it in int() to convert it.
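
Relying on a fixed position breaks if the wording before the number changes. A slightly more tolerant sketch looks up the marker word and takes the token after it, assuming the number always follows "encontrados" directly. `number_after` is a hypothetical helper name.

```python
def number_after(text, marker="encontrados"):
    """Return the token that follows `marker` in the split text, as an int."""
    items = text.split()
    # list.index raises ValueError if the marker word is absent
    position = items.index(marker)
    return int(items[position + 1])


s = "Resultado de Busca: Foram encontrados 264 casais"
print(number_after(s))
print(int(s.split()[5]))  # fixed-position variant from the answer above
```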
