
I am using Python 3.5. Here is the code:

import urllib.request
from xml.etree import ElementTree as ET

url = 'http://www.sat.gob.mx/informacion_fiscal/tablas_indicadores/Paginas/tipo_cambio.aspx'


def conectar(url):
    page = urllib.request.urlopen(url)
    return page.read()

root = ET.fromstring(conectar(url))
s = root.findall("//*[contains(.,'21/')]")

I need to extract the text containing '21/', but the code raises this error:

Traceback (most recent call last):
  File "crawler.py", line 11, in <module>
    root = ET.fromstring(conectar(url))
  File "/home/rg3915/.pyenv/versions/3.5.0/lib/python3.5/xml/etree/ElementTree.py", line 1321, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: unbound prefix: line 146, column 8

I do not know how to fix this error.

  • Why not use BeautifulSoup? Commented Dec 22, 2015 at 14:18
  • How would that work in this case? Commented Dec 22, 2015 at 14:20

2 Answers


You could start with:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.sat.gob.mx/informacion_fiscal/tablas_indicadores/Paginas/tipo_cambio.aspx'
response = urllib2.urlopen(url)
html = response.read()
dom = BeautifulSoup(html, 'html.parser')

tables = dom.find_all("table")
if len(tables):
    table = tables[0]
    print table

(Tested in Python 2.7. On Python 3, replace urllib2.urlopen with urllib.request.urlopen and write print(table).)


Although the document you are trying to parse claims to be XHTML, it is not valid XML, because this element uses an unbound prefix:

<gcse:search></gcse:search>

The gcse namespace prefix is never declared in the document, so a strict XML parser such as ElementTree rejects it.
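The error can be reproduced on a tiny document. The sketch below is only an illustration: the two strings are made-up stand-ins for the real page, and the namespace URI urn:example:gcse is a placeholder, not the one the site should use.

```python
from xml.etree import ElementTree as ET

# Hypothetical minimal documents; the real page is much larger.
# UNBOUND uses the gcse prefix without declaring it.
UNBOUND = "<html><body><gcse:search></gcse:search></body></html>"
# BOUND declares the prefix (placeholder URI), so it is well-formed XML.
BOUND = (
    '<html xmlns:gcse="urn:example:gcse"><body>'
    "<gcse:search></gcse:search></body></html>"
)


def try_parse(text):
    """Return the parsed root element, or the ParseError message as a string."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)


result_unbound = try_parse(UNBOUND)  # a message starting with "unbound prefix"
result_bound = try_parse(BOUND)      # an Element whose tag is "html"
print(result_unbound)
print(result_bound.tag)
```

Since you cannot edit the remote page to add the missing declaration, switching to a lenient HTML parser is usually the more practical route.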

BeautifulSoup would probably be much better suited for what you are trying to do, because it is not fussy about the document being 100% valid.
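If installing BeautifulSoup is not an option, the standard library's html.parser is similarly tolerant. The following is a minimal sketch, not a tested scraper for this site: TextFinder, find_text and the SAMPLE markup are hypothetical names and data made up for illustration. Note also that ElementTree's findall supports only a subset of XPath and has no contains() function, so the substring match is done in plain Python here.

```python
from html.parser import HTMLParser


class TextFinder(HTMLParser):
    """Collect every text node that contains a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []

    def handle_data(self, data):
        # Called for the text between tags; keep it if it contains the needle.
        text = data.strip()
        if self.needle in text:
            self.matches.append(text)


def find_text(html, needle):
    """Return the stripped text nodes of `html` that contain `needle`."""
    finder = TextFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up sample markup, including the undeclared gcse prefix.
SAMPLE = (
    "<html><body><gcse:search></gcse:search>"
    "<table><tr><td>21/12/2015</td><td>sample</td></tr>"
    "<tr><td>22/12/2015</td><td>sample</td></tr></table></body></html>"
)
print(find_text(SAMPLE, "21/"))
```

For the real page, you would pass the decoded response instead of SAMPLE, for example find_text(conectar(url).decode('utf-8', errors='replace'), '21/'), assuming the page is served as UTF-8.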
