8

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

What have I done wrong?

update 01

The HTML contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, such as BeautifulSoup, before passing it to pandas?

5
  • Well what's the content of age0.html? Commented Jul 31, 2014 at 10:05
  • pandas.pydata.org/pandas-docs/dev/generated/…: "flavor : str or None, container of strings The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib." My guess is that the HTML is not well formed and that the parsing is failing. Try a different parser, e.g. flavor='bs4'. Commented Jul 31, 2014 at 10:52
  • Just one more thing, in case it was not clear from the previous note: the read_html method can use Beautiful Soup as its parser; check the pandas documentation in the link above. The syntax error sounds to me as if the HTML is not well formed, and a different parser might be more tolerant. Commented Jul 31, 2014 at 10:57
  • Using lxml for this (and really any malformed HTML) is a bad idea. You should pip install beautifulsoup4 and pip install html5lib and call read_html without any flavor argument. These will be much slower, but I'll take slow and correct over fast and incorrect any day. Honestly, we should have thrown out lxml from the beginning, but it's a bit too late for that. Commented Jul 31, 2014 at 12:49
  • And you're getting this error because I force lxml to be strict. In the past lxml has dropped data on certain pieces of malformed HTML, which IMHO is just not cool. The other libs, OTOH do not do this and consequently do not drop data. Commented Jul 31, 2014 at 12:51
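
The parser choice discussed in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: SAMPLE_HTML is a made-up stand-in for age0.html (a script block followed by a table), and read_tables_lenient is a hypothetical helper, not part of pandas. It tries the tolerant bs4 + html5lib flavor first and only then lxml, and it needs at least one of those parsers installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for age0.html: some JavaScript, then one table.
SAMPLE_HTML = """
<html>
<head><script>var x = 1 < 2;</script></head>
<body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body>
</html>
"""


def read_tables_lenient(html, flavors=("bs4", "lxml")):
    """Try each read_html flavor in order and return the first result."""
    last_error = None
    for flavor in flavors:
        try:
            # StringIO makes it explicit that this is markup, not a path or URL.
            return pd.read_html(StringIO(html), flavor=flavor)
        except Exception as exc:  # parser not installed, strict parse error, ...
            last_error = exc
    raise ValueError(f"no flavor in {flavors} could parse the input") from last_error


tables = read_tables_lenient(SAMPLE_HTML)  # read_html returns a list of DataFrames
df = tables[0]
print(df)
```

Putting 'bs4' first trades speed for tolerance, which matches the advice in the comments. For a real file you would read its contents into a string first, or simply pass the path to read_html with flavor='bs4'.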

2 Answers 2

7

I think you are on the right track by using an HTML parser like Beautiful Soup. pandas.read_html() reads an HTML table, not an HTML page.

You would want to do something like this...

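from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Parse the whole page leniently (needs html5lib), then keep the first <table>.
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f, 'html5lib').find('table')

# read_html does not accept a BeautifulSoup object, so pass the markup as text.
# It returns a list of DataFrames; take the first one.
df = pd.read_html(StringIO(str(table)), flavor='bs4')[0]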

1 Comment

I couldn't get this solution to work (but I also couldn't install lxml, which probably had something to do with it). However, df = pd.read_html('path/to/file.html', flavor='bs4') worked fine.
3
  1. First, install the following packages for parsing:

    • pip install BeautifulSoup4
    • pip install lxml
    • pip install html5lib
  2. Then use read_html to read the HTML tables on any HTML page:


    import pandas as pds
    pds_df = pds.read_html('C:/age0.html')
    pds_df[0]
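
Since read_html returns a list with one DataFrame per table found, indexing with [0] only picks the first table. A small sketch of choosing a table by its content instead, assuming a made-up two-table page (PAGE is hypothetical) and that lxml is installed for the default flavor:

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second mentions "age".
PAGE = """
<html><body>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
</table>
</body></html>
"""

# Without match, every table on the page comes back, in document order.
all_tables = pd.read_html(StringIO(PAGE))

# match keeps only tables containing text that matches the given regex.
age_tables = pd.read_html(StringIO(PAGE), match="age")
age_df = age_tables[0]
print(age_df)
```

The match argument is a regular expression, so a short pattern can also match unrelated text on a busier page; a more specific pattern is safer there.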
    

I hope this will help.

Good Luck!!
