Using pandas to read downloaded html file

Question

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

What have I done wrong?

update 01

The HTML contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, such as BeautifulSoup, before handing it to pandas?

pandas.pydata.org/pandas-docs/dev/generated/…: "flavor : str or None, container of strings The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib." My guess is that the HTML is not well formed and that the parsing is failing. Try a different parser, e.g. flavor='bs4'?
– Joop, Commented Jul 31, 2014 at 10:52
Just one more note, in case it was not clear from the previous one: read_html can use Beautiful Soup as its parser; check the pandas documentation linked above. The syntax error sounds to me as if the HTML is not well formed, and a different parser might be more tolerant.
– Joop, Commented Jul 31, 2014 at 10:57
Using lxml for this (and really any malformed HTML) is a bad idea. You should pip install beautifulsoup4 and pip install html5lib and call read_html without any flavor argument. These will be much slower, but I'll take slow and correct over fast and incorrect any day. Honestly, we should have thrown out lxml from the beginning, but it's a bit too late for that.
– Phillip Cloud, Commented Jul 31, 2014 at 12:49
And you're getting this error because I force lxml to be strict. In the past lxml has dropped data on certain pieces of malformed HTML, which IMHO is just not cool. The other libs, OTOH, do not do this and consequently do not drop data.
– Phillip Cloud, Commented Jul 31, 2014 at 12:51
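
A minimal sketch of what the comments above suggest, assuming beautifulsoup4 and html5lib are installed. The HTML string is a hypothetical stand-in for the downloaded file (a script on top, then a table, with a stray second <html> tag to make it malformed); it is not the asker's actual page.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for C:/age0.html: script on top, then a table,
# plus a misplaced second <html> tag so the markup is not well formed.
html = """
<html><head><script>var x = 1;</script></head>
<body>
<html>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body></html>
"""

# flavor='bs4' makes pandas parse with Beautiful Soup + html5lib,
# which tolerate malformed markup instead of raising a syntax error.
tables = pd.read_html(StringIO(html), flavor='bs4')
df = tables[0]
print(df)
```

For a file on disk, the same call should work with the path in place of the StringIO object, as in pd.read_html('C:/age0.html', flavor='bs4').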

ZJS · Accepted Answer · 2014-07-31 21:34:32Z

7

I think you are on the right track using an HTML parser like Beautiful Soup. pandas.read_html() extracts HTML tables, so one option is to isolate the table first and pass only that to pandas.

You would want to do something like this...

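from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Parse the page leniently and keep only the first <table> element.
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html takes a string, path, or file-like object, not a BeautifulSoup tag,
# so serialize the table first. It returns a list of DataFrames.
df = pd.read_html(StringIO(str(table)))[0]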


1 Comment

user5051310 Over a year ago

I couldn't get this solution to work (but I also couldn't install lxml, which probably had something to do with it). However, df = pd.read_html('path/to/file.html', flavor='bs4') worked fine.

srana · Accepted Answer · 2018-01-05 08:06:13Z

3

First, install the packages below for parsing:
- pip install beautifulsoup4
- pip install lxml
- pip install html5lib

Then use read_html to read the HTML tables on any HTML page. It returns a list of DataFrames, one per table found.

import pandas as pds
pds_df = pds.read_html('C:/age0.html')
pds_df[0]
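
Since read_html returns a list, indexing with [0] picks the first table. A small sketch of choosing a specific table instead, using the match parameter of read_html; the two-table HTML string is hypothetical, and flavor='bs4' is assumed here only so the example does not depend on lxml.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; read_html returns one DataFrame per table.
html = """
<table><tr><th>name</th></tr><tr><td>a</td></tr></table>
<table><tr><th>age</th><th>count</th></tr><tr><td>0</td><td>12</td></tr></table>
"""

all_tables = pd.read_html(StringIO(html), flavor="bs4")
print(len(all_tables))

# match keeps only the tables whose text matches the given string or regex.
age_tables = pd.read_html(StringIO(html), match="age", flavor="bs4")
age_df = age_tables[0]
print(age_df)
```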

I hope this will help.

Good Luck!!
