As the title says, I tried using read_html, but it gives me the following error:
In [17]: temp = pd.read_html('C:/age0.html', flavor='lxml')
File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6
What have I done wrong?
update 01
The HTML file contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, like BeautifulSoup, before handing it to pandas?
Using lxml for this (and really any malformed HTML) is a bad idea. You should pip install beautifulsoup4 and pip install html5lib, and call read_html without any flavor argument. These will be much slower, but I'll take slow and correct over fast and incorrect any day. Honestly, we should have thrown out lxml from the beginning, but it's a bit too late for that. Expect lxml to be strict. In the past, lxml has dropped data on certain pieces of malformed HTML, which IMHO is just not cool. The other libs, OTOH, do not do this and consequently do not drop data.
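For illustration, here is a minimal sketch of that approach. The markup below is a made-up stand-in for the file in the question (a script block ahead of a stray <html> tag, then a table), and the column names are hypothetical. It assumes beautifulsoup4 and html5lib are installed, and it passes flavor="bs4" only to make the parser choice explicit; leaving flavor out, as suggested above, lets pandas pick a parser itself.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the file in the question: JavaScript first,
# then a late <html> tag, then the table we actually want.
malformed_html = """
<script type="text/javascript">var x = 1;</script>
<html>
<body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body>
</html>
"""

# flavor="bs4" tells pandas to parse with BeautifulSoup on top of html5lib.
# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(malformed_html), flavor="bs4")
df = tables[0]
print(df)
```

For a file on disk, the same call should work with the path in place of the StringIO object, e.g. pd.read_html('C:/age0.html').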