Python - BeautifulSoup error while scraping

Question

UPDATE: Using lxml instead of html.parser solved the problem, as Freddy suggested in the answer below!

I am trying to webscrape some information off of this website: https://www.ticketmonster.co.kr/deal/952393926.

I get an error when I run soup(thispage, 'html.parser'), but the error only occurs for this specific page. Does anyone know why this is happening?

The code I have so far is very simple:

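from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.ticketmonster.co.kr/deal/952393926'  # the page mentioned above

openU = urlopen(url)
thispage = openU.read()
openU.close()  # close the response object (not the built-in open)

pageS = soup(thispage, 'html.parser')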

The error I get is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
    parser.feed(markup)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()

Please help!

Freddy · Accepted Answer · 2018-04-12 03:03:03Z

2

Try using

pageS = soup(thispage, 'lxml')

instead of

pageS = soup(thispage, 'html.parser')

It looks like it may be a character-encoding problem when using "html.parser".
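If the script has to handle many pages, one option is to try lxml first and fall back to html.parser only when lxml is unavailable or fails. The traceback in the question passes through parse_marked_section, which suggests the built-in parser stumbled on a "<![" construct in the page rather than on the page as a whole, though that is only a reading of the traceback. The sketch below is a minimal illustration and not part of the original answer: the helper name parse_with_fallback and the sample markup are hypothetical, and it catches Exception broadly because the error type raised by a failing parser differs between Python and bs4 versions.

```python
from bs4 import BeautifulSoup


def parse_with_fallback(markup, parsers=("lxml", "html.parser")):
    """Return a BeautifulSoup tree built by the first parser that succeeds.

    Hypothetical helper for illustration; the parser order is an assumption.
    """
    last_error = None
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except Exception as exc:
            # Broad on purpose: a missing parser and a parser crash
            # raise different exception types depending on the version.
            last_error = exc
    raise RuntimeError(f"no parser in {parsers!r} could handle the markup") from last_error


# Stand-in markup; in the question this would be the bytes read from urlopen(url).
page = parse_with_fallback("<html><body><p>hello</p></body></html>")
print(page.p.get_text())
```

Note that different parsers can build slightly different trees from broken HTML, so selectors written against one parser's output may need checking against the other's.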

edited Apr 12, 2018 at 3:03

answered Apr 12, 2018 at 2:28


3 Comments

wwii Over a year ago

Please do not post images of code or data. Copy from your editor/IDE and paste it as text formatted as code.

SuperStew Over a year ago

"As you are in Python3, is preferable to use mechanicalsoup" How do you figure that? bs4 is widely used and I've never even heard of mechanicalsoup.

Freddy Over a year ago

Sorry @wwii, I did want to show the code result. I'll edit to add the code.
