Extracting strings from HTML with Python won't work with regex or BeautifulSoup

Question

I'm using Python 2.7, BeautifulSoup4, regex, and requests on Windows 7.

I've scraped some code from a website and I am having problems parsing and extracting the bits I want and storing them in a dictionary. What I'm after is text that is presented as follows in the code:

@CAD_DTA\">I WANT THIS@G@H@CAD_LBL

There are about 50-60 short strings I want to extract and store. They are all preceded by @CAD_DTA\"> and followed by @G@H@CAD_LBL in the code, and they vary in length.

I've tried:

re.search('@CAD_DTA\">(.+?)@G@H@CAD_LBL',result.text)

where result is the output of s.post(url, data = cookie, headers = {'referer': my_referer})

I've also tried passing str(result.text)

but re.search keeps returning None. It's odd because if I literally copy and paste the content of result.text into a string and pass that through re.search it works fine.

I've tried using re.search('@CAD_DTA">(.+?)@G@H@CAD_LBL',result.text) in case the \ is being treated as an escape character or something. I don't know.

Can someone point me in the right direction?

Is there a literal backslash before the double quote? If so, re.search(r'@CAD_DTA\\">(.+?)@G@H@CAD_LBL',result.text) should work. – Wiktor Stribiżew, Commented Jun 22, 2015 at 16:52
That works! Thanks. I had tried the double backslash but without the 'r'. Is there any way to get the location where the string was found, so I can then search again starting at that position? – Gustavo Costa, Commented Jun 22, 2015 at 17:10

Wiktor Stribiżew · Accepted Answer · 2015-06-22 17:14:21Z

To match a literal backslash, you need to escape it in the pattern (write it twice) and use a raw string literal so that Python passes both backslashes through to the regex engine, e.g.:

re.search(r'@CAD_DTA\\">(.+?)@G@H@CAD_LBL',result.text)
          ^          ^
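
The difference between the three spellings tried in this thread can be seen in a short sketch. The sample text below is an assumption: it stands in for result.text and contains exactly one literal backslash before the double quote.

```python
import re

# Assumed stand-in for result.text: one literal backslash before the quote.
# (In a non-raw string literal, '\\' is a single backslash character.)
text = 'Some text here...@CAD_DTA\\">I WANT THIS@G@H@CAD_LBL'

# Non-raw, single backslash: Python turns \" into ", so the regex
# never looks for a backslash at all.
plain = re.search('@CAD_DTA\">(.+?)@G@H@CAD_LBL', text)

# Non-raw, double backslash: Python collapses \\ to \, and the regex
# engine then reads \" as an escaped quote, again with no backslash.
doubled = re.search('@CAD_DTA\\">(.+?)@G@H@CAD_LBL', text)

# Raw, double backslash: the regex engine receives \\ and matches
# one literal backslash in the text.
raw = re.search(r'@CAD_DTA\\">(.+?)@G@H@CAD_LBL', text)

print(plain)         # None
print(doubled)       # None
print(raw.group(1))  # I WANT THIS
```

This would also explain why pasting the content into a Python string literal appeared to work: the pasted \" is interpreted as a plain double quote, so the backslash disappears from the test string.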

To get the index of the match, use the start([group]) method of the match object (re.MatchObject):

IDEONE demo:

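import re
# The subject string is not raw, so '\\' in it is a single literal backslash.
obj = re.search(r'@CAD_DTA\\">(.+?)@G@H@CAD_LBL', 'Some text here...@CAD_DTA\\">I WANT THIS@G@H@CAD_LBL')
print(obj.start(1))  # index at which the captured group starts
print(obj.group(1))  # the captured text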
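
Since the question mentions 50-60 such strings and a dictionary, searching again from each match position by hand may not be needed. As a minimal sketch, re.finditer can walk every match in one pass. The helper name extract_fields and the sample text are hypothetical, and keying the dictionary by start offset is only an assumption, because the question does not say what the keys should be.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r'@CAD_DTA\\">(.+?)@G@H@CAD_LBL')


def extract_fields(text):
    """Return {start_offset: captured_string} for every match in text.

    Hypothetical helper: the offset is used as the key only because
    the question does not specify one.
    """
    return {m.start(1): m.group(1) for m in PATTERN.finditer(text)}


# Assumed sample data with two delimited strings.
sample = ('@CAD_DTA\\">FIRST@G@H@CAD_LBL junk '
          '@CAD_DTA\\">SECOND VALUE@G@H@CAD_LBL')

fields = extract_fields(sample)
for pos in sorted(fields):
    print('%d: %s' % (pos, fields[pos]))
```

If resuming manually is still preferred, the search method of a compiled pattern accepts an optional pos argument, so PATTERN.search(text, previous_match.end()) continues from where the last match ended.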

1 Comment

Wiktor Stribiżew Over a year ago

I am also happy to help someone using appropriate tools for concrete tasks :)
