Can't get a regex pattern to work in Python

Question

I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.

<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>

I can get the first value by using

match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)

That works because the first value sits on a single line. However, I also need the second value, which is on the line following the first one, and I cannot get that to work. I have tried the following, but I don't get a match:

match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
                       '<td width="65.+?value="(.+?)"></td>').findall(html_source_det)

Perhaps it fails because the text spans multiple lines, but I added "\n" at the end of the first line, which I thought would handle that. It did not.

What am I doing wrong?

The html_source is retrieved by downloading it (it is not a static HTML string like the one shown above; I only included that so you could see the text). Maybe this is not the best way of getting the source.

I am obtaining the html_source like this:

new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()

Take a look into ^ and $ for regular expressions (instead of using \n).
– g3rv4, Commented Jul 7, 2015 at 15:19
Your code works for me. Please provide the shortest possible complete program that demonstrates your error. See stackoverflow.com/help/mcve for more info.
– Robᵩ, Commented Jul 7, 2015 at 15:21
Take a look at this post that suggests why you shouldn't use a regex.
– Malik Brahimi, Commented Jul 7, 2015 at 15:21
Rob: indeed your code works! Maybe it works because the html_source is a static string. I posted the string so you could see it, but actually I get it by downloading it. I updated my question with the code showing how I get the html_source. Maybe there are some encoding issues or stray non-printable characters I need to get rid of.
– moster67, Commented Jul 7, 2015 at 16:16
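
One way to test the "non-printable characters" guess from the comments is to look at repr() of the downloaded text and to let the pattern accept any whitespace between the two cells. The sketch below is only an illustration: the "\r\n" line endings and the indentation in the sample string are assumptions about what the server might send, not something confirmed in the question.

```python
import re

# Sample standing in for the downloaded page. The "\r\n" line endings and
# the leading spaces are assumptions, not taken from the real response.
html_source_det = (
    '<tr>\r\n'
    '  <td width="35%">Demand No</td>\r\n'
    '  <td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>\r\n'
)

# repr() makes otherwise invisible characters such as "\r" or "\t" visible.
print(repr(html_source_det[:60]))

# \s* between the two cells accepts "\n", "\r\n", and any indentation,
# where a literal "\n" would only accept a bare newline.
pattern = re.compile(
    r'<td width="35[^"]*">(.+?)</td>\s*'
    r'<td width="65[^"]*">.*?value="(.+?)"></td>'
)
matches = pattern.findall(html_source_det)
print(matches)
```

If repr() shows "\r\n" or leading spaces in the real page, that would explain why a pattern with a literal "\n" matches the pasted string but not the downloaded one.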

heinst · Accepted Answer · 2015-07-07 15:22:32Z

Please do not try to parse HTML with regex, because HTML is not a regular language. Instead, use an HTML parsing library like BeautifulSoup; it will make your life a lot easier. Here is an example with BeautifulSoup:

from bs4 import BeautifulSoup

html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''

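# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
# Find the 65%-wide cell, then read the value of the input that follows it
print(soup.find('td', attrs={'width': '65%'}).find_next('input')['value'])

Or more simply:

print(soup.find('input', attrs={'name': 'T1'})['value'])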
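
Since the question mentions that the HTML repeats, here is a rough sketch of how the same idea might collect every label/value pair in one pass. The enclosing table and the second row ("Amount", T2, 100) are made up for illustration; only the first row comes from the question.

```python
from bs4 import BeautifulSoup

# First row is from the question; the <table> wrapper and the second row
# are hypothetical, added only to show the repetition.
html = '''<table>
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
<tr>
<td width="35%">Amount</td>
<td width="65%"><input type="text" name="T2" size="12" onFocus="this.blur()" value="100"></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

pairs = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    # Skip rows that do not have the label cell / input cell layout.
    if len(cells) != 2:
        continue
    field = cells[1].find('input')
    if field is None:
        continue
    # Label text from the first cell, value attribute from the input.
    pairs.append((cells[0].get_text(strip=True), field.get('value')))

print(pairs)
```

Because the parser works on the tag structure, line endings and indentation between the cells should not matter here.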


1 Comment

moster67 Over a year ago

Thank you for this and your suggestions about not using regex with html. I will definitely have a look at this BeautifulSoup library.
