
I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.

<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>

I can get the first value by using

match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)

That pattern only covers a single line. I also need the second value, which is on the line after the first, but I cannot get it to work. I have tried the following, but it does not produce a match:

match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
                       '<td width="65.+?value="(.+?)"></td>').findall(html_source_det)

I suspected it fails because the text spans multiple lines, so I added "\n" at the end of the first line of the pattern, but that did not resolve it.

What am I doing wrong?

The html_source is retrieved by downloading it (it is not a static HTML file as outlined above; I only included it here so you could see the text). Maybe this is not the best way of getting the source.

I am obtaining the html_source like this:

new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
  • Why regex over something like BeautifulSoup? Commented Jul 7, 2015 at 15:16
  • take a look into ^ and $ for regular expressions (instead of using \n) Commented Jul 7, 2015 at 15:19
  • Your code works for me. Please provide the shortest possible complete program that demonstrates your error. See stackoverflow.com/help/mcve for more info. Commented Jul 7, 2015 at 15:21
  • Take a look at this post that suggests why you shouldn't use a regex. Commented Jul 7, 2015 at 15:21
  • Rob: indeed your code works! Maybe it works because the html_source is a static string. I posted the string so you could see it, but actually I get it by downloading it. I updated my question with the code showing how I get the html_source. Maybe there are some encoding issues or stray non-printable characters I need to get rid of... Commented Jul 7, 2015 at 16:16
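
One way to test the suspicion in the last comment is to print repr() of the downloaded text and to let the pattern accept any whitespace between the two cells. The sketch below assumes the downloaded page uses Windows-style "\r\n" line endings; the thread never confirms this, so the sample string is a hypothetical stand-in for the real download.

```python
import re

# Hypothetical sample: the rows from the question, but with "\r\n" line
# endings. That the real download looks like this is an assumption.
html_source_det = (
    '<tr>\r\n'
    '<td width="35%">Demand No</td>\r\n'
    '<td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>\r\n'
)

# Pattern from the question: requires exactly "\n" right after </td>.
strict = re.compile(r'<td width="35.+?">(.+?)</td>\n'
                    r'<td width="65.+?value="(.+?)"></td>')

# Tolerant variant: \s* accepts "\n", "\r\n", or indentation between the cells.
tolerant = re.compile(r'<td width="35.+?">(.+?)</td>\s*'
                      r'<td width="65.+?value="(.+?)"></td>')

strict_matches = strict.findall(html_source_det)
tolerant_matches = tolerant.findall(html_source_det)

# repr() makes "\r" and other hidden characters visible.
print(repr(html_source_det[:40]))
print(strict_matches)
print(tolerant_matches)
```

If repr() of the real download shows something other than "\r\n" between the cells (tabs, spaces, or extra markup), the pattern would need adjusting accordingly.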

1 Answer

Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:

from bs4 import BeautifulSoup

html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''

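# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', attrs={'width': '65%'}).find_next('input')['value'])

Or more simply:

print(soup.find('input', attrs={'name': 'T1'})['value'])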
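
Since the question mentions that the HTML repeats and that both the label and the value are needed, the same approach can be applied row by row. This is only a sketch: the second row ("Status" / "Open") and the surrounding <table> are invented for illustration, and it assumes every row of interest has a 35%-wide label cell and an <input> field.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first row is from the question, the second is invented.
html = '''<table>
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
<tr>
<td width="35%">Status</td>
<td width="65%"><input type="text" name="T2" size="12" onFocus="this.blur()" value="Open"></td>
</tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

pairs = []
for row in soup.find_all('tr'):
    label_cell = row.find('td', attrs={'width': '35%'})
    field = row.find('input')
    # Skip rows that do not have both a label cell and an input field.
    if label_cell is None or field is None:
        continue
    pairs.append((label_cell.get_text(strip=True), field.get('value')))

print(pairs)
```

Because the parser handles whitespace and line endings itself, this does not depend on how the downloaded page separates its lines.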

1 Comment

Thank you for this and your suggestions about not using regex with html. I will definitely have a look at this BeautifulSoup library.
