I'm looking to use regex to extract a quantity from a shopping website. In the following example, I want to get "12.5 kilograms". However, the quantity in the first span is not always listed in kilograms; it could be lbs., oz., etc.

        <td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

The code above is only a small portion of what is actually extracted using BeautifulSoup. Regardless of the page, the quantity is always inside a span, on the line immediately after

<td class="size-price last first" colspan="4">  

I've used regex in the past, but I am far from an expert. I'd like to know how to match text that spans multiple lines. In this case, between

<td class="size-price last first" colspan="4">

and

<span> <span class="strike">
  • See here: stackoverflow.com/a/1732454/763505 Commented Mar 25, 2014 at 3:37
  • This question appears to be off-topic because it is about parsing html with regex. Commented Mar 25, 2014 at 3:38

1 Answer

Avoid parsing HTML with regex. Use the right tool for the job: an HTML parser such as BeautifulSoup, which you are already using. It is easy to use and handles your case directly:

from bs4 import BeautifulSoup


data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""
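# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(data, "html.parser")

# The first span inside the first td holds the quantity.
print(soup.td.span.text)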

prints:

12.5 kilograms 

Or, if the td is a part of a bigger structure, find it by class and get the first span's text out of it:

print(soup.find('td', {'class': 'size-price'}).span.text)

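Update (handling multiple results): find_all returns a list-like ResultSet rather than a single tag, so iterate over it and read the first span of each td:

print([td.span.text for td in soup.find_all('td', {'class': 'size-price'})])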
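
For pages with several sizing options, the same idea can be sketched as a small helper that also trims the trailing whitespace and splits each quantity into an amount and a unit. This is only an illustration under assumptions: the function name extract_quantities is hypothetical, the second td (5 lbs., $99.00) is made-up sample data, and the split assumes the amount and unit are separated by a single space, which may not hold on every page.

```python
from bs4 import BeautifulSoup

# Sample markup: the first td is from the question, the second is
# invented here purely to demonstrate handling more than one result.
html = """
<table><tr>
<td class="size-price last first" colspan="4">
    <span>12.5 kilograms </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span></span>
</td>
<td class="size-price last first" colspan="4">
    <span>5 lbs. </span>
    <span> <span class="price">$99.00</span></span>
</td>
</tr></table>
"""


def extract_quantities(markup):
    """Return (amount, unit) pairs, one per td with class size-price."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # find_all returns a ResultSet, so handle each td individually.
    for td in soup.find_all("td", class_="size-price"):
        span = td.find("span")  # first span inside this td
        if span is None:
            continue  # skip cells that have no span at all
        text = span.get_text(strip=True)  # e.g. "12.5 kilograms"
        # Assumption: amount and unit are separated by the first space.
        amount, _, unit = text.partition(" ")
        results.append((amount, unit))
    return results


quantities = extract_quantities(html)
print(quantities)
```

The amounts are left as strings; converting them with float() is straightforward if every listing is numeric, but that is not guaranteed by the markup shown in the question.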

Hope that helps.

1 Comment

Thanks. Some pages contain more than one sizing option, resulting in multiple <td class="size-price last first" colspan="4"> ... With your code, I can print out the first sizing option appearing on the page. However, if I use print soup.find_all('td', {'class': 'size-price'}).span.text I get: AttributeError: 'ResultSet' object has no attribute 'span'
