Using REGEX to match elements between lines in Python

Question

I want to use regex to extract a quantity from a shopping website. In the following example, I want to get "12.5 kilograms". However, the quantity in the first span is not always listed in kilograms; it could be lbs., oz., etc.

        <td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

The code above is only a small portion of what is actually extracted using BeautifulSoup. On every page, the quantity is always inside a span, on the line following

<td class="size-price last first" colspan="4">

I've used REGEX in the past but I am far from an expert. I'd like to know how to match elements between different lines. In this case between

<td class="size-price last first" colspan="4">

and

<span> <span class="strike">

This question appears to be off-topic because it is about parsing HTML with regex. – Hyperboreus, commented Mar 25, 2014 at 3:38

Community · Accepted Answer · 2017-05-23 10:26:00Z

Avoid parsing HTML with regex. Use the right tool for the job: an HTML parser such as BeautifulSoup. It is powerful, easy to use, and handles your case well:

from bs4 import BeautifulSoup


data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly

print(soup.td.span.text)

prints:

12.5 kilograms
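
The extracted text keeps the whitespace from the markup (note the trailing space inside the span), and the unit varies from page to page. Here is a minimal sketch of cleaning the text up and separating the number from the unit, assuming it always has the form "number, then unit". The helper name `split_quantity` is made up for illustration, and the regex is applied to the already-extracted text, not to the HTML:

```python
import re

from bs4 import BeautifulSoup

data = """
<td class="size-price last first" colspan="4">
    <span>12.5 kilograms </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
    </span>
</td>"""


def split_quantity(text):
    """Split text such as '12.5 kilograms ' into (12.5, 'kilograms').

    Hypothetical helper: assumes a leading decimal number followed by a unit.
    Returns None when the text does not have that shape.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*(.+?)\s*$", text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


soup = BeautifulSoup(data, "html.parser")
raw = soup.td.span.text        # text of the first span, whitespace included
print(raw.strip())             # 12.5 kilograms
print(split_quantity(raw))     # (12.5, 'kilograms')
```

Returning `None` for unexpected text is just one possible choice; raising an exception would work equally well if malformed quantities should not pass silently.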

Or, if the td is a part of a bigger structure, find it by class and get the first span's text out of it:

print(soup.find('td', {'class': 'size-price'}).span.text)

UPD (handling multiple results):

print([td.span.text for td in soup.find_all('td', {'class': 'size-price'})])
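
`find_all` returns a `ResultSet` (a list of tags), so `.span` has to be read from each element rather than from the result itself. Here is a slightly more defensive sketch, assuming some matching cells might have no `span` at all. The three-row sample markup is invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample: two sizing options plus one cell without a span.
html = """
<table>
  <tr><td class="size-price last first" colspan="4">
    <span>12.5 kilograms </span>
    <span> <span class="price">$523.91</span> </span>
  </td></tr>
  <tr><td class="size-price last first" colspan="4">
    <span>5 lbs. </span>
    <span> <span class="price">$99.00</span> </span>
  </td></tr>
  <tr><td class="size-price last first" colspan="4">n/a</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

quantities = []
for td in soup.find_all("td", class_="size-price"):
    span = td.find("span")     # first span inside this cell, or None
    if span is not None:
        quantities.append(span.get_text(strip=True))

print(quantities)              # ['12.5 kilograms', '5 lbs.']
```

Checking for `None` avoids an `AttributeError` on cells that match the class but carry no quantity.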

Hope that helps.
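
As for the literal question: in Python's `re` module, `\s` already matches newlines, so a pattern can step across line breaks without any special flag (`re.DOTALL` is only needed when `.` itself must match a newline). Below is a rough sketch against the exact snippet from the question. It assumes the quantity span is a bare `<span>` directly after the `td`, and it will break as soon as the markup changes, which is why the parser is still the better choice:

```python
import re

data = """
<td class="size-price last first" colspan="4">
                    <span>12.5 kilograms </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""

pattern = re.compile(
    r'<td class="size-price[^"]*"[^>]*>\s*'  # opening td, then whitespace/newlines
    r'<span>\s*([^<]+?)\s*</span>'           # text of the first span, trimmed
)

matches = pattern.findall(data)
print(matches)                 # ['12.5 kilograms']
```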

edited May 23, 2017 at 10:26

answered Mar 25, 2014 at 3:40

alecxe

1 Comment

LaGuille Over a year ago

Thanks. Some pages contain more than one sizing option, resulting in multiple <td class="size-price last first" colspan="4"> ... With your code, I can print out the first sizing option appearing on the page. However, if I use print soup.find_all('td', {'class': 'size-price'}).span.text I get: AttributeError: 'ResultSet' object has no attribute 'span'
