Python list processing to extract substrings

Question

I parsed an HTML page with BeautifulSoup, extracting all div elements with specific class names into a list.

I now have to clean out HTML strings from this list, leaving behind string tokens I need.

The list I start with looks like this:

[<div class="info-1">\nName1a    <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b    <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a    <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b    <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a    <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b    <span class="bold">Score3b</span>\n</div>]

The whitespace is deliberate. I need to reduce that list to:

[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]

What's an efficient way to parse out substrings like this?

I've tried using the split method (e.g. [item.split('<div class="info-1">\n',1) for item in string_list]), but splitting just results in a substring that requires further splitting (hence inefficient). Likewise for using replace.

I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreciate your help.

宏杰李 · Accepted Answer · 2017-02-11 08:59:44Z


Do not convert the BeautifulSoup objects to strings unless you really need to.
Use a CSS selector to find the divs whose class starts with info.
Use stripped_strings to get all the non-empty strings under a tag, with surrounding whitespace removed.
Use tuple() to convert that iterable to a tuple.

import bs4

html = '''<div class="info-1">\nName1a    <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b    <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a    <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b    <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a    <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b    <span class="bold">Score3b</span>\n</div>'''

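# 'lxml' requires the lxml package; the built-in 'html.parser' also works here
soup = bs4.BeautifulSoup(html, 'lxml')

# Select every div whose class attribute starts with "info"
for div in soup.select('div[class^="info"]'):
    # stripped_strings yields each non-empty text node with whitespace trimmed
    t = tuple(div.stripped_strings)
    print(t)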

Output:

('Name1a', 'Score1a')
('Name1b', 'Score1b')
('Name2a', 'Score2a')
('Name2b', 'Score2b')
('Name3a', 'Score3a')
('Name3b', 'Score3b')
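
The loop above prints each tuple, whereas the question asks for a single list of tuples. A minimal sketch of collecting them with a list comprehension follows. It assumes the built-in 'html.parser' backend instead of lxml, and the shortened sample markup (including the extra "other" div, added only to show that the selector skips it) is illustrative rather than taken from the original page.

```python
import bs4

# Shortened sample markup; the "other" div is a hypothetical extra
# element that the selector below should ignore.
html = (
    '<div class="info-1">\nName1a    <span class="bold">Score1a</span>\n</div>'
    '<div class="info-2">\nName1b    <span class="bold">Score1b</span>\n</div>'
    '<div class="other">\nignored <span class="bold">x</span>\n</div>'
)

# 'html.parser' ships with Python, so no extra parser package is needed
soup = bs4.BeautifulSoup(html, 'html.parser')

# Work on the Tag objects directly instead of converting them to strings
divs = soup.select('div[class^="info"]')

# One tuple of whitespace-stripped, non-empty strings per matching div
pairs = [tuple(div.stripped_strings) for div in divs]
print(pairs)
```

If the divs are already in a list from an earlier find_all call, the same comprehension should apply to that list without re-parsing.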


1 Comment

Hassan Baig Over a year ago

This is just great. Thanks a bunch :-)
