3

I parsed an HTML page via beautifulsoup, extracting all div elements with specific class names into a list.

I now have to clean out HTML strings from this list, leaving behind string tokens I need.

The list I start with looks like this:

[<div class="info-1">\nName1a    <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b    <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a    <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b    <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a    <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b    <span class="bold">Score3b</span>\n</div>]

The whitespaces are deliberate. I need to reduce that list to:

[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]

What's an efficient way to parse out substrings like this?


I've tried using the split method (e.g. [item.split('<div class="info-1">\n',1) for item in string_list]), but splitting just results in a substring that requires further splitting (hence inefficient). Likewise for using replace.

I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreicate your help.

1 Answer 1

1
  1. Do not convert BS object to string unless you really need to do that.
  2. Use CSS selector to find the class that starts with info
  3. Use stripped_strings to get all the non-empty strings under a tag
  4. Use tuple() to convert an iterable to tuple object

import bs4

html = '''<div class="info-1">\nName1a    <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b    <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a    <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b    <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a    <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b    <span class="bold">Score3b</span>\n</div>'''

soup = bs4.BeautifulSoup(html, 'lxml')

for div in soup.select('div[class^="info"]'):
    t = tuple(text for text in div.stripped_strings)
    print(t)

out:

('Name1a', 'Score1a')
('Name1b', 'Score1b')
('Name2a', 'Score2a')
('Name2b', 'Score2b')
('Name3a', 'Score3a')
('Name3b', 'Score3b')
Sign up to request clarification or add additional context in comments.

1 Comment

This is just great. Thanks a bunch :-)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.