I parsed an HTML page via beautifulsoup, extracting all div elements with specific class names into a list.
I now have to clean out HTML strings from this list, leaving behind string tokens I need.
The list I start with looks like this:
[<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>]
The whitespaces are deliberate. I need to reduce that list to:
[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]
What's an efficient way to parse out substrings like this?
I've tried using the split method (e.g. [item.split('<div class="info-1">\n',1) for item in string_list]), but splitting just results in a substring that requires further splitting (hence inefficient). Likewise for using replace.
I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreicate your help.