regex for loop over list in python

Question

I have this list

[<th align="left">
 <a href="blablabla">F</a>ojweousa</th>,
 <th align="left">
 <a href="blablabla">S</a>awdefrgt</th>, ...]

and want

the one single character after ">
the multiple characters between </a> and </th>,

to be concatenated so that I can move on with my life.

Here is my code

item2 = []
for element in items2:
    first_letter = re.search('">.</a', str(items2))
    second_letter = re.search(r'</a>[a-zA-Z0-9]</th>,', str(items2))
    item2.append([str(first_letter) + str(second_letter)])

I know I should do something like item2.group or item2.join, but if I do, the output gets even messier. Here is the output with the current code:

[['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
 ['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
 ...]

I would like the output to look like this so that I can use it in pd.DataFrame:

[Fojweousa, Sawdefrgt, ...]

It is a list, that is why I can't use html bs4 or select methods.

"It is a list, that is why i cant use html bs4 or select methods." - Where did this list come from? Was bs4 used to create it? — Tomalak
– Tomalak, Commented Feb 11, 2021 at 9:46
Try item2 = [re.sub(r'<[^>]*>', '', x).strip() for x in items2]. But using BeautifulSoup would be the best solution, where you may strip tags like this. — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Feb 11, 2021 at 9:47
@Tomalak Yes. @Wiktor That gives TypeError: expected string or bytes-like object. Here is the bs4 call: items2 = table.find_all('th', attrs={'align': 'left'})[1:]. I cannot combine two bs4 methods like get_text() and find_all(). Every time I call find_all() I get a list and afterwards need to rely on regex, which is annoying. — id345678
– id345678, Commented Feb 11, 2021 at 9:53
Do it like result = [x.get_text() for x in table.find_all('th', attrs={'align': 'left'})[1:]] — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Feb 11, 2021 at 9:55
@id345678 If you have used bs4 to create this list, you can use bs4 to extract the text you want. You don't ever want to use regex to parse HTML. — Tomalak
– Tomalak, Commented Feb 11, 2021 at 9:55
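
For reference, the TypeError mentioned in the comments most likely arises because find_all returns bs4 Tag objects, while the re functions only accept str or bytes. The question's loop also searches str(items2) (the whole list) on every iteration and prints the Match object itself instead of calling .group(). The following is only a sketch of how the regex route could be repaired, assuming each element is first converted with str(); the plain strings in items2 are stand-ins for that markup, and the capture pattern is an assumption, not part of the original post. As the commenters note, regex remains fragile for HTML.

```python
import re

# Stand-ins for str(element): the markup of each <th> from the question.
items2 = [
    '<th align="left">\n <a href="blablabla">F</a>ojweousa</th>',
    '<th align="left">\n <a href="blablabla">S</a>awdefrgt</th>',
]

# Approach from the comments: strip every tag, then trim whitespace.
# str(x) avoids the TypeError when x is a bs4 Tag rather than a string.
stripped = [re.sub(r'<[^>]*>', '', str(x)).strip() for x in items2]

# Approach from the question, repaired: search each element (not the whole
# list), capture the wanted parts, and read them with .group().
# The pattern below is an illustrative assumption.
pattern = re.compile(r'">(.)</a>([^<]*)</th>')
item2 = []
for element in items2:
    match = pattern.search(str(element))
    if match:  # re.search returns None when nothing matches
        item2.append(match.group(1) + match.group(2))

print(stripped)
print(item2)
```

Both lists should hold the concatenated text of each cell. The second pattern in the question, [a-zA-Z0-9] with no quantifier, matches exactly one character, which would explain the None in the posted output.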

Wiktor Stribiżew · Accepted Answer · 2021-02-11 10:03:01Z

You can use BeautifulSoup's get_text() to get the plain text of each element you found with find_all, and strip() to get rid of leading and trailing whitespace:

items2 = table.find_all('th', attrs={'align': 'left'})[1:]
result = [x.get_text().strip() for x in items2]

Here, .find_all('th', attrs={'align': 'left'}) finds all th elements with attribute align equal to left, and [1:] skips the first occurrence.

Next, [x.get_text().strip() for x in items2] is a list comprehension that iterates over the found items (x is each single element of items2), gets the plain text of each element with x.get_text(), and removes leading/trailing whitespace with strip().
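
To try this end to end, here is a minimal self-contained sketch. The HTML string is a hypothetical sample modelled on the question (including an extra first header cell that [1:] skips), and the html.parser backend is an assumption; the real page and parser may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the elements shown in the question.
html = """
<table>
  <tr>
    <th align="left">Header</th>
    <th align="left">
 <a href="blablabla">F</a>ojweousa</th>
    <th align="left">
 <a href="blablabla">S</a>awdefrgt</th>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# All <th align="left"> cells, skipping the first one.
items2 = table.find_all("th", attrs={"align": "left"})[1:]

# Plain text of each cell, with surrounding whitespace removed.
result = [x.get_text().strip() for x in items2]
print(result)
```

Here result is a plain list of strings, so it can be handed on to whatever comes next, such as building a DataFrame column.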

