I have this list

[<th align="left">
 <a href="blablabla">F</a>ojweousa</th>,
 <th align="left">
 <a href="blablabla">S</a>awdefrgt</th>, ...]

and want

  1. the single character after ">

  2. the multiple characters between </a> and </th>,

to be concatenated so that I can move on with my life.

Here is my code

item2 = []
for element in items2:
    first_letter = re.search('">.</a', str(items2))
    second_letter = re.search(r'</a>[a-zA-Z0-9]</th>,', str(items2))
    item2.append([str(first_letter) + str(second_letter)])

I know I should do something like item2.group or item2.join, but if I do, the output gets even messier. Here is the output with the current code:

[['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
 ['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
 ...]]

I would like the output to look like this so that I can use it in pd.DataFrame:

[Fojweousa, Sawdefrgt, ...]

It is a list, which is why I can't use the bs4 HTML or select methods.

  • "It is a list, that is why i cant use html bs4 or select methods." - Where did this list come from? Was bs4 used to create it? Commented Feb 11, 2021 at 9:46
  • Try item2 = [re.sub(r'<[^>]*>', '', x).strip() for x in items2]. But using BeautifulSoup would be the best solution, where you may strip tags like this. Commented Feb 11, 2021 at 9:47
  • @Tomalak Yes. @Wiktor TypeError: expected string or bytes-like object. Here is the bs4 call: items2 = table.find_all('th', attrs={'align': 'left'})[1:]. I cannot combine two bs4 methods like get_text() and find_all(). Every time I do one find_all() I get lists and afterwards need to rely on regex, which is annoying. Commented Feb 11, 2021 at 9:53
  • Do it like this: result = [x.get_text() for x in table.find_all('th', attrs={'align': 'left'})[1:]] Commented Feb 11, 2021 at 9:55
  • @id345678 If you have used bs4 to create this list, you can use bs4 to extract the text you want. You don't ever want to use regex to parse HTML. Commented Feb 11, 2021 at 9:55
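
The TypeError mentioned in the comments most likely comes from passing bs4 Tag objects to re.sub, which only accepts strings or bytes. As a rough sketch, assuming each element is first converted with str(), the tag-stripping regex from the comment would behave as below. The list items2_as_text is a hypothetical stand-in for [str(x) for x in items2], and this is for illustration only, since the comments recommend bs4 over regex for HTML.

```python
import re

# Hypothetical stand-ins for str(tag) of each element in items2;
# the real items are bs4 Tag objects, which re.sub cannot take directly.
items2_as_text = [
    '<th align="left">\n <a href="blablabla">F</a>ojweousa</th>',
    '<th align="left">\n <a href="blablabla">S</a>awdefrgt</th>',
]

# Remove every <...> tag, then trim the leftover whitespace.
result = [re.sub(r'<[^>]*>', '', s).strip() for s in items2_as_text]
print(result)
```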

1 Answer

You can use BeautifulSoup's get_text() to get the plain text of each element found with find_all, and strip() to remove leading and trailing whitespace:

items2 = table.find_all('th', attrs={'align': 'left'})[1:]
result = [x.get_text().strip() for x in items2]

Here, .find_all('th', attrs={'align': 'left'}) finds all th elements with attribute align equal to left, and [1:] skips the first occurrence.

Next, [x.get_text().strip() for x in items2] is a list comprehension that iterates over the found items (x is each element of items2). For each x, x.get_text() returns its plain text and strip() removes the leading and trailing whitespace.
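
To try this without the original page, here is a minimal self-contained sketch. The HTML string is made up to resemble the elements in the question (including an extra first th that [1:] skips), and the html.parser backend is an assumption, since the question does not say which parser was used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question; the real page is not shown.
html = """
<table>
  <tr>
    <th align="left"><a href="blablabla">H</a>eader</th>
    <th align="left">
 <a href="blablabla">F</a>ojweousa</th>
    <th align="left">
 <a href="blablabla">S</a>awdefrgt</th>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Same call as above: all left-aligned th elements, skipping the first one.
items2 = table.find_all("th", attrs={"align": "left"})[1:]

# get_text() joins the text inside <a> with the text that follows it.
result = [x.get_text().strip() for x in items2]
print(result)
```

The resulting list of plain strings can then be passed on to pandas as needed.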
