How to remove any html tags within a specific pattern in beautifulsoup

Question

<p>
 A 
 <span>die</span> 
  is thrown \(x = {-b \pm 
  <span>\sqrt</span>
  {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p>

In the HTML above, I need to remove only the tags that fall within "\(tags\)", i.e. inside \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\). I have just started with BeautifulSoup. Is there any way to achieve this with BeautifulSoup?

waranlogesh · Accepted Answer · 2017-02-08 05:22:40Z

I came up with a solution to my own question; I hope it helps others. Feel free to suggest improvements to the code.

from bs4 import BeautifulSoup
import re
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
mathml_start_regex = re.compile(r'\\\(')
mathml_end_regex = re.compile(r'\\\)')

for p_tag in soup.find_all('p'):
    match = 0  # Flag: set to 1 once '\(' is found, reset to 0 once '\)' is found.
    for p_child in p_tag.children:
        try:  # Tag: search its text for '\('
            if mathml_start_regex.search(p_child.text):
                match += 1
        except AttributeError:  # NavigableString without .text: search the string itself
            if mathml_start_regex.search(p_child):
                match += 1
        try:  # Inside the math span: replace the Tag with its text
            if match == 1:
                p_child.replace_with(p_child.text)
        except AttributeError:  # NavigableString: plain text, nothing to replace
            pass
        try:  # Tag: search its text for '\)'
            if mathml_end_regex.search(p_child.text):
                match = 0
        except AttributeError:  # NavigableString without .text: search the string itself
            if mathml_end_regex.search(p_child):
                match = 0

print(soup)

Output:

<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      \sqrt
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>
<p> Test
</p>

In the code above, soup.find_all('p') returns a bs4.element.ResultSet of all 'p' tags. The outer for loop iterates over that result set to get each 'p' tag, and the inner for loop uses the .children generator to iterate over that tag's children (both NavigableStrings and Tags). Each child is searched for '\('; if it is found, match is set to 1. While match is 1, each child Tag is replaced by its text using replace_with. Finally, match is reset to 0 when '\)' is found.
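The same idea can be written without try/except by checking the child type explicitly. This is only a sketch under a few assumptions: the helper name flatten_tags_in_math is made up for illustration, the delimiters are assumed to appear in the text that sits directly under the 'p' tag (not inside a nested tag), and only direct children of 'p' are considered.

```python
from bs4 import BeautifulSoup, Tag

OPEN, CLOSE = r'\(', r'\)'  # the two-character math delimiters

def flatten_tags_in_math(soup):
    """Hypothetical helper: replace Tags between OPEN and CLOSE with their text."""
    for p_tag in soup.find_all('p'):
        in_math = False
        # Iterate over a copy, because replace_with() modifies p_tag.contents.
        for child in list(p_tag.children):
            if isinstance(child, Tag):
                if in_math:
                    child.replace_with(child.get_text())
                continue
            text = str(child)
            last_open, last_close = text.rfind(OPEN), text.rfind(CLOSE)
            if last_open != -1 or last_close != -1:
                # Still inside math only if the last delimiter seen is an opener.
                in_math = last_open > last_close
    return soup

html = r"""<p>A <span>die</span> is thrown \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>"""
soup = flatten_tags_in_math(BeautifulSoup(html, 'html.parser'))
print(soup)
```

Because the tree itself is modified, print(soup) and soup.prettify() both reflect the change: the span around "die" is kept and the one around \sqrt is gone.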

Mohammad Yusuf · Accepted Answer · 2017-02-04 13:03:10Z


Beautiful Soup alone can't extract a substring, but you can use a regex along with it.

from bs4 import BeautifulSoup
import re

html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>"""

soup = BeautifulSoup(html, 'html.parser')

print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))

Output:

[u'\\(x = {-b \\pm \n  \\sqrt\n  {b^2-4ac} \\over 2a}\\)']

Regex:

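\\\(.*?\) - Lazily matches from the literal \( up to the first ). With re.DOTALL, the dot also matches newlines. Here the first ) happens to be the one in \), so the whole expression is captured.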
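Note that this pattern stops at the first ), so it would cut the match short if the expression itself contained parentheses. A small sketch of the difference, using a made-up sample string and a stricter pattern that requires the closing \) (neither is from the original answer):

```python
import re

# Hypothetical sample text with parentheses inside the first math expression.
text = r'\(f(x) = (a+b)\) and \(y\)'

# Pattern from the answer: ends at the first ')'.
loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)

# Stricter variant: ends only at a literal '\)'.
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)

print(loose)
print(strict)
```

The first list contains a truncated match for the first expression, while the second contains both expressions in full.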

If you want to collapse the newlines and extra whitespace, you can do it like so:

res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)

HTML wrappers around the string:

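# Parser named explicitly; 'lxml' (a separate package) is what adds the <html><body><p> wrappers.
print(BeautifulSoup(' '.join(res.split()), 'lxml'))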

Output:

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>
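The findall calls above only read soup.text; they do not change the parsed tree, so printing the soup afterwards still shows the original markup. One way to make the change persist, sketched here under assumptions, is to apply the regex to the markup string before parsing it. The function name strip_tags_in_math and the two patterns are illustrative, and the tag pattern is naive: it would also remove anything that looks like <...> inside the math, such as an expression containing both < and >.

```python
import re
from bs4 import BeautifulSoup

MATH_SPAN = re.compile(r'\\\(.*?\\\)', re.DOTALL)  # from a literal \( to the next \)
ANY_TAG = re.compile(r'<[^>]+>')                   # naive "anything in angle brackets"

def strip_tags_in_math(markup):
    # Hypothetical helper: delete tags only inside each math span.
    return MATH_SPAN.sub(lambda m: ANY_TAG.sub('', m.group(0)), markup)

html = r"""<p>A <span>die</span> is thrown \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>"""

cleaned = strip_tags_in_math(html)
soup = BeautifulSoup(cleaned, 'html.parser')
print(soup.prettify())
```

Since the soup is built from the cleaned string, soup.prettify() shows the math expression without the inner span, while the span around "die" is untouched.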

edited Feb 4, 2017 at 13:03

answered Feb 4, 2017 at 12:21


7 Comments

waranlogesh Over a year ago

Hi I expected the output to be [u'\(x = {-b \\pm \n \\sqrt\n {b^2-4ac} \\over 2a}\)']. Can you suggest changes in regex?

Mohammad Yusuf Over a year ago

@waranlogesh Sure. Include the backslash also before (. Modified the solution.

waranlogesh Over a year ago

Is there a way to save the printed changes back to the HTML?

Mohammad Yusuf Over a year ago

@waranlogesh What do you mean? The html is already giving this output:

A die is thrown \(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?

waranlogesh Over a year ago

I mean that when I call print(soup.prettify()), I get the original HTML without the changes.
