<p>
 A 
 <span>die</span> 
  is thrown \(x = {-b \pm 
  <span>\sqrt</span>
  {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
both the throws?
</p>

In the HTML above, I need to remove only the tags that appear inside the "\(...\)" delimiters, i.e. within \(x = {-b \pm <span>\sqrt</span> {b^2-4ac} \over 2a}\). I have just started with BeautifulSoup; is there any way to achieve this with it?

2 Answers

2

I came up with a solution to my own question; I hope it helps others. Feel free to suggest improvements to the code.

from bs4 import BeautifulSoup, Tag

# Raw string, so the LaTeX backslashes are kept literally.
html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>"""

soup = BeautifulSoup(html, 'html.parser')
MATH_START = r'\('
MATH_END = r'\)'

for p_tag in soup.find_all('p'):
    match = 0  # Flag: set to 1 once '\(' is found, reset to 0 once '\)' is found.
    # Iterate over a copy, because replace_with() modifies the tag's children.
    for p_child in list(p_tag.children):
        # A child is either a Tag or a NavigableString (a str subclass).
        text = p_child.get_text() if isinstance(p_child, Tag) else str(p_child)
        if MATH_START in text:
            match = 1
        # Inside the math delimiters, replace a Tag with its text.
        if match == 1 and isinstance(p_child, Tag):
            p_child.replace_with(text)
        if MATH_END in text:
            match = 0

print(soup)

Output:

<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      \sqrt
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p> <p> Test </p>

The code above searches for all 'p' tags, which returns a bs4.element.ResultSet. The outer for loop iterates over that result set to get the individual 'p' tags, and the inner for loop uses the .children generator to iterate over each 'p' tag's children (both NavigableStrings and Tags). Each child is searched for '\('; if it is found, match is set to 1. While match is 1, any child that is a Tag is replaced by its text using replace_with. Finally, match is set back to 0 when '\)' is found.
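One edge case the flag logic above does not cover: if a single text node contains both the '\)' that closes one formula and the '\(' that opens the next, the flag ends up reset and the second formula is skipped. The following is a minimal sketch of one way to handle that, by letting whichever delimiter occurs last in a child decide the state. The function name strip_tags_in_math and the sample string are made up for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag

MATH_START, MATH_END = r"\(", r"\)"


def strip_tags_in_math(html):
    """Return html with tags between the math delimiters replaced by their text."""
    soup = BeautifulSoup(html, "html.parser")
    for p_tag in soup.find_all("p"):
        inside = False  # True while we are between MATH_START and MATH_END
        for child in list(p_tag.children):  # copy, since the loop edits the tree
            text = child.get_text() if isinstance(child, Tag) else str(child)
            if MATH_START in text:
                inside = True
            if inside and isinstance(child, Tag):
                child.replace_with(text)
            # Whichever delimiter occurs last in this child decides the state.
            if text.rfind(MATH_END) > text.rfind(MATH_START):
                inside = False
    return str(soup)


# Hypothetical sample: two formulas in one paragraph, tags inside and outside.
sample = r"<p><b>Solve</b> \(a <i>+</i> b\) and \(c <i>-</i> d\), <i>then</i> stop.</p>"
print(strip_tags_in_math(sample))
```

Here the text node between the two formulas contains '\)' followed by '\(', so the flag stays set and the second formula is cleaned as well, while the tags outside the delimiters are left alone.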

0

Beautiful Soup alone can't extract a substring, but you can use a regex along with it.

from bs4 import BeautifulSoup
import re

html = r"""<p>
     A 
     <span>die</span> 
      is thrown \(x = {-b \pm 
      <span>\sqrt</span>
      {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from
    both the throws?
    </p>"""

soup = BeautifulSoup(html, 'html.parser')

print(re.findall(r'\\\(.*?\)', soup.text, re.DOTALL))

Output:

[u'\\(x = {-b \\pm \n  \\sqrt\n  {b^2-4ac} \\over 2a}\\)']

Regex:

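\\\(.*?\) - Lazily matches from a literal \( up to the next ). The re.DOTALL flag lets . match newlines as well, so the formula can span several lines.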
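Because the pattern ends at the next ')', a formula that itself contains parentheses would be cut short. As a small sketch, here is a comparison with a stricter variant that requires the closing '\)'. The stricter pattern and the sample text are not from the answer above; they are only an illustration.

```python
import re

# Hypothetical sample with parentheses inside the formula.
text = r"Roots: \(f(x) = (x - 1)(x + 2)\) and done."

loose = re.findall(r'\\\(.*?\)', text, re.DOTALL)      # stops at the first ')'
strict = re.findall(r'\\\(.*?\\\)', text, re.DOTALL)   # stops at the first '\)'

print(loose)   # ['\\(f(x)']
print(strict)  # ['\\(f(x) = (x - 1)(x + 2)\\)']
```

For the formula in the question both patterns give the same result, since it contains no ')' before the closing '\)'.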

If you want to strip the newlines and extra whitespace, you can do it like so:

res = re.findall(r'\\\(.*?\)', soup.text, re.DOTALL)[0]
print(' '.join(res.split()))

Output:

\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)

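To wrap the string in HTML, parse it again (the html, body and p wrappers in the output are added by the lxml parser, which must be installed):

print(BeautifulSoup(' '.join(res.split()), 'lxml'))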

Output:

<html><body><p>\(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\)</p></body></html>
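Note that the snippets above only read from soup.text; they do not change soup itself, so printing soup afterwards still shows the original markup. If the cleaned formula should end up in the parsed document, one option is to strip the tags inside each formula in the HTML source before parsing it. This is a rough sketch under stated assumptions: the name strip_tags_in_formulas and both patterns are illustrative, and the tag pattern is a heuristic that could misfire on a literal '<' followed by a letter inside a formula.

```python
import re
from bs4 import BeautifulSoup

MATH_RE = re.compile(r'\\\(.*?\\\)', re.DOTALL)  # one \( ... \) formula
TAG_RE = re.compile(r'</?[A-Za-z][^>]*>')         # a simple opening or closing tag


def strip_tags_in_formulas(html):
    """Remove tag markup inside each formula, leaving the rest untouched."""
    return MATH_RE.sub(lambda m: TAG_RE.sub('', m.group(0)), html)


html = r"""<p>A <span>die</span> is thrown \(x = {-b \pm
<span>\sqrt</span> {b^2-4ac} \over 2a}\) twice.</p>"""

# Parse the cleaned source, so the tree already reflects the change.
soup = BeautifulSoup(strip_tags_in_formulas(html), 'html.parser')
print(soup)
```

The span around "die" survives because it lies outside the delimiters, and soup.prettify() then reflects the cleaned formula.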

7 Comments

Hi, I expected the output to be [u'\(x = {-b \\pm \n \\sqrt\n {b^2-4ac} \\over 2a}\)']. Can you suggest changes to the regex?
@waranlogesh Sure. Include the backslash before ( as well. I have modified the solution.
Is there a way to save the printed changes back to the HTML?
@waranlogesh What do you mean? The HTML already gives this output: A die is thrown \(x = {-b \pm \sqrt {b^2-4ac} \over 2a}\) twice. What is the probability of getting a sum 7 from both the throws?
I mean that when I call print(soup.prettify()), I get the original HTML without the changes.