How can I remove text within parentheses with a regex in Python?

Question

I want to remove the text within parentheses, but it is not working.

How can I solve my problem?

import re

def clean_text(text):
    pattern = '([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)' 
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = '(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = '([ㄱ-ㅎㅏ-ㅣ]+)'  
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = '<[^>]*>'        
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = '[^\w\s]'        
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = '\([^)]*\)'  ## not working!!
    text = re.sub(pattern=pattern, repl='', string=text)
    return text   

text = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. [email protected]ㅋㅋ<H1>thank you</H1>'
clean_text(text)

The result is abc_def 좋은글 이것도 지워조 감사합니다 thank you

My goal is 좋은글 감사합니다 thank you

Your question and the expected value don't really match. How do you want the text to be cleaned up? Please update your "goal".
– abdusco, Commented Jul 22, 2019 at 11:58
Swap the last two re.sub calls. First, use text = re.sub(pattern=r'\([^)]*\)', repl='', string=text) and then the '[^\w\s]' regex replacement.
– Wiktor Stribiżew, Commented Jul 22, 2019 at 12:09

user3483203 · Accepted Answer · 2019-07-22 12:30:24Z

Your [^\w\s] re.sub call removes the parentheses before the last regex runs, so the last regex has nothing left to match. Swap the last two re.sub calls:

import re

def clean_text(text):
    # Remove e-mail addresses
    pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove URLs
    pattern = r'(?:http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove standalone Hangul jamo such as ㅋㅋ
    pattern = r'[ㄱ-ㅎㅏ-ㅣ]+'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove HTML tags
    pattern = r'<[^>]*>'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove parenthesized text, plus any whitespace before it
    pattern = r'\s*\([^)]*\)'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove remaining punctuation (must run after the parentheses step)
    pattern = r'[^\w\s]'
    text = re.sub(pattern=pattern, repl='', string=text)
    return text.strip()

text = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. [email protected]ㅋㅋ<H1>thank you</H1>' 
print(clean_text(text))


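I suggest using raw string literals (note the r'' prefixes) so that backslashes reach the regex engine unchanged, and stripping the leading and trailing whitespace with text.strip(). The \s* in r'\s*\([^)]*\)' matches zero or more whitespace characters before the opening parenthesis, so they are removed together with the parenthesized text.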
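To see why the order of the two substitutions matters, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not taken from the question.

```python
import re

PARENS = r'\s*\([^)]*\)'  # optional leading whitespace, then a (...) group
PUNCT = r'[^\w\s]'        # any character that is neither a word character nor whitespace

sample = 'good (remove me) text!'  # hypothetical input

# Punctuation first: the parentheses are stripped, so PARENS finds nothing to remove.
wrong = re.sub(PARENS, '', re.sub(PUNCT, '', sample))

# Parentheses first: the whole group is removed, then the remaining punctuation.
right = re.sub(PUNCT, '', re.sub(PARENS, '', sample))

# Without the leading \s*, the space before '(' is left behind.
no_ws = re.sub(r'\([^)]*\)', '', sample)

print(repr(wrong))  # 'good remove me text'
print(repr(right))  # 'good text'
print(repr(no_ws))  # 'good  text!'
```

The last line shows the doubled space that the \s* prefix is meant to avoid.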

Viktoriia Kachanovska · Accepted Answer · 2019-07-22 12:14:46Z

Try this:

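    import re

    def clean_text(text):
        # Remove e-mail addresses
        pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove URLs
        pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove standalone Hangul jamo such as ㅋㅋ
        pattern = r'([ㄱ-ㅎㅏ-ㅣ]+)'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove HTML tags
        pattern = r'<[^>]*>'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove parenthesized text together with the whitespace character after it
        pattern = r'\([^)]*\)\s'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove remaining punctuation (the + inside the class keeps literal plus signs)
        pattern = r'[^\w\s+]'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Collapse runs of two or more whitespace characters into a single space
        pattern = r'\s{2,}'
        text = re.sub(pattern=pattern, repl=' ', string=text)
        return text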

The result will be exactly: 좋은글 감사합니다 thank you
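One edge case may be worth checking with this variant: the pattern \([^)]*\)\s requires a whitespace character after the closing parenthesis, so a parenthesized group at the very end of the string is left in place. The sketch below uses a made-up sample string and illustrative variable names to compare it with a pattern that consumes optional whitespace before the group instead.

```python
import re

trailing_ws = r'\([^)]*\)\s'  # needs a whitespace character after ')'
leading_ws = r'\s*\([^)]*\)'  # optional whitespace before '(' instead

sample = 'thank you (remove me)'  # hypothetical input ending in a (...) group

kept = re.sub(trailing_ws, '', sample)    # no whitespace follows ')', so nothing matches
removed = re.sub(leading_ws, '', sample)  # the group and the space before it are removed

print(repr(kept))     # 'thank you (remove me)'
print(repr(removed))  # 'thank you'
```

If inputs can end with a parenthesized group, the leading-whitespace form followed by text.strip() may be the safer choice.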
