
I followed an answer on Stack Overflow, but it is not working: the last pattern, which should remove text in parentheses, has no effect.

How can I solve this problem?

import re

def clean_text(text):
    # Remove e-mail addresses
    pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove URLs
    pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove standalone Korean consonants and vowels (e.g. ㅋㅋ)
    pattern = r'([ㄱ-ㅎㅏ-ㅣ]+)'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove HTML tags
    pattern = r'<[^>]*>'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove punctuation and other special characters
    pattern = r'[^\w\s]'
    text = re.sub(pattern=pattern, repl='', string=text)
    # Remove text in parentheses
    pattern = r'\([^)]*\)'  ## not working!!
    text = re.sub(pattern=pattern, repl='', string=text)
    return text

text = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. [email protected]ㅋㅋ<H1>thank you</H1>'
clean_text(text)

The result is: abc_def 좋은글 이것도 지워조 감사합니다 thank you

My goal is: 좋은글 감사합니다 thank you

  • Your question and the expected value don't really match. How do you want the text to be cleaned up? Please update your "goal". Commented Jul 22, 2019 at 11:58
  • Swap the last two re.subs. First, use text = re.sub(pattern=r'\([^)]*\)', repl='', string=text) and then the '[^\w\s]' regex replacement. Commented Jul 22, 2019 at 12:09
  • Thanks a lot. You are a genius! Commented Jul 22, 2019 at 12:16
  • I posted an answer below, it takes time on a mobile. Commented Jul 22, 2019 at 12:24

2 Answers


Your [^\w\s] re.sub removes the parentheses and thus the last regex does not match. You may swap the last two re.subs and use

import re
def clean_text(text):
    pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
    text = re.sub(pattern=pattern, repl='', string=text) 
    pattern = r'(?:http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+' 
    text = re.sub(pattern=pattern, repl='', string=text) 
    pattern = r'[ㄱ-ㅎㅏ-ㅣ]+' 
    text = re.sub(pattern=pattern, repl='', string=text) 
    pattern = r'<[^>]*>' 
    text = re.sub(pattern=pattern, repl='', string=text)  
    pattern = r'\s*\([^)]*\)' 
    text = re.sub(pattern=pattern, repl='', string=text)
    pattern = r'[^\w\s]' 
    text = re.sub(pattern=pattern, repl='', string=text)
    return text.strip()

text = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. [email protected]ㅋㅋ<H1>thank you</H1>' 
print(clean_text(text))

See the online Python demo.

I suggest using raw string literals (note the r'' prefixes) and stripping leading and trailing whitespace with text.strip(). The \s* in r'\s*\([^)]*\)' also removes any whitespace (zero or more characters) that comes before the opening parenthesis.
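To see why the order matters, here is a minimal sketch that runs only the two substitutions in question, in both orders. The sample string and the variable names wrong and right are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample text, chosen only to illustrate the ordering issue.
sample = "(abc_def) hello! (remove me) world."

# Question's order: punctuation is stripped first, so the parentheses
# are already gone when the parenthesis pattern runs and nothing matches.
wrong = re.sub(r'[^\w\s]', '', sample)
wrong = re.sub(r'\([^)]*\)', '', wrong)

# Swapped order: parenthesized text (and any whitespace before it) goes
# first, then the remaining punctuation is stripped.
right = re.sub(r'\s*\([^)]*\)', '', sample)
right = re.sub(r'[^\w\s]', '', right).strip()

print(repr(wrong))
print(repr(right))
```

In the first variant the words inside the parentheses survive because only the bracket characters themselves were deleted; in the second they are removed together with the brackets.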


Try this:

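    import re

    def clean_text(text):
        # Remove e-mail addresses
        pattern = r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove URLs
        pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove standalone Korean consonants and vowels
        pattern = r'([ㄱ-ㅎㅏ-ㅣ]+)'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove HTML tags
        pattern = r'<[^>]*>'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove text in parentheses plus one following whitespace character;
        # this now runs before the punctuation step, so the parentheses still exist
        pattern = r'\([^)]*\)\s'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Remove everything that is not a word character, whitespace, or '+'
        pattern = r'[^\w\s+]'
        text = re.sub(pattern=pattern, repl='', string=text)
        # Collapse runs of two or more whitespace characters into one space
        pattern = r'\s{2,}'
        text = re.sub(pattern=pattern, repl=' ', string=text)
        return text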

The result will be exactly: 좋은글 감사합니다 thank you
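Since the fix depends entirely on the order of the substitutions, one option is to keep the patterns in an ordered list so the sequence is visible at a glance. This is only a sketch, not something either answer proposes: the name CLEANUP_STEPS is hypothetical, and user@example.com is a placeholder address because the address in the original sample text is obscured.

```python
import re

# Ordered (pattern, replacement) pairs. The order is significant:
# parenthesized text must be removed before punctuation is stripped.
CLEANUP_STEPS = [
    (r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', ''),       # e-mail addresses
    (r'(?:http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', ''),  # URLs
    (r'[ㄱ-ㅎㅏ-ㅣ]+', ''),                                          # standalone Korean jamo
    (r'<[^>]*>', ''),                                              # HTML tags
    (r'\([^)]*\)', ''),                                            # text in parentheses
    (r'[^\w\s]', ''),                                              # remaining punctuation
    (r'\s{2,}', ' '),                                              # collapse repeated whitespace
]

def clean_text(text):
    """Apply each substitution in order, then trim the ends."""
    for pattern, repl in CLEANUP_STEPS:
        text = re.sub(pattern, repl, text)
    return text.strip()

# Placeholder e-mail address; the one in the original post is not recoverable.
sample = '(abc_def) 좋은글! (이것도 지워조) http://1234.com 감사합니다. user@example.comㅋㅋ<H1>thank you</H1>'
print(clean_text(sample))
```

Reordering or inserting a step then means moving one line in the list rather than rearranging pairs of assignments.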
