remove content between parentheses using python regex

Question

I have a text file containing:

{[a] abc (b(c)d)}

I want to remove the content between the brackets [] and the nested parentheses (()), so the output should be:

abc

I managed to remove the content between the parentheses, but not the content between []. I tried the code below:

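import re

with open('data.txt') as f:
    text = f.read()                     # avoid shadowing the built-in input()

line = text.replace("{", "")            # drop the opening brace
line = line.replace("}", "")            # drop the closing brace
output = re.sub(r'\(.*\)', "", line)    # greedy: removes from the first ( to the last )
print(output)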

The output is -

[a] abc

In my code I first strip the {} and then remove the content of (). I want to add \[.*\] to this line: output = re.sub(r'\(.*\)', "", line), but I could not find a way to do it. I am still learning Python, so please help.

Just a remark not directly related to this: Python regexes are not really good at processing balanced bracketed expressions...
– Serge Ballesta, Commented Apr 19, 2018 at 8:17
@Jan: I know about it, but AFAIK the standard library only contains the old re module and OP has an import re line...
– Serge Ballesta, Commented Apr 19, 2018 at 8:26
All your replacements could be shortened to re.sub(r'[{}]|\(.*\)|\[.*\]', "", line)
– revo, Commented Apr 19, 2018 at 8:48

Jan · Accepted Answer · 2018-04-19 08:45:26Z

4

IMO this is not as easy as it might first look. You would very likely need a balanced (recursive) approach, which can be achieved with the newer, third-party regex module:

import regex as re

string = "some lorem ipsum {[a] abc (b(c)d)} some other lorem ipsum {defg}"

rx_part = re.compile(r'{(.*?)}')
rx_nested_parentheses = re.compile(r'\((?:[^()]*|(?R))*\)')
rx_nested_brackets = re.compile(r'\[(?:[^\[\]]*|(?R))*\]')

for match in rx_part.finditer(string):
    part = rx_nested_brackets.sub('', 
        rx_nested_parentheses.sub('', 
            match.group(1))).strip()
    print(part)

Which would yield

abc
defg

The pattern is

\(         # opening parenthesis
(?:        # non-capturing group
    [^()]* # anything that is neither ( nor )
    |      # or
    (?R)   # recurse into the whole pattern
)*         # repeated zero or more times
\)         # closing parenthesis
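
If installing the regex module is not an option, the same balanced matching can be approximated without any regular expression by tracking the expected closing characters on a stack. This is a minimal sketch of a different technique, not the recursive pattern above; the function name strip_bracketed and its pairs parameter are made up for illustration. It assumes that an unmatched closer outside any group is kept and that an opener that is never closed swallows the rest of the string.

```python
def strip_bracketed(text, pairs=(("(", ")"), ("[", "]"))):
    """Drop every (...) and [...] group, nested ones included."""
    closer_for = dict(pairs)   # opener -> the closer it expects
    stack = []                 # closers we are still waiting for
    kept = []
    for ch in text:
        if ch in closer_for:
            stack.append(closer_for[ch])   # entering a group
        elif stack and ch == stack[-1]:
            stack.pop()                    # leaving the innermost group
        elif not stack:
            kept.append(ch)                # outside all groups: keep it
        # otherwise we are inside a group and the character is dropped
    return "".join(kept)


s = "{[a] abc (b(c)d)}"
cleaned = strip_bracketed(s).strip("{} ")  # braces and spaces trimmed separately
print(cleaned)
```

The braces are left to a plain str.strip call here because the question only needs them removed at the edges of the string.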

edited Apr 19, 2018 at 8:45

answered Apr 19, 2018 at 8:20

Jan


3 Comments

Serge Ballesta Over a year ago

Your answer is indeed correct and informative. I simply wonder whether it is required here. OP did not say what should be done with unbalanced expression. +1 anyway for the recursive regex example...

Jan Over a year ago

@SergeBallesta: Thanks, let's wait and see what OP really wants.

jan Over a year ago

@Jan Thank you :D

Wiktor Stribiżew · Accepted Answer · 2018-04-19 09:14:37Z

2

You may check if a string contains {, }, (<no_parentheses_here>) or [no_brackets_here] substrings and remove them while there is a match.

import re                                    # Use standard re
s='{[a] abc (b(c)d)}'
rx = re.compile(r'\([^()]*\)|\[[^][]*]|[{}]')
while rx.search(s):                          # While regex matches the string
    s = rx.sub('', s)                        # Remove the matches
print(s.strip())                             # Strip whitespace and show the result
# => abc


It also works with paired, nested (...) and [...], because each pass of the loop removes the innermost groups.

Pattern details

\([^()]*\) - (, then any 0+ chars other than ( and ), and then )
| - or
\[[^][]*] - [, then any 0+ chars other than [ and ], and then ]
| - or
[{}] - a character class matching { or }.
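
To see why the while loop handles nesting, it can help to record the string after each substitution pass. The sketch below reuses the pattern from this answer; the helper name strip_in_passes is made up for illustration.

```python
import re

rx = re.compile(r'\([^()]*\)|\[[^][]*]|[{}]')


def strip_in_passes(s):
    """Return the intermediate string after each substitution pass."""
    passes = []
    while rx.search(s):        # something innermost is still left
        s = rx.sub('', s)      # remove all innermost groups and braces
        passes.append(s)
    return passes


passes = strip_in_passes('{[a] abc (b(c)d)}')
for number, state in enumerate(passes, start=1):
    print(number, repr(state))
# 1 ' abc (bd)'
# 2 ' abc '
```

The first pass can only remove (c), since (b(c)d) contains parentheses of its own; once (c) is gone, the second pass sees (bd) and removes it.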

answered Apr 19, 2018 at 9:14

Wiktor Stribiżew


4 Comments

jan Over a year ago

Thanks for the explanation :) @Wiktor Stribiżew

Wiktor Stribiżew Over a year ago

@jahan You may also reduce whitespaces if you prepend the pattern with \s*: re.compile(r'\s*(?:\([^()]*\)|\[[^][]*]|[{}])')

jan Over a year ago

I have another question. In my text file the first line is [abcdef] and the second line is {[a] abc (b(c)d)}. When I use this regex it removes the content of the first line and leaves an empty line in its place. The output shows "1." on the first line and "2. abc" on the second. I used strip() and lstrip() but could not remove the blank that is created on the first line. How can I solve this problem? @WiktorStribiżew

Wiktor Stribiżew Over a year ago

@jahan Can you please create a code demo? I am not quite sure I understand what you mean. If you read the contents from a file into a string, that should not be a problem. If strip does not work, that may be not a whitespace at all, but some LTR or RTL marks, or other weird Unicode chars. Also, try re.sub(r'^\W+|\W+$', '', s)
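
If the blank really is just the emptied first line (an assumption, since no demo was posted), one possible approach is to clean the text line by line and skip the lines that end up empty. The helper name clean_lines is made up for illustration, and the sample string stands in for the file contents.

```python
import re

rx = re.compile(r'\([^()]*\)|\[[^][]*]|[{}]')


def clean_lines(text):
    """Clean each line separately and drop lines that become empty."""
    cleaned = []
    for line in text.splitlines():
        while rx.search(line):     # same loop as in the answer, per line
            line = rx.sub('', line)
        line = line.strip()
        if line:                   # skip lines with nothing left
            cleaned.append(line)
    return cleaned


sample = "[abcdef]\n{[a] abc (b(c)d)}\n"   # stands in for f.read()
result = clean_lines(sample)
print(result)  # ['abc']
```

Calling strip() on the whole file content only trims the two ends of the string, so it cannot remove an empty line that sits between other lines; filtering per line does.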

M.H Mighani · Accepted Answer · 2018-04-19 08:33:31Z

1

I tried this and got your desired output. I hope I understood you correctly:

import re

with open('aa.txt') as f:
    input = f.read()
    line = input.replace("{","")
    line = line.replace("}","")
    output = re.sub(r'\[.*\]', "", line)
    output = re.sub(r'\(.*\)', "", output)
    print(output)
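
One caveat with the greedy .* patterns: they match from the first opening character to the last closing one, so they only give the expected result when a line holds a single bracketed group of each kind. The sketch below compares them with negated character classes on a made-up input line; note that the negated classes alone do not handle nesting.

```python
import re

line = "[a] abc (b) def [c] ghi (d)"   # made-up line with several groups

# Greedy: \[.*\] swallows everything from the first [ to the last ]
greedy = re.sub(r'\(.*\)', '', re.sub(r'\[.*\]', '', line))
print(repr(greedy))   # ' ghi '

# Negated classes stop at the nearest closer, so each group goes separately
step = re.sub(r'\[[^\[\]]*\]', '', line)
step = re.sub(r'\([^()]*\)', '', step)
tidy = " ".join(step.split())          # collapse the leftover whitespace
print(repr(tidy))     # 'abc def ghi'
```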

answered Apr 19, 2018 at 8:33

M.H Mighani

