I have a text like this:

text = '''Text1.
        Textt « text2 »
        Some other text'''

I want a regex that deletes the text inside the quotes (« ») along with the text before it, back to the preceding dot.

So the output should look like this:

text = "Text1.
    Some other text"

The code I am stuck on:

text = re.sub(r'\s*.*?»', '', text)

What the code actually does is delete more than expected. Here is an example:

text = '''Text1.
        Textt « text2 »
        Some other text
        Textt « text3 »
        other text'''

The output I get is this:

text="Text1.
    other text"
  • You probably want \..*?» as your RE, with the re.DOTALL option enabled so that .*? also matches newlines. Commented Mar 30, 2021 at 21:34
  • Your code seems to work. Could you explain the difference between the actual and the expected output? Commented Mar 30, 2021 at 21:35
  • It seems to work, right? ideone.com/rw6sy9 Commented Mar 30, 2021 at 21:35
  • The code seems to work, but I have a long text and it deletes a lot of it, so I thought something was wrong with it. Commented Mar 30, 2021 at 21:37
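
One possible explanation for the mismatch discussed in these comments, sketched under the assumption that the real text is not broken across lines the way the example is: without re.DOTALL, .*? cannot cross a newline, so on the multi-line example the match can only start on the quote's own line. On a single-line version of the same text, the match starts at the very beginning and takes "Text1." with it. The variable names below are illustrative only.

```python
import re

PATTERN = r'\s*.*?»'  # the pattern from the question, no flags

# Same content, once split across lines and once on a single line (assumed input shape).
multi_line = "Text1.\n        Textt « text2 »\n        Some other text"
single_line = "Text1. Textt « text2 » Some other text"

multi_result = re.sub(PATTERN, '', multi_line)
single_result = re.sub(PATTERN, '', single_line)

print(repr(multi_result))   # 'Text1.\n        Some other text'
print(repr(single_result))  # ' Some other text'
```

If the long text behaves like the second case, anchoring the match on the dot, as the answers below do, avoids losing the preceding sentence.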

3 Answers

You may use:

import re

text = '''Text1.
       Textt « text2 »
       Some other text'''

text = re.sub(r'\.[^«]*«[^»]*»', '.', text)

print(text)

To get this output:

Text1.
    Some other text

RegEx Demo

RegEx Explained:

  • \.: Match a dot
  • [^«]*: Match 0 or more characters that are not «
  • «: Match a «
  • [^»]*: Match 0 or more characters that are not »
  • »: Match a »

We just replace this matched text with a single dot to get our desired output.
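
As a rough sketch of how this pattern behaves with more than one quoted segment, here it is wrapped in a small helper. The names QUOTE_AFTER_DOT and strip_quoted are made up for illustration, and the sample assumes every quoted segment is preceded by a dot somewhere before it.

```python
import re

# Hypothetical names; the pattern itself is the one from this answer.
QUOTE_AFTER_DOT = re.compile(r'\.[^«]*«[^»]*»')

def strip_quoted(text):
    """Replace a dot, the text after it, and the following «...» with a single dot."""
    return QUOTE_AFTER_DOT.sub('.', text)

# Assumed sample: each quoted segment follows a sentence that ends with a dot.
sample = "Text1. Textt « text2 » Some other text. Textt « text3 » other text."
result = strip_quoted(sample)
print(result)  # Text1. Some other text. other text.
```

Note that a quoted segment with no dot anywhere before it is left untouched, because the match has to start at a dot.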

You can use

re.sub(r'(\.)[^.]*«[^«»]*»', r'\1', text)

See the regex demo.

  • (\.) - Group 1 (\1 in the replacement refers to this captured value): a dot
  • [^.]* - zero or more chars other than a .
  • «[^«»]*» - a substring between « and » without other « and » inside.

See a Python demo:

import re
text = "Text1.\n        Textt « text2 »\n        Some other text"
print( re.sub(r'(\.)[^.]*«[^«»]*»', r'\1', text) )
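
To illustrate why [^.]* is used here rather than [^«]*, the following sketch compares the two on an input with an extra sentence between the first dot and the quote. The sample string and variable names are assumptions made for the comparison, not taken from the question.

```python
import re

# Assumed sample: one sentence that should survive sits before the one to drop.
sample = "Intro. Keep this. Drop this « quote » Tail"

# [^«]* may run across dots, so the match starts at the FIRST dot.
across_dots = re.sub(r'\.[^«]*«[^»]*»', '.', sample)

# [^.]* cannot cross a dot, so the match starts at the dot NEAREST the quote.
nearest_dot = re.sub(r'(\.)[^.]*«[^«»]*»', r'\1', sample)

print(across_dots)  # Intro. Tail
print(nearest_dot)  # Intro. Keep this. Tail
```

Which behaviour is right depends on what "the text before it till the dot" means for the real input.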

4 Comments

@AdilKasbaoui Ok, you edited the question and now it is rather unclear. (\.)[\s\S]*«[^«»]*» would work for your example, but not sure it will work for the real input. The final solution will probably need to rely on more context.
@AdilKasbaoui It seems you need to add actual rules to explain what needs to be done.
@AdilKasbaoui Then use re.sub(r'(\.)[^.]*«[^«»]*»', r'\1', text), the answer is corrected.
@AdilKasbaoui Yes, it is easy: re.sub(r'([.?!])[^?!.]*«[^«»]*»', r'\1', text). I suspected this follow-up, which is why my solution is based on a capturing group + backreference: it can be expanded.
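
A quick check of the extended pattern from the last comment, on a made-up sample that ends its sentences with ? and ! instead of a dot. The backreference puts back whichever punctuation mark was captured.

```python
import re

# Pattern from the comment above: any of . ? ! may end the preceding sentence.
pattern = r'([.?!])[^?!.]*«[^«»]*»'

# Assumed sample text, not from the question.
sample = "Really? Textt « text2 » Some other text! Textt « text3 » end"

result = re.sub(pattern, r'\1', sample)
print(result)  # Really? Some other text! end
```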
Alternatively, you can use split (this handles a single quoted segment on one line):

text = "Text1. Textt « text2 » Some other text"

# Keep everything up to the first dot, then append whatever follows the closing quote.
text = text.split('«')[0].split('.')[0] + '.' + text.split('»')[1]

print(text)

Output:

Text1. Some other text
