Replace character between patterns

Question

I would like to split a text document on two new-line characters:

# document example
field1: content asd..\n\nfield2: content qwe...\n\nfield3: content asfdqegt

But sometimes a field contains new-line characters within its content (see field2):

field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt

Because of this, I can't simply use \n\n as the separator.

actual behavior:

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
s.split("\n\n")
['field1: content asd..',
 'field2: content',
 'qwe...',
 'field3: content asfdqegt']

expected output (I need to remove only the \n\n inside the content of field2, not every \n\n within the document):

s.split("\n\n")
['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']

my attempt:

import re
re.sub(r"(?<=field1: )(\n)(?<=field3: )", "", s) # does nothing
re.sub(r"\n", "", s) # replaces all \n, not just between field2 and field3

The fourth bird · Accepted Answer · 2021-06-17 20:24:40Z

3

You can match from field to field and replace the newlines from the matches.

^field\d+:.*(?:\n(?!field\d+:).*)*

^ Start of a line (with re.MULTILINE; without it, only the start of the string)
field\d+:.* Match field followed by 1+ digits, : and the rest of the line
(?: Non-capturing group to repeat as a whole
- \n Match a newline
- (?!field\d+:) Negative lookahead, assert that the next line does not start with the field pattern
- .* If the assertion is true, match the whole line
)* Close the group and repeat it zero or more times

As an example

import re

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
pattern = r"^field\d+:.*(?:\n(?!field\d+:).*)*"
res = [x.replace('\n', '') for x in re.findall(pattern, s, re.MULTILINE)]
print (res)

Output

['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']
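The re.MULTILINE flag is what lets ^ anchor at every line rather than only at the start of the string. A minimal sketch of the difference, reusing the same s and pattern (the names without_flag and with_flag are only illustrative), and showing the raw matches before the newlines are stripped:

```python
import re

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
pattern = r"^field\d+:.*(?:\n(?!field\d+:).*)*"

# Without re.MULTILINE, ^ matches only at the very start of the string,
# so only the first field is found.
without_flag = re.findall(pattern, s)

# With re.MULTILINE, ^ also matches right after each newline,
# so every line starting with "fieldN:" begins a new match.
with_flag = re.findall(pattern, s, re.MULTILINE)

print(without_flag)  # ['field1: content asd..\n']
print(with_flag[1])  # the raw field2 match still contains its newlines
```

Each raw match keeps the newlines inside the field, which is why the answer's list comprehension calls replace('\n', '') on every item.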

See a regex demo and a Python demo.

edited Jun 17, 2021 at 20:24

answered Jun 17, 2021 at 19:33

The fourth bird


3 Comments

Wiktor Stribiżew Over a year ago

Just FYI: you do not need \r? as the . matches a CR symbol even without re.S/re.DOTALL.

The fourth bird Over a year ago

@WiktorStribiżew Thank you, I think somebody also told me that in the past, I just forgot about it. It is not always the case that the . matches a \r, right? Let me read this page again.

Wiktor Stribiżew Over a year ago

Yes, not always. In PCRE, it is controlled with (*ANYCRLF) and suchlike; in Java, there is Pattern.UNIX_LINES (the (?d) modifier).
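For Python specifically, the point from these comments can be checked directly. This is only a small sketch; the CRLF sample string is made up for illustration and is not from the question:

```python
import re

# In Python's re module, "." matches any character except "\n"
# (unless re.DOTALL is set), so a carriage return is matched.
cr_match = re.fullmatch(r".", "\r")
lf_match = re.fullmatch(r".", "\n")

print(cr_match is not None)  # True
print(lf_match is None)      # True

# On a CRLF-terminated line, ".*" therefore also consumes the trailing "\r".
line = re.match(r".*", "field1: content\r\nfield2: more")
print(repr(line.group()))    # 'field1: content\r'
```

One consequence is that with CRLF input the "\r" ends up inside the matched line, so it may need to be stripped separately.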

Wiktor Stribiżew · Accepted Answer · 2021-06-17 19:29:48Z

2

You can use

import re
s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
output = [x.replace('\n', '') for x in re.split(r"\n\n(?=\w+:)", s)]
print(output)
# => ['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']

See the online demo. See also the regex demo.

The \n\n(?=\w+:) pattern matches two LF chars that are immediately followed by one or more word chars and then a : char. Because the lookahead is zero-width, the field name is not consumed and stays in the next chunk. After the string is split with this pattern, any remaining LF char is removed from each chunk with .replace('\n', '').
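A short sketch of the intermediate result, plus one caveat. The tricky string and the stricter field\d+: lookahead are assumptions added for illustration; they apply only if field names really follow the fieldN: shape shown in the question:

```python
import re

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"

# The split happens only where "\n\n" is followed by "word:",
# so the "\n\n" before "qwe..." is left inside the field2 chunk.
chunks = re.split(r"\n\n(?=\w+:)", s)
print(chunks)

# Hypothetical input: a content paragraph that itself starts with "word:".
tricky = "field1: a\n\nnote: b\n\nfield2: c"
loose = re.split(r"\n\n(?=\w+:)", tricky)        # also splits before "note:"
strict = re.split(r"\n\n(?=field\d+:)", tricky)  # splits only before fieldN:
print(loose)
print(strict)
```

If the real field names are arbitrary words, the \w+: lookahead from the answer is the more general choice; the stricter variant only helps when the names are known in advance.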

answered Jun 17, 2021 at 19:29

Wiktor Stribiżew


Benoit Dufresne · Accepted Answer · 2021-06-17 19:41:36Z

0

If your field identifier is always "fieldX", you could use that to split as well:

>>> s.split('\n\nfield')
['field1: content asd..', '2: content\n\nqwe...', '3: content asfdqegt']
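Note that this split consumes the "field" prefix of every chunk after the first and leaves the inner newlines in place. A possible follow-up, sketched here under the assumption that every identifier starts with the literal text field (the names parts, restored and result are illustrative):

```python
s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"

parts = s.split("\n\nfield")

# The separator is removed by split(), so put "field" back on every
# chunk except the first, which never lost it.
restored = [parts[0]] + ["field" + p for p in parts[1:]]

# Drop the newlines that belong to the field content.
result = [p.replace("\n", "") for p in restored]
print(result)
# ['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']
```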

answered Jun 17, 2021 at 19:41

Benoit Dufresne
