
I would like to split a text document on two new-line characters:

# document example
field1: content asd..\n\nfield2: content qwe...\n\nfield3: content asfdqegt

but sometimes fields contain new-line characters within their content (see field2):

field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt

Because of this, I can't simply use \n\n as the separator.


actual behavior:

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
s.split("\n\n")
['field1: content asd..',
 'field2: content',
 'qwe...',
 'field3: content asfdqegt']

expected output (I need to remove the \n\n inside the content of field2, i.e. between field2: and field3:, not every \n\n in the document):

s.split("\n\n")
['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']

my attempt:

import re
re.sub(r"(?<=field1: )(\n)(?<=field3: )", "", s) # does nothing
re.sub(r"\n", "", s) # replaces all \n, not just between field2 and field3

3 Answers

3

You can match from field to field and replace the newlines from the matches.

^field\d+:.*(?:\n(?!field\d+:).*)*
  • ^ Start of the string, or of each line when re.MULTILINE is set (as in the example below)
  • field\d+:.* Match field followed by 1+ digits, : and the rest of the line
  • (?: Non capture group to repeat as a whole
    • \n Match a newline
    • (?!field\d+:) Negative lookahead: assert that the next line does not start with the field pattern
    • .* If the assertion is true, match the whole line
  • )* Close the group and repeat it zero or more times

As an example

import re

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
pattern = r"^field\d+:.*(?:\n(?!field\d+:).*)*"
res = [x.replace('\n', '') for x in re.findall(pattern, s, re.MULTILINE)]
print(res)

Output

['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']

See a regex demo and a Python demo.
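If the field names are needed separately, the same pattern can be extended with two capture groups. This is only a sketch: the helper name parse_fields and the optional space after the colon are assumptions, not part of the original answer.

```python
import re

s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"

# Same idea as the pattern above, with two capture groups added:
# group 1 is the field name, group 2 is everything up to the next field line.
# The " ?" after the colon is an assumption about the input format.
pattern = re.compile(r"^(field\d+): ?(.*(?:\n(?!field\d+:).*)*)", re.MULTILINE)

def parse_fields(text):
    """Hypothetical helper: map each field name to its content, newlines removed."""
    return {name: body.replace("\n", "") for name, body in pattern.findall(text)}

fields = parse_fields(s)
print(fields)
# {'field1': 'content asd..', 'field2': 'contentqwe...', 'field3': 'content asfdqegt'}
```

With two groups in the pattern, re.findall returns (name, body) tuples, which is what the dict comprehension unpacks.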


3 Comments

Just FYI: you do not need \r? as the . matches a CR symbol even without re.S/re.DOTALL.
@WiktorStribiżew Thank you, I think somebody told me that in the past and I just forgot about it. It is not always the case that . matches \r, right? Let me read this page again.
Yes, not always: in PCRE it is controlled with (*ANYCRLF) and similar verbs; in Java there is Pattern.UNIX_LINES (the (?d) modifier).
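For Python specifically, the point made in these comments can be checked with a short sketch. The variable names are illustrative only.

```python
import re

# In Python's re module, "." matches any character except "\n" by default,
# so a carriage return is matched even without re.S / re.DOTALL.
cr_match = re.fullmatch(r".", "\r")
lf_match = re.fullmatch(r".", "\n")
lf_dotall = re.fullmatch(r".", "\n", re.DOTALL)

print(cr_match is not None)   # True
print(lf_match is not None)   # False
print(lf_dotall is not None)  # True

# Consequence for "\r\n" input: ".*" stops before "\n" but keeps the "\r".
first_line = re.match(r".*", "abc\r\ndef").group()
print(repr(first_line))       # 'abc\r'
```

So with Windows line endings the matched lines would still carry a trailing \r, which .replace('\n', '') alone does not remove.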
2

You can use

import re
s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"
output = [x.replace('\n', '') for x in re.split(r"\n\n(?=\w+:)", s)]
print(output)
# => ['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']

See the online demo. See also the regex demo.

The \n\n(?=\w+:) pattern matches two LF chars that are immediately followed by one or more word chars and then a : char. After the string is split with this pattern, any remaining LF char is removed from each chunk with .replace('\n', '').
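One thing to keep in mind is that \w+: is generic, so a continuation paragraph that happens to start with a word and a colon is also treated as a new field. The sketch below uses a made-up input to show the difference from a stricter field\d+: lookahead; the stricter pattern assumes the field names always look like fieldN.

```python
import re

# Hypothetical input: the continuation paragraph of field2 starts with "note:".
s = "field1: a\n\nfield2: b\n\nnote: c\n\nfield3: d"

# Generic lookahead: any "word:" after a blank line starts a new chunk.
generic = [x.replace("\n", "") for x in re.split(r"\n\n(?=\w+:)", s)]

# Stricter lookahead (assumes field names always look like "fieldN").
specific = [x.replace("\n", "") for x in re.split(r"\n\n(?=field\d+:)", s)]

print(generic)   # ['field1: a', 'field2: b', 'note: c', 'field3: d']
print(specific)  # ['field1: a', 'field2: bnote: c', 'field3: d']
```

Which lookahead is preferable depends on whether field names are known in advance.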


0

If your field identifier is always "fieldX", you could use that to split as well:

>>> s.split('\n\nfield')
['field1: content asd..', '2: content\n\nqwe...', '3: content asfdqegt']
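Note that the output above drops the "field" prefix from every chunk after the first and leaves the inner \n\n in place. A minimal sketch of one way to restore the prefix and strip the newlines, assuming every identifier really does start with "field":

```python
s = "field1: content asd..\n\nfield2: content\n\nqwe...\n\nfield3: content asfdqegt"

sep = "\n\nfield"
head, *rest = s.split(sep)

# str.split consumes the separator, so put "field" back on the later chunks.
parts = [head] + ["field" + chunk for chunk in rest]

# Remove the newlines that remain inside a field's content.
result = [p.replace("\n", "") for p in parts]
print(result)
# ['field1: content asd..', 'field2: contentqwe...', 'field3: content asfdqegt']
```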

