
I want to capture all the lines from a string of text using regex. How do I do that? None of these work. The first one almost works, but doesn't catch \r\n

import re

given_text = '1stline\n2ndline\r3rdline\r\n4thline'
list_of_lines = re.findall('(?m)^.*$', given_text)
print(list_of_lines)

list_of_lines = re.findall('(?m)^.*(\r\n|\r|\n|$)', given_text)
print(list_of_lines)

list_of_lines = re.findall(r'(?m)^.*?(\r\n|\r|\n|$)', given_text)
print(list_of_lines)

    To match all non-empty lines, you can use re.findall('[^\r\n]+', given_text). Or, you may use re.split(r'\r\n?|\n', given_text) if you need to get empty lines, too. Commented Apr 22, 2021 at 19:01

3 Answers


str.splitlines() is certainly the right tool for the job.

If you only need to handle CR (\r, carriage return) and LF (\n, line feed), the following regex solutions may help:

re.findall('[^\r\n]+', given_text) # Returns all non-empty lines split with one or more CR/LF chars
re.split(r'\r\n?|\n', given_text)  # Splits with the most common CRLF, CR or LF line endings

Note that the re.split solution returns empty lines, too, whereas the re.findall solution skips them.

Details

  • [^\r\n]+ - one or more chars other than CR and LF chars
  • \r\n?|\n - a CR and an optional LF char (\r\n?) or (|) a newline, LF, only (\n)

If you need to support all possible Unicode line breaks, you can use

re.findall(r'[^\r\n\x0B\x0C\x85\u2028\u2029]+', given_text)
re.split(r'\r\n?|[\n\x0B\x0C\x85\u2028\u2029]', given_text)

NOTES:

Char         Description
\r (\x0D)    CARRIAGE RETURN, CR
\n (\x0A)    LINE FEED, LF
\x0B         LINE TABULATION, VT
\x0C         FORM FEED, FF
\x85         NEXT LINE, NEL
\u2028       LINE SEPARATOR, LS
\u2029       PARAGRAPH SEPARATOR, PS

See a Python demo:

import re
given_text = '1stline\n2ndline\r3rdline\r\n4thline\r\n\r\nLast Line after an empty line'
print( re.findall('[^\r\n]+', given_text) )
# => ['1stline', '2ndline', '3rdline', '4thline', 'Last Line after an empty line']
print( re.split(r'\r\n?|\n', given_text) )
# => ['1stline', '2ndline', '3rdline', '4thline', '', 'Last Line after an empty line']
print( re.findall(r'[^\r\n\x0B\x0C\x85\u2028\u2029]+', given_text) )
# => ['1stline', '2ndline', '3rdline', '4thline', 'Last Line after an empty line']
print( re.split(r'\r\n?|[\n\x0B\x0C\x85\u2028\u2029]', given_text) )
# => ['1stline', '2ndline', '3rdline', '4thline', '', 'Last Line after an empty line']
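
As a rough comparison (a sketch, not part of the original demo), the Unicode-aware re.split pattern can be checked against str.splitlines(). Two differences show up: re.split keeps a trailing empty string when the text ends with a line break, and str.splitlines() also treats the separator characters \x1c, \x1d and \x1e as line boundaries, which the pattern above does not include. The sample strings below are made up for illustration.

```python
import re

# Pattern copied from the answer: CRLF or CR first, then the single-char breaks.
UNICODE_BREAKS = re.compile(r'\r\n?|[\n\x0B\x0C\x85\u2028\u2029]')

text = 'a\nb\r\nc\u2028d\x85e\n'  # note the trailing line break

regex_lines = UNICODE_BREAKS.split(text)
builtin_lines = text.splitlines()

print(regex_lines)    # ['a', 'b', 'c', 'd', 'e', '']
print(builtin_lines)  # ['a', 'b', 'c', 'd', 'e']

# str.splitlines() also splits on \x1c (file separator); the regex does not.
fs_text = 'x\x1cy'
fs_regex = UNICODE_BREAKS.split(fs_text)
fs_builtin = fs_text.splitlines()
print(fs_regex)    # ['x\x1cy']
print(fs_builtin)  # ['x', 'y']
```

If the two need to agree, one option is to drop a trailing empty string from the re.split result and add \x1c-\x1e to the character class.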

1 Comment

I appreciate the thoroughness. Regex seems like it should be simple, but there are so many weird subtleties that trip me up.

This code gives you the list of lines with regex:

import re
given_text = '1stline\n2ndline\r3rdline\r\n4thline'
list_of_lines = re.split(r'\r\n|\r|\n', given_text)
print(list_of_lines)

Result:

['1stline', '2ndline', '3rdline', '4thline']
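
One detail worth illustrating, as a small sketch rather than part of the original answer: the order of the alternatives matters. Regex alternation tries branches from left to right, so \r\n has to come before \r. Otherwise a CRLF pair is consumed as two separate breaks and an empty string appears in the result. The reordered pattern below is a deliberately wrong variant, included only for comparison.

```python
import re

given_text = '1stline\n2ndline\r3rdline\r\n4thline'

# \r\n is tried first, so a CRLF pair counts as a single line break.
good = re.split(r'\r\n|\r|\n', given_text)

# \r is tried first, so CRLF is split twice, leaving '' in between.
bad = re.split(r'\r|\n|\r\n', given_text)

print(good)  # ['1stline', '2ndline', '3rdline', '4thline']
print(bad)   # ['1stline', '2ndline', '3rdline', '', '4thline']
```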

2 Comments

Thanks, Franco. This seems to work well. I think Wiktor's works too and is a little bit more concise.
@RyanB.Jawad I posted the full answer. I have been tricked by Unicode line break chars so often in the past that I decided to include them in the solution.

While it doesn't use regex,

given_text.splitlines()

will produce

['1stline', '2ndline', '3rdline', '4thline']

Edit: Per your commented request, if you have to use regex,

re.split("\n\r+|\r\n+|\n+|\r+", given_text)

will also produce

['1stline', '2ndline', '3rdline', '4thline']
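
A caveat that may matter, shown here as a small sketch with made-up input: because of the + quantifiers, this pattern merges runs of the same line break, so blank lines are handled differently from splitlines(). With LF endings a blank line disappears, while with CRLF endings it is kept.

```python
import re

pattern = "\n\r+|\r\n+|\n+|\r+"  # pattern from the answer above

lf_text = 'a\n\nb'        # one blank line, LF endings
crlf_text = 'a\r\n\r\nb'  # one blank line, CRLF endings

lf_regex = re.split(pattern, lf_text)
crlf_regex = re.split(pattern, crlf_text)

print(lf_regex)                # ['a', 'b']      (blank line dropped)
print(crlf_regex)              # ['a', '', 'b']  (blank line kept)
print(lf_text.splitlines())    # ['a', '', 'b']
print(crlf_text.splitlines())  # ['a', '', 'b']
```

If blank lines need to be preserved consistently, splitlines() or a pattern without the + quantifiers, such as r'\r\n|\r|\n', is probably the safer choice.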

2 Comments

That's helpful. Thanks. I still would like to know how to do it with regex.
Updated with one method using regex.
