Regex to get text between multiple newlines in Python

Question

I am trying to split a text on the segments that fall between \n\n and \n, in that order. Take this string, for example:

\n\nMy take on fruits.\n\nHealthy Fruits\nAn apple is a fruit and it\'s very good.\n\nPears are good as well. Bananas are very good too and healthy.\n\nSour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\nGrapefruits are even more sour, if you can believe it.

My desired output is:

[('Healthy Fruits', "An apple is a fruit and it's very good.", 'Pears are good as well. Bananas are very good too and healthy.'), ('Sour Fruits', 'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]

I want to parse it like this because anything between \n\n and \n is a title, and the rest is the text under that title (so "Healthy Fruits" and "Sour Fruits" are the titles). I am not sure whether this is the best way to grab the titles and their text.

Maybe re.findall('(?<!\n)\n\n(.+)\n(?!\n)((?s:.*?))(?=\n\n|\Z)', text) will do.
– Wiktor Stribiżew, Commented Apr 20, 2021 at 14:20
@dawg Thanks, I edited my question. I wanted to group the last sentence, about Grapefruits, with the Oranges sentence, as they are under the same title. Would this be possible?
– user112947, Commented Apr 20, 2021 at 14:38
Why with regex? I would use another way. By the way, why do you expect the last one, "Grapefruits are even more sour, if you can believe it."?
– Nir Elbaz, Commented Apr 20, 2021 at 14:39
@WiktorStribiżew Thanks, I edited my question. I wanted to group the last sentence, about Grapefruits, with the Oranges sentence, as they are under the same title. Would this be possible? Right now it just takes the Oranges sentence instead of combining the Oranges and Grapefruits sentences. I would like: [('Healthy Fruits', "An apple is a fruit and it's very good. Bananas are very good too and healthy."), ('Sour Fruits', 'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]
– user112947, Commented Apr 20, 2021 at 14:39

dawg · Accepted Answer · 2021-04-20 15:29:32Z

1

Given:

txt='''\
\n\nMy take on fruits.\n\nHealthy Fruits\nAn apple is a fruit and it\'s very good.\n\nPears are good as well. Bananas are very good too and healthy.\n\nSour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\nGrapefruits are even more sour, if you can believe it.'''

desired=[('Healthy Fruits', "An apple is a fruit and it's very good.", 'Pears are good as well. Bananas are very good too and healthy.'), ('Sour Fruits', 'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]

You can use the regex:

r'\n\n([\s\S]*?)(?=(?:\n\n.*\n[^\n])|\Z)'
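
To make the pieces of that pattern easier to follow, here is the same regex written with re.VERBOSE and one comment per part. This is only an annotated sketch: the names SECTION, sample and blocks, and the short sample string, are made up for illustration and are not part of the original answer.

```python
import re

# The pattern from the answer, spread out with re.VERBOSE so each part can be commented.
SECTION = re.compile(r"""
    \n\n                 # a blank line opens a candidate block
    ([\s\S]*?)           # capture lazily; [\s\S] also matches newlines
    (?=                  # stop right before either ...
        \n\n.*\n[^\n]    #   a blank line, one full line (the next title), then a text line
      | \Z               #   or the very end of the string
    )
""", re.VERBOSE)

# Hypothetical, shortened input with the same shape as the text in the question.
sample = "\n\nIntro.\n\nTitle A\nFirst.\n\nSecond.\n\nTitle B\nThird."
blocks = SECTION.findall(sample)
print(blocks)
# ['Intro.', 'Title A\nFirst.\n\nSecond.', 'Title B\nThird.']
```

The intro block contains no single newline, which is why a later filter on '\n' can discard it.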

Python demo:

>>> import re
>>> sp=[tuple(re.split('\n+',l)) for l in re.findall(r'\n\n([\s\S]*?)(?=(?:\n\n.*\n[^\n])|\Z)',txt) if '\n' in l]

>>> sp
[('Healthy Fruits', "An apple is a fruit and it's very good.", 'Pears are good as well. Bananas are very good too and healthy.'), ('Sour Fruits', 'Oranges are on the sour side and contains a lot of vitamin C.', 'Grapefruits are even more sour, if you can believe it.')]

>>> sp==desired
True


1 Comment

user112947 Over a year ago

Thank you very much. This was exactly what I was looking for.

Nir Elbaz · Answer · 2021-04-20 14:55:47Z

1

This is not regex, but it works:

text="\n\nMy take on fruits.\n\nHealthy Fruits\nAn apple is a fruit and it\'s very good. Bananas are very good too and healthy.\n\nSour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\nGrapefruits are even more sour, if you can believe it."
NewList=[]
Newtext=text.split("\n\n")
for line in Newtext:
    # Keep only the blocks that hold a title followed by its first line of text.
    if "\n" in line:
        NewList.extend(line.split("\n"))

# Join the trailing paragraph onto the last entry so it stays with its title.
NewList[-1]=NewList[-1]+" "+Newtext[-1]
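
The snippet above yields one flat list. If the goal is the list of tuples shown in the question, the same split-on-blank-lines idea could be extended roughly as follows. This is a minimal sketch, not the answerer's code: group_sections is a hypothetical helper name, and it assumes that every block containing a single newline is a title followed by its first paragraph.

```python
def group_sections(text):
    """Group each paragraph under the most recent title block."""
    sections = []
    for block in text.split("\n\n"):
        if not block:
            continue  # empty piece produced by the leading blank lines
        if "\n" in block:
            # "title\nfirst paragraph": this block opens a new section.
            sections.append(block.split("\n"))
        elif sections:
            # A plain paragraph belongs to the section opened most recently.
            sections[-1].append(block)
        # Paragraphs that appear before the first title (the intro) are dropped.
    return [tuple(section) for section in sections]


# The string from the question, split over several literals for readability.
text = (
    "\n\nMy take on fruits.\n\n"
    "Healthy Fruits\nAn apple is a fruit and it's very good.\n\n"
    "Pears are good as well. Bananas are very good too and healthy.\n\n"
    "Sour Fruits\nOranges are on the sour side and contains a lot of vitamin C.\n\n"
    "Grapefruits are even more sour, if you can believe it."
)
result = group_sections(text)
```

Unlike the regex version, this walks the blocks once and needs no lookahead, at the cost of hard-coding the rule for what counts as a title.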


1 Comment

user112947 Over a year ago

Thank you for your help!
