In Python, how do I extract multiple blocks of text that begin with same pattern, but no distinct end?

Question

Given a test string:

teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

I want to create a list of results like this:

result=['chapter 1 Here is a block of text from chapter one.','chapter 2 Here is another block of text from the second chapter.','chapter 3 Here is the third and final block of text.']

Using re.findall('chapter [0-9]',teststr)

I get ['chapter 1', 'chapter 2', 'chapter 3']

That's fine if all I wanted were the chapter numbers, but I want the chapter number plus all the text up to the next chapter number. In the case of the last chapter, I want to get the chapter number and the text all the way to the end.

Trying re.findall('chapter [0-9].*',teststr) yields the greedy result, the whole string as a single match: ['chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.']

I'm not great with regular expressions so any help would be appreciated.

pattern = re.compile(r'chapter (?:(?!\s+chapter \d+).)+') and use pattern.findall
– Chris Charley, Commented Mar 12, 2020 at 19:00
You could improve your example by adding some text at the beginning that did not include "chapter". To identify a match, must "chapter" be followed by one space, one or more digits, then at least one space? Can "chapter" be "Chapter"? These questions arise from the fact that you are asking a question in terms of a single example. That rarely makes the question unambiguous. You need to state your question in words, precisely and unambiguously, then use one or more examples for illustration...
– Cary Swoveland, Commented Mar 12, 2020 at 19:06
...Here's an example of a possible statement of the question that is intended to be complete and unambiguous (but is only my guess of what you want): "I wish to extract all strings that begin with '[cC]hapter \d+ ', where '[cC]' represents a 'c' or a 'C' and '\d+' represents one or more digits, and end with a period, followed by zero or more spaces, followed by the end of the string or another string '[cC]hapter \d+ '".
– Cary Swoveland, Commented Mar 12, 2020 at 19:07
To make it case-insensitive, use pattern = re.compile(r'(?i)chapter (?:(?!\s+chapter \d+).)+') and then matches = pattern.findall(teststr)
– Chris Charley, Commented Mar 12, 2020 at 19:24
Maybe re.split(r'(?!^)(?=chapter \d)', teststr) is enough? See the Python demo.
– Wiktor Stribiżew, Commented Mar 12, 2020 at 22:51

Wiktor Stribiżew · Accepted Answer · 2022-02-08 22:29:27Z

In general, an extraction regex looks like

(?s)pattern.*?(?=pattern|$)

Or, if the pattern is at the start of a line,

(?sm)^pattern.*?(?=\npattern|\Z)
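A minimal sketch of the line-anchored form, assuming each heading is "chapter" plus digits at the start of a line. The sample text and variable names are made up for illustration; a mid-line mention such as "see chapter 1 above" does not end a block, because the lookahead requires a newline before the heading.

```python
import re

# Hypothetical multi-line input: headings start a line, bodies may span lines.
text = (
    "chapter 1 Intro\n"
    "more intro text\n"
    "chapter 2 Body\n"
    "see chapter 1 above\n"
    "chapter 3 End"
)

# (?s) lets . match newlines; (?m) lets ^ match at the start of every line.
pattern = re.compile(r'(?sm)^chapter \d+.*?(?=\nchapter \d+|\Z)')
sections = pattern.findall(text)
print(sections)
```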

Here, you could use

re.findall(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', text)

See this regex demo. Details:

chapter [0-9] - chapter + space and a digit
.*? - any zero or more chars, as few as possible
(?=chapter [0-9]|\Z) - a positive lookahead that matches a location immediately followed with chapter, space, digit, or end of the whole string.
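As a runnable sketch of the pattern described above, applied to the string from the question: the re.S flag and the strip() call are additions here, on the assumption that real input may contain newlines and that the trailing spaces before each following heading are unwanted.

```python
import re

teststr = ('chapter 1 Here is a block of text from chapter one.  '
           'chapter 2 Here is another block of text from the second chapter.  '
           'chapter 3 Here is the third and final block of text.')

# Lazy match from one heading up to (but not including) the next heading
# or the end of the string; re.S lets . cross newlines if there are any.
pattern = re.compile(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', re.S)

# Each match keeps the spaces that precede the next heading, so strip them.
blocks = [m.strip() for m in pattern.findall(teststr)]
print(blocks)
```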

Here, since the text starts with the keyword, you may use

import re
teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'
my_result = [x.strip() for x in re.split(r'(?!^)(?=chapter \d)', teststr)]
print( my_result )
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']

See the Python demo. The (?!^)(?=chapter \d) regex means:

(?!^) - find a location that is not at the start of string and
(?=chapter \d) - is immediately followed with chapter, space and any digit.

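The pattern splits the string at the matched locations without consuming any characters, so the whitespace between chapters stays attached to the pieces; hence the results are stripped of whitespace in a list comprehension. Note that splitting on a zero-width match requires Python 3.7 or later.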
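If the text might not start with the keyword, one possible variation is to split before every heading and keep only the pieces that begin with one. This is a sketch, not part of the original answer: split_chapters is a hypothetical helper name, and it assumes any leading text before the first heading should be discarded.

```python
import re

def split_chapters(text):
    # Split before every "chapter <digit>"; zero-width splits need Python 3.7+.
    pieces = re.split(r'(?=chapter \d)', text)
    # Keep only pieces that start with a heading. This drops a preamble, and
    # also the empty string produced when the text itself starts with a heading.
    return [p.strip() for p in pieces if re.match(r'chapter \d', p)]

with_preamble = split_chapters('Preface text.  chapter 1 First.  chapter 2 Second.')
without_preamble = split_chapters('chapter 1 First.  chapter 2 Second.')
print(with_preamble)
print(without_preamble)
```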

Ed Ward · Accepted Answer · 2020-03-12 19:03:25Z

If you don't have to use a regex, try this:

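def split(text):
    """Split text into blocks that each start with 'chapter <digit>'."""
    chapters = []

    this_chapter = ""
    for i, c in enumerate(text):
        # A new block starts where "chapter " is followed by a digit.
        # The bounds check avoids an IndexError if text ends with "chapter ".
        if text.startswith("chapter ", i) and i + 8 < len(text) and text[i + 8].isdigit():
            if this_chapter.strip():
                chapters.append(this_chapter.strip())
            this_chapter = c
        else:
            this_chapter += c

    chapters.append(this_chapter.strip())

    return chapters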

print(split('chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'))

Output:

['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']
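Another regex-free sketch, for comparison: instead of building each block one character at a time, it finds the heading positions with str.find and slices between them. The function name split_on_headings and the keyword parameter are made up for illustration. Unlike the loop above, it drops any text that precedes the first heading.

```python
def split_on_headings(text, keyword="chapter "):
    # Collect the index of every keyword occurrence that is followed by a digit.
    starts = []
    pos = text.find(keyword)
    while pos != -1:
        after = pos + len(keyword)
        if after < len(text) and text[after].isdigit():
            starts.append(pos)
        pos = text.find(keyword, pos + 1)

    # Each block runs from its heading to the next heading (or the end of text).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]

teststr = ('chapter 1 Here is a block of text from chapter one.  '
           'chapter 2 Here is another block of text from the second chapter.  '
           'chapter 3 Here is the third and final block of text.')
result = split_on_headings(teststr)
print(result)
```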

Juan C · Accepted Answer · 2020-03-12 19:24:23Z

You're looking for re.split. Assuming up to 99 chapters:

import re
teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

chapters = [i.strip() for i in re.split(r'chapter \d{1,2}', teststr)[1:]]

Output:

['Here is a block of text from chapter one.',
 'Here is another block of text from the second chapter.',
 'Here is the third and final block of text.']
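The output above omits the "chapter N" headings, which the question wanted to keep. One way to retain them, sketched here as a variation rather than as part of the original answer, is to wrap the pattern in a capturing group: re.split then returns the headings as well, and each heading can be rejoined with the body that follows it.

```python
import re

teststr = ('chapter 1 Here is a block of text from chapter one.  '
           'chapter 2 Here is another block of text from the second chapter.  '
           'chapter 3 Here is the third and final block of text.')

# With a capturing group, re.split keeps the delimiters in the result:
# [text before first heading, heading, body, heading, body, ...]
parts = re.split(r'(chapter \d{1,2})', teststr)

# Pair each heading (odd indices) with the body after it (even indices from 2).
chapters = [(heading + body).strip()
            for heading, body in zip(parts[1::2], parts[2::2])]
print(chapters)
```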

edited Mar 12, 2020 at 19:24

answered Mar 12, 2020 at 18:47

Juan C

1 Comment

Juan C Over a year ago

Now it is fixed
