1

Given a test string:

teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

I want to create a list of results like this:

result=['chapter 1 Here is a block of text from chapter one.','chapter 2 Here is another block of text from the second chapter.','chapter 3 Here is the third and final block of text.']

Using re.findall('chapter [0-9]',teststr)

I get ['chapter 1', 'chapter 2', 'chapter 3']

That's fine if all I wanted were the chapter numbers, but I want the chapter number plus all the text up to the next chapter number. In the case of the last chapter, I want to get the chapter number and the text all the way to the end.

Trying re.findall('chapter [0-9].*',teststr) yields the greedy result, a single match that spans the whole string: ['chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.']

I'm not great with regular expressions so any help would be appreciated.

11
  • pattern = re.compile(r'chapter (?:(?!\s+chapter \d+).)+') and use pattern.findall Commented Mar 12, 2020 at 19:00
  • You could improve your example by adding some text at the beginning that does not include "chapter". To identify a match, must "chapter" be followed by one space, one or more digits, then at least one space? Can "chapter" be "Chapter"? These questions arise from the fact that you are asking a question in terms of a single example. That rarely makes the question unambiguous. You need to state your question in words, precisely and unambiguously, then use one or more examples for illustration... Commented Mar 12, 2020 at 19:06
  • ..Here's an example of a possible statement of the question that is intended to be complete and unambiguous (but is only my guess of what you want): "I wish to extract all strings that begin '[cC]hapter d+ ', where '[cC]' represents a 'c' or a 'C' and 'd+' represents one or more digits, and ends with a period, followed by zero or more spaces followed by the end of the string or another string '[cC]hapter d+ '". Commented Mar 12, 2020 at 19:07
  • To make case insensitive, pattern = re.compile(r'(?i)chapter (?:(?!\s+chapter \d+).)+') and then use matches = pattern.findall(teststr) Commented Mar 12, 2020 at 19:24
  • Maybe re.split(r'(?!^)(?=chapter \d)', teststr) is enough? See the Python demo. Commented Mar 12, 2020 at 22:51

3 Answers

1

In general, an extraction regex looks like

(?s)pattern.*?(?=pattern|$)

Or, if the pattern is at the start of a line,

(?sm)^pattern.*?(?=\npattern|\Z)
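As a rough illustration of the line-anchored form, here is a minimal sketch. The multi-line sample text is made up for this example (it is not from the question), and the results are stripped because the last block runs to the very end of the string, trailing newline included.

```python
import re

# Hypothetical multi-line input: each chapter heading starts its own line.
text = (
    "chapter 1 Intro\n"
    "First body line mentions chapter 2 in passing.\n"
    "chapter 2 Next\n"
    "Second body line.\n"
)

# (?s) lets . match newlines; (?m) lets ^ match at the start of every line.
# The lookahead stops each block right before a newline + heading, or at \Z.
pattern = r"(?sm)^chapter [0-9].*?(?=\nchapter [0-9]|\Z)"

blocks = [block.strip() for block in re.findall(pattern, text)]
print(blocks)
```

The inline mention of "chapter 2" in the first body line does not end the block, because the lookahead only accepts a heading that follows a newline.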

Here, you could use

re.findall(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', text)

See this regex demo. Details:

  • chapter [0-9] - chapter + space and a digit
  • .*? - any zero or more chars other than line break chars (no s / re.DOTALL flag is used here), as few as possible
  • (?=chapter [0-9]|\Z) - a positive lookahead that matches a location immediately followed with chapter, space, digit, or end of the whole string.
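Applied to the string from the question, that could look like the sketch below. The variable names raw and result are only illustrative. Since the lazy part also swallows the spaces before the next heading, the items are stripped afterwards.

```python
import re

teststr = ('chapter 1 Here is a block of text from chapter one.  '
           'chapter 2 Here is another block of text from the second chapter.  '
           'chapter 3 Here is the third and final block of text.')

# Each match runs from a heading up to (not including) the next heading,
# or to the end of the string for the last chapter.
raw = re.findall(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', teststr)

# The separating spaces are part of each match, so trim them.
result = [block.strip() for block in raw]
print(result)
```

Note that "chapter one" and "chapter." inside the text do not stop a match, because the lookahead requires a space and a digit after "chapter".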

Here, since the text starts with the keyword, you may use

import re
teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'
my_result = [x.strip() for x in re.split(r'(?!^)(?=chapter \d)', teststr)]
print( my_result )
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']

See the Python demo. The (?!^)(?=chapter \d) regex means:

  • (?!^) - find a location that is not at the start of string and
  • (?=chapter \d) - is immediately followed with chapter, space and any digit.

The pattern is used to split the string at the found locations. It does not consume any chars, so the whitespace between chapters stays in the pieces; hence, the results are stripped of whitespace in a list comprehension.
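If the input is less tidy than the sample, the same splitting idea can be wrapped in a small helper. This is only a sketch under assumptions the question does not confirm: the name split_chapters is hypothetical, headings may be capitalized, and chapter numbers may have several digits. Splitting on an empty match like this needs Python 3.7 or later.

```python
import re

# Assumptions for illustration: case-insensitive headings, multi-digit numbers,
# and "chapter" must start a word (so "subchapter 1" is not a boundary).
CHAPTER_BOUNDARY = re.compile(r'(?!^)(?=\bchapter \d+\b)', re.IGNORECASE)

def split_chapters(text):
    """Split text before every chapter heading and drop empty pieces."""
    parts = CHAPTER_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]

print(split_chapters('Chapter 10 Ten.  chapter 11 Eleven.'))
```

Any text before the first heading is kept as its own item, so a caller that only wants chapters may need to discard a leading piece that does not start with a heading.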


0

If you don't have to use a regex, try this:

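def split(text):
    chapters = []

    this_chapter = ""
    for i, c in enumerate(text):
        # A new chapter starts where "chapter " is followed by a digit.
        # The slice text[i+8:i+9] is empty at the end of the text, so this
        # cannot raise IndexError when the text ends with "chapter ".
        if text.startswith("chapter ", i) and text[i+8:i+9].isdigit():
            if this_chapter.strip():
                chapters.append(this_chapter.strip())
            this_chapter = c
        else:
            this_chapter += c

    chapters.append(this_chapter.strip())

    return chapters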

print(split('chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'))

Output:

['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']


0

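You're looking for re.split. Assuming up to 99 chapters (note that the "chapter N" headings are consumed by the split, so they do not appear in the output):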

import re
teststr= 'chapter 1 Here is a block of text from chapter one.  chapter 2 Here is another block of text from the second chapter.  chapter 3 Here is the third and final block of text.'

chapters = [i.strip() for i in re.split(r'chapter \d{1,2}', teststr)[1:]]

Output:

['Here is a block of text from chapter one.',
 'Here is another block of text from the second chapter.',
 'Here is the third and final block of text.']
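To keep the chapter headings with this approach, one option is to wrap the pattern in a capturing group, since re.split then returns the matched separators as well. The following is a sketch of that variation, not the answer's original code; the names pieces, headings and bodies are illustrative.

```python
import re

teststr = ('chapter 1 Here is a block of text from chapter one.  '
           'chapter 2 Here is another block of text from the second chapter.  '
           'chapter 3 Here is the third and final block of text.')

# With a capturing group, the list alternates: text before the first heading,
# then heading, body, heading, body, ...
pieces = re.split(r'(chapter \d{1,2})', teststr)

headings = pieces[1::2]  # 'chapter 1', 'chapter 2', ...
bodies = pieces[2::2]    # the text that follows each heading

# Glue each heading back onto its body and trim the separating spaces.
chapters = [(heading + body).strip() for heading, body in zip(headings, bodies)]
print(chapters)
```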

1 Comment

Now it is fixed
