Regex Expression For a String

Question

I want to split a string in Python.

Sample string:

Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more

into the following list:

['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1',
 'and', 'SCENE 2', 'and more']

Can someone help me build the regex? The one that I have built is:

(ACT [A-Z]+.\sSCENE\s[0-9]+)]?(.*)(SCENE [0-9]+)

But this is not working properly.
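
To see why the attempt falls short, it can help to run it through re.findall and look at the captured groups. The sketch below only uses the pattern and sample string from the question; the variable names are illustrative. The greedy (.*) runs to the last SCENE in the string, so the whole input yields a single match, and the stray ]? merely matches an optional literal ']'.

```python
import re

text = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
attempt = r"(ACT [A-Z]+.\sSCENE\s[0-9]+)]?(.*)(SCENE [0-9]+)"

# With capturing groups in the pattern, re.findall returns one tuple of groups per match.
result = re.findall(attempt, text)
print(len(result))  # how many matches were found

# Show each captured group of the first match on its own line.
for group in result[0]:
    print(repr(group))
```

The leading "Hi this is" and the trailing "and more" are not captured at all, and everything between the first marker and the last SCENE ends up in one group.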

41686d6564 · Accepted Answer · 2019-11-07 07:45:09Z

2

If I understand your requirements correctly, you may use the following pattern:

(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))


Breakdown:

(?:                    # Start of a non-capturing group.
    ACT|SCENE          # Matches either 'ACT' or 'SCENE'.
)                      # Close the non-capturing group.
.+?                    # Matches one or more characters (lazy matching).
\d+                    # Matches one or more digits.
|                      # Alternation (OR).
\S                     # Matches a non-whitespace character (to trim spaces).
.*?                    # Matches zero or more characters (lazy matching).
(?=                    # Start of a positive Lookahead (i.e., followed by...).
    \s?                # An optional whitespace character (to trim spaces).
    (?:ACT|SCENE|$)    # Followed by either 'ACT' or 'SCENE' or the end of the string.
)                      # Close the Lookahead.
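
The same breakdown can be kept inside the code by compiling the pattern with re.VERBOSE, which ignores whitespace and allows # comments in the pattern. This is a sketch; the names PATTERN and sample are illustrative, and the shortened sample string is an assumption made for brevity.

```python
import re

# Same pattern as above, spread over several lines with inline comments.
PATTERN = re.compile(
    r"""
    (?:ACT|SCENE)            # a marker starts with 'ACT' or 'SCENE'
    .+?                      # then as few characters as possible
    \d+                      # up to and including a run of digits
    |                        # OR
    \S                       # free text starts at a non-whitespace character
    .*?                      # and extends lazily
    (?=\s?(?:ACT|SCENE|$))   # until just before the next marker or the end
    """,
    re.VERBOSE,
)

sample = "Hi this is ACT I. SCENE 1 and SCENE 2"
print(PATTERN.findall(sample))
```

This works unchanged here because the pattern contains no literal spaces; a pattern with literal spaces would need them escaped or written as \s under re.VERBOSE.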

Python example:

import re

regex = r"(?:ACT|SCENE).+?\d+|\S.*?(?=\s?(?:ACT|SCENE|$))"
test_str = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"

# Avoid naming the result 'list', which would shadow the built-in type.
matches = re.findall(regex, test_str)
print(matches)

Output:

['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']
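
Since the question is phrased as a split, a similar result can also be sketched with re.split: when the separator pattern contains a capturing group, the captured text is kept in the result. The marker pattern below is an assumption, not part of the answer above: it expects an optional "ACT <roman numeral>." prefix followed by "SCENE <number>", and the names MARKER and split_on_markers are hypothetical.

```python
import re

# Assumed marker shape: optional "ACT <roman numeral>." followed by "SCENE <number>".
# The surrounding \s* sits outside the group, so neighbouring spaces are dropped.
MARKER = re.compile(r"\s*((?:ACT\s+[IVXLCDM]+\.\s+)?SCENE\s+\d+)\s*")

def split_on_markers(text):
    # The capturing group makes re.split keep each marker in the result.
    # Empty strings appear when the text starts or ends with a marker, so drop them.
    return [part for part in MARKER.split(text) if part]

text = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = split_on_markers(text)
print(parts)
```

Unlike the findall pattern, this variant treats an ACT that is not followed by a SCENE as ordinary text, which may or may not be what is wanted.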


edited Nov 7, 2019 at 7:45

answered Nov 7, 2019 at 7:21

41686d6564


3 Comments

user12336036 Over a year ago

@TimBiegeleisen Yes, I want the regex to identify only ACT I. SCENE 1, SCENE 2, ACT II. SCENE 1, and SCENE 2; everything else at the start, at the end, or in between should appear as separate elements in the list.

41686d6564 Over a year ago

@Tim Naturally! However, that's what the OP used in the pattern in the post. So, I assume it's what they want to use to split the string.

user12336036 Over a year ago

I want it to identify only the CAPS parts, (ACT 1. SCENE 1) together and (SCENE 2) individually, and to treat everything between them, or at the front or end, as one element.

Tim Biegeleisen · Accepted Answer · 2019-11-07 07:14:47Z

Here is a working script, albeit a bit hackish:

import re

inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = re.findall(r'[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*', inp)
print(parts)

This prints:

['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1',
 'and', 'SCENE 2', 'and more']

An explanation of the regex logic, which uses an alternation to match one of two cases:

[A-Z]{2,}              match TWO or more capital letters
(?: [A-Z0-9.]+)*       followed by zero or more words, consisting only of
                       capital letters, numbers, or period
|                      OR
(?![A-Z]{2})\w+        match a word which does NOT start with two capital letters
(?: (?![A-Z]{2})\w+)*  then match zero or more similar terms
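
Because this pattern keys on letter case rather than on the words ACT and SCENE, any word that starts with two capitals is treated as a marker. A small sketch with a hypothetical input (the NASA sentence is made up for illustration) shows the effect:

```python
import re

# Pattern from the answer above: markers are recognised by runs of capital letters.
case_based = r'[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*'

# Hypothetical input containing an all-caps word that is not an ACT/SCENE marker.
text = "We met at NASA before SCENE 3 ended"
parts = re.findall(case_based, text)
print(parts)
```

Here 'NASA' comes out as its own element, which is fine if all-caps words should always be split off, and a limitation otherwise.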

Ajax1234 · Accepted Answer · 2019-11-07 13:24:42Z

0

You can use re.findall:

import re
s = 'Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more'
# Raw string so that \d and \s reach the regex engine without escape warnings.
new_s = list(map(str.strip, re.findall(r'[A-Z\d\s\.]{2,}|^[A-Z]{1}[a-z\s]+|[a-z\s]+', s)))
print(new_s)

Output:

['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']

answered Nov 7, 2019 at 13:24

Ajax1234

