Splitting a string based on a pattern in Python

Question

I have long strings such as

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

I want to split them based on the pattern "a number, a space, a dash, a space, some string until the next number, a space, a dash, a space or end of string". Notice that the string may contain commas, ampersands, '>' and other special characters, so splitting on them will not work. I think there is a way in Python to split based on regular expressions but I have trouble forming that.

I have a very introductory knowledge of regular expressions. I can form a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".

Update: Expected output:

first case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]

I suggest checking out the excellent website regular-expressions.info where you can learn much, much more about them and answer your own question.
– i alarmed alien, Commented Jul 26, 2018 at 8:31
@TapalGoosal Try my solution. If your categories can contain digits, you can't rely on whitelisting or \D+.
– Wiktor Stribiżew, Commented Jul 26, 2018 at 8:48

akash karothiya · Accepted Answer · 2018-07-26 08:58:39Z

7

Here is the pattern: each item starts with a number, so we use [0-9]+. It is followed by text that may contain letters, spaces and special characters such as - & > and commas, so we can use [a-zA-Z \-&>,]+:

>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

The other string mentioned in the question:

>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ', 
 '4321 - Apparel & Accessories > Handbags, Wallets & Cases, ', 
 '187 - Apparel & Accessories > Shoes']
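
The results above still carry the trailing ", " separator, which differs from the output requested in the question. One possible cleanup is to strip those characters from each match. This is only a sketch: it assumes no category legitimately ends in a comma or a space, and the names `pattern` and `items` are illustrative.

```python
import re

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

# Same character whitelist as in the answer. The (?is) flags are dropped here
# because the class already lists both letter cases and the pattern has no dot.
pattern = r"[0-9]+[a-zA-Z \-&>,]+"

# rstrip(", ") removes any run of trailing commas and spaces from each match.
items = [m.rstrip(", ") for m in re.findall(pattern, s)]
print(items)
```

Note that the whitelist still has to list every character that can occur in a category name, so this only tidies the output and does not make the pattern more general.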

edited Jul 26, 2018 at 8:58

Wiktor Stribiżew


answered Jul 26, 2018 at 8:33

akash karothiya


4 Comments

Charles Over a year ago

Shouldn't you add a ',' to your regex when matching special characters, such as (?is)([0-9]+[a-zA-Z \-&>,]+)? This way you'll match items that come after a comma in the strings given in the OP.

blhsing Over a year ago

The strings may contain commas.

Tapal Goosal Over a year ago

For the first example I would like the second element in the list to be "5678 - Apparel, Accessories & Luxury Goods".

Wiktor Stribiżew Over a year ago

Your second string demo results still include commas.

blhsing · Accepted Answer · 2018-07-26 09:21:11Z

3

If numbers appear only at the beginning of each segment of strings, you can do:

import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))

This outputs:

['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']

This regex pattern uses \d+ to match the number first, then \D+ to match non-digits, and the lookahead (?=,\s*\d|$) to make sure the non-digit run stops where it is followed either by a comma, optional whitespace and another digit, or by the end of the string, so that the resulting match does not include a trailing comma and space.
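
The lookahead idea described above can also be used with re.split, so that the separators are consumed and the items fall out directly. This is a sketch rather than part of the original answer; it assumes every item starts with digits, whitespace, a dash and whitespace, and the name `parts` is illustrative.

```python
import re

s = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

# Split on a comma plus optional whitespace, but only where the text that
# follows looks like the start of a new item: digits, whitespace, dash, whitespace.
# The lookahead is zero-width, so the number stays with the item that follows.
parts = re.split(r",\s*(?=\d+\s+-\s)", s)
print(parts)
```

Commas inside a category, such as the one before "Wallets", are left alone because they are not followed by a number and a dash.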

edited Jul 26, 2018 at 9:21

answered Jul 26, 2018 at 8:45

blhsing


3 Comments

Wiktor Stribiżew Over a year ago

Note that \s* at the end makes no sense since \D+ already consumes all those whitespaces.

Tapal Goosal Over a year ago

Yes, can you also briefly explain?

blhsing Over a year ago

@TapalGoosal Just added some explanations to my answer.

Wiktor Stribiżew · Accepted Answer · 2018-07-26 08:45:30Z

2

You may match substrings that start with one or more digits followed by one or more whitespace characters, a -, and one or more whitespace characters, and that end right before the same pattern or at the end of the string.

re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)


Note: If the leading number always has at least a certain number of digits, say two or more, you may replace \d+ with \d{2,}, and so on. Adjust as you see fit.

Pattern details:

\d+ - one or more digits
\s+-\s+ - a - surrounded by one or more whitespace characters
.*? - any zero or more characters, as few as possible, up to the position in the string that is followed by...
(?=\s*(?:,\s*)?\d+\s+-\s|\Z) - a positive lookahead that requires either:
- \s*(?:,\s*)?\d+\s+-\s - zero or more whitespace characters, an optional comma with zero or more whitespace characters after it, one or more digits, one or more whitespace characters, a - and a whitespace character
- | - or
- \Z - the end of the string

Python demo:

import re

rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
texts = ["123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"]
for s in texts:
    print("--- {} ---".format(s))
    print(re.findall(rx, s, re.S))

Output:

--- 123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products ---
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
--- 321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes ---
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
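
To illustrate why the lookahead checks for the full "number, dash" prefix rather than for any digit, here is a small comparison on a made-up input whose category name contains a number. The input string and the names `strict` and `loose` are assumptions for illustration only; the second pattern is the simpler \d+\D+ variant with a lookahead for a comma followed by a digit.

```python
import re

# Hypothetical input: the first category name itself contains a number.
s = "12 - Top 10 Brands, 34 - Shoes"

# Pattern from this answer: an item ends only before "digits, whitespace, dash".
strict = re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)

# Simpler pattern that treats any digit as the start of a new item.
loose = re.findall(r"\d+\D+(?=,\s*\d|$)", s)

print(strict)
print(loose)
```

With this input the strict pattern keeps "12 - Top 10 Brands" as one item, while the simpler pattern does not return that item intact.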

edited Jul 26, 2018 at 8:45

answered Jul 26, 2018 at 8:39

Wiktor Stribiżew


2 Comments

Tapal Goosal Over a year ago

What is re.S?

Wiktor Stribiżew Over a year ago

@TapalGoosal It is the same as re.DOTALL, a modifier that makes . match line break chars, too. By default, a . does not match line breaks.
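
A minimal illustration of the difference, using a made-up two-line string (the names `text`, `without_flag` and `with_flag` are illustrative):

```python
import re

text = "1 - first line\nsecond line"

# Without re.S the dot stops at the line break.
without_flag = re.findall(r"\d+ - .*", text)

# With re.S (an alias of re.DOTALL) the dot also matches "\n".
with_flag = re.findall(r"\d+ - .*", text, re.S)

print(without_flag)
print(with_flag)
```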

jhole89 · Accepted Answer · 2018-07-26 08:53:16Z

2

Surely it is as simple as just splitting when you encounter a numeric?

import re

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
re.findall(r'\d+\D+', s)

['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

answered Jul 26, 2018 at 8:53

jhole89


2 Comments

Tapal Goosal Over a year ago

Yes, can you also briefly explain?

jhole89 Over a year ago

Sure. We know that each item always starts with digits (and that there are no other digits in the text we want to capture). \d means one digit, \d+ means one or more digits, and \D means one non-digit. So \d+\D+ means one or more digits followed by one or more non-digits.
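
The building blocks described above can be tried separately on a short made-up string (the sample string and variable names are illustrative only):

```python
import re

sample = "ab12cd345"

one_digit = re.findall(r"\d", sample)         # each digit on its own
digit_runs = re.findall(r"\d+", sample)       # runs of one or more digits
non_digit_runs = re.findall(r"\D+", sample)   # runs of one or more non-digits
combined = re.findall(r"\d+\D+", sample)      # a digit run followed by a non-digit run

print(one_digit, digit_runs, non_digit_runs, combined)
```

The trailing "345" is not part of `combined` because \D+ needs at least one non-digit after the digits.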
