6

I have long strings such as

"123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

and

"321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"

I want to split them based on the pattern "a number, a space, a dash, a space, some string until the next number, a space, a dash, a space or end of string". Notice that the string may contain commas, ampersands, '>' and other special characters, so splitting on them will not work. I think there is a way in Python to split based on regular expressions but I have trouble forming that.

I have a very introductory knowledge of regular expressions. I can form a regex for numbers, as well as for alphanumeric strings, but I don't know how to specify "take everything until the next number starts".


Update: Expected output:

first case:

["123 - Footwear", "5678 - Apparel, Accessories & Luxury Goods", "9876 - Leisure Products"]

second case:

["321 - Apparel & Accessories", "4321 - Apparel & Accessories > Handbags, Wallets & Cases", "187 - Apparel & Accessories > Shoes"]

4
  • 1
    I suggest checking out the excellent website regular-expressions.info where you can learn much, much more about them and answer your own question. Commented Jul 26, 2018 at 8:31
  • Please add an expected result Commented Jul 26, 2018 at 8:33
  • @jhole89 did that. Commented Jul 26, 2018 at 8:41
  • @TapalGoosal Try my solution. If your categories can contain digits, you can't rely on whitelisting or \D+. Commented Jul 26, 2018 at 8:48

4 Answers

7

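Here is the pattern: each item starts with a number, so we use [0-9]+, followed by letters, spaces and special characters such as &, -, > and the comma, so we can use [a-zA-Z \-&>,]+. Note that every match except the last keeps its trailing ", " separator: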

>>> import re
>>> str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

The other string mentioned in the question:

>>> str_ = "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"
>>> re.findall(r'(?is)([0-9]+[a-zA-Z \-&>,]+)', str_)
['321 - Apparel & Accessories, ', 
 '4321 - Apparel & Accessories > Handbags, Wallets & Cases, ', 
 '187 - Apparel & Accessories > Shoes']
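
If the trailing ", " is unwanted, one possible follow-up is to strip it from each match after the fact. This is only a sketch: the helper name split_categories is made up for illustration, and it still assumes that category names contain no digits and only the characters listed in the class.

```python
import re

def split_categories(text):
    """Return the 'number - name' segments without the trailing ', '."""
    # Same character class as above: digits first, then letters, spaces,
    # '-', '&', '>' and ','.
    matches = re.findall(r'[0-9]+[a-zA-Z \-&>,]+', text)
    # Every match except the last ends with ', '; strip commas and spaces
    # from the right-hand side of each one.
    return [m.rstrip(', ') for m in matches]

str_ = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
print(split_categories(str_))
```

Commas inside a category name are untouched, because rstrip only removes characters at the very end of each match.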

4 Comments

Shouldn't you add a ',' to your regex when matching special characters, such as (?is)([0-9]+[a-zA-Z \-&>,]+)? That way you'll match items that come after a comma in the strings given in the question.
The strings may contain commas.
For the first example I would like the second element in the list to be "5678 - Apparel, Accessories & Luxury Goods".
Your second string demo results still include commas.
3

If numbers appear only at the beginning of each segment of strings, you can do:

import re
for s in "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes":
    print(re.findall(r'\d+\D+(?=,\s*\d|$)', s))

This outputs:

['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']

This pattern uses \d+ to match the leading number and \D+ to match the non-digit characters that follow. The lookahead (?=,\s*\d|$) makes the \D+ part stop where it is followed either by a comma, optional whitespace and another digit, or by the end of the string, so the match does not include the trailing comma and space.
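
Since the question asks about splitting, the same lookahead idea can probably be used with re.split as well. The following is a sketch rather than a drop-in solution: it assumes every segment starts with "number, whitespace, dash, whitespace", and it splits on the comma that comes right before such a prefix.

```python
import re

s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"

# Split on a comma plus optional whitespace, but only where a new
# "number - " prefix follows. The lookahead consumes nothing, so the
# number stays attached to the segment it starts.
parts = re.split(r',\s*(?=\d+\s+-\s)', s)
print(parts)
```

Commas that are followed by ordinary text, such as the one in "Apparel, Accessories", are left alone because the lookahead does not match there.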

3 Comments

Note that \s* at the end makes no sense since \D+ already consumes all those whitespaces.
Yes, can you also briefly explain?
@TapalGoosal Just added some explanations to my answer.
2

You may match substrings that start with one or more digits followed by 1+ whitespaces, a -, and 1+ whitespaces, and that end right before the next occurrence of the same pattern or at the end of the string.

re.findall(r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)", s, re.S)

Note: If the leading number always has at least two digits, you may replace \d+ with \d{2,}, and so on. Adjust as you see fit.

Pattern details:

  • \d+ - 1+ digits
  • \s+-\s+ - a - enclosed with 1+ whitespaces
  • .*? - any 0+ chars, as few as possible, up to the location in string that is followed with...
  • (?=\s*(?:,\s*)?\d+\s+-\s|\Z) - (a positive lookahead):
    • \s*(?:,\s*)?\d+\s+-\s - 0+ whitespaces, an optional substring of a comma and 0+ whitespaces after it, 1+ digits, 1+ whitespaces, - and a whitespace
    • | - or
    • \Z - end of string

Python demo:

import re

rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"
texts = ["123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products", "321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes"]
for s in texts:
    print("--- {} ---".format(s))
    print(re.findall(rx, s, re.S))

Output:

--- 123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products ---
['123 - Footwear', '5678 - Apparel, Accessories & Luxury Goods', '9876 - Leisure Products']
--- 321 - Apparel & Accessories, 4321 - Apparel & Accessories > Handbags, Wallets & Cases, 187 - Apparel & Accessories > Shoes ---
['321 - Apparel & Accessories', '4321 - Apparel & Accessories > Handbags, Wallets & Cases', '187 - Apparel & Accessories > Shoes']
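
To illustrate why the lookahead checks for the full "number - " prefix rather than just a digit, here is a small sketch with a made-up input whose first category name contains a number. The sample string and variable names are assumptions for illustration only; the second pattern is a simpler digits-then-non-digits variant shown for comparison.

```python
import re

rx = r"\d+\s+-\s+.*?(?=\s*(?:,\s*)?\d+\s+-\s|\Z)"

# Made-up input: the first category name contains the number 10.
sample = "12 - Top 10 Shoes, 34 - Bags"

# The lookahead only stops at digits that are followed by " - ".
with_prefix_lookahead = re.findall(rx, sample, re.S)

# A pattern that treats any digit as the start of a new item.
digits_then_non_digits = re.findall(r"\d+\D+(?=,\s*\d|$)", sample)

print(with_prefix_lookahead)
print(digits_then_non_digits)
```

The first call keeps "12 - Top 10 Shoes" in one piece, while the simpler pattern loses the "12 - Top" part because it cannot continue past the digits in "10".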

2 Comments

What is re.S?
@TapalGoosal It is the same as re.DOTALL, a modifier that makes . match line break chars, too. By default, a . does not match line breaks.
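
A minimal sketch of the difference, using a made-up two-line string:

```python
import re

text = "a\nb"  # made-up input with a line break between 'a' and 'b'

# Without the flag, '.' refuses to match the newline, so nothing is found.
default_matches = re.findall(r"a.b", text)

# With re.S (alias of re.DOTALL), '.' matches the newline as well.
dotall_matches = re.findall(r"a.b", text, re.S)

print(default_matches)
print(dotall_matches)
print(re.S == re.DOTALL)
```
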
2

Surely it is as simple as just splitting when you encounter a numeric?

import re
s = "123 - Footwear, 5678 - Apparel, Accessories & Luxury Goods, 9876 - Leisure Products"
re.findall(r'\d+\D+', s) 

['123 - Footwear, ',
 '5678 - Apparel, Accessories & Luxury Goods, ',
 '9876 - Leisure Products']

2 Comments

Yes, can you also briefly explain?
Sure. We know that each item always starts with a number (and that there are no other digits in the text we want to capture). \d matches one digit, \d+ matches one or more digits, and \D matches one non-digit. So \d+\D+ matches one or more digits followed by one or more non-digits.
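
The individual tokens can be tried out one at a time. This is a small sketch on a made-up string; the variable names are only for illustration.

```python
import re

sample = "42 - Shoes, 7 - Hats"  # made-up input

one_digit_each = re.findall(r'\d', sample)    # every single digit
digit_runs = re.findall(r'\d+', sample)       # runs of one or more digits
non_digit_runs = re.findall(r'\D+', sample)   # runs of one or more non-digits
combined = re.findall(r'\d+\D+', sample)      # a digit run plus the non-digits after it

print(one_digit_each)
print(digit_runs)
print(non_digit_runs)
print(combined)
```

The combined pattern keeps the ", " separator at the end of every item but the last, which is why the output above shows trailing commas.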
