2

I have some very ugly data I am trying to massage. It consists of SKUs, and I want to group them into product lines. E.g.:

PRODUCT_ID
----------
313L30WHITE
313L40WHITE
313L30BLACK
3333
2L10RED
2L20BLACK
32341/30/BLK

Basically, I want to group items by the leading numeric characters in the PRODUCT_ID field, i.e., all the characters up to the first non-numeric character. E.g.:

PRODUCT_ID    GROUP
----------    -----
313L30WHITE   313
313L40WHITE   313
313L30BLACK   313
3333          3333
2L10RED       2
2L20BLACK     2
32341/30/BLK  32341

Seems like a SQL solution would not be elegant. Because of that, I would prefer a Python solution that creates a new table with a new GROUP column.

Anyone have any suggestions?

  • Solving this in MySQL is rather unclean in my opinion, but a question about the more general issue of extracting numbers from strings in a MySQL query was asked here: stackoverflow.com/questions/978147/… Commented Apr 2, 2012 at 0:27
  • After looking at that answer, I agree that the SQL solution would not be elegant. Modified question. Commented Apr 2, 2012 at 0:30

2 Answers

3

If you know that PRODUCT_ID will always start with one or more numeric characters, then you can just convert it to a number by adding 0:

-- GROUP is a reserved word in MySQL, so the alias needs backticks
select PRODUCT_ID,
       0 + PRODUCT_ID as `GROUP`
  from ...

See §11.2 "Type Conversion in Expression Evaluation" in the MySQL 5.6 Reference Manual.

If you want GROUP to be textual rather than numeric, then you can write:

-- GROUP is a reserved word in MySQL, so the alias needs backticks
select PRODUCT_ID,
       concat(0 + PRODUCT_ID) as `GROUP`
  from ...

1 Comment

Hermes: "Sweet type coercion of Jupiter!" +1 for completely unintuitive but efficient solution.
2

This is a perfect place for a regex:

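import re

# Matches a run of one or more digits
RE = re.compile(r'\d+')
# Set up the list of SKUs
...
# Sort numerically by the digits at the start of each SKU.
# Assumes every SKU starts with a digit; otherwise RE.match() returns None.
List_of_SKUs.sort(key=lambda x: int(RE.match(x).group()))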

Now your list is sorted.

Because RE.match only matches at the beginning of the string and \d+ is greedy, the regex pulls off the longest run of digits at the start of each SKU. The lambda function converts that run to an integer, which is used as the sort key.
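As a quick check of that behaviour, here is a minimal sketch run against the sample IDs from the question. The helper name leading_digits is made up for illustration, and returning None for an ID with no leading digits is only an assumption about how such rows might be handled.

```python
import re

LEADING_DIGITS = re.compile(r'\d+')

def leading_digits(sku):
    """Return the run of digits at the start of sku, or None if there is none.

    Hypothetical helper, not part of the original answer.
    """
    m = LEADING_DIGITS.match(sku)  # match() only looks at the start of the string
    return m.group() if m else None

# Sample PRODUCT_ID values taken from the question
skus = ['313L30WHITE', '313L40WHITE', '313L30BLACK', '3333',
        '2L10RED', '2L20BLACK', '32341/30/BLK']

# Sort numerically by the leading number (the sort is stable within a group)
skus.sort(key=lambda s: int(leading_digits(s)))
print(skus)
```

Sorting on the integer value rather than the string puts 2 before 313, which a plain string sort would not do.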

EDIT

From there, if you want to print the table, you could do something like:

for item in List_of_SKUs:
    # Print the SKU and its leading number, separated by a tab
    print("%s\t%s" % (item, RE.match(item).group()))

Although there is probably a more efficient way of doing this.
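If the goal is the two-column table from the question, one possible sketch is to collect (PRODUCT_ID, GROUP) pairs and also index them by group in a dictionary. The names rows and by_group are illustrative, the sample list stands in for however the SKUs are actually loaded, and skipping IDs without a leading number is an assumption.

```python
import re
from collections import defaultdict

RE = re.compile(r'\d+')

# Sample data from the question; in practice this would be read from the table.
list_of_skus = ['313L30WHITE', '313L40WHITE', '313L30BLACK', '3333',
                '2L10RED', '2L20BLACK', '32341/30/BLK']

rows = []                     # (PRODUCT_ID, GROUP) pairs, in input order
by_group = defaultdict(list)  # GROUP -> list of PRODUCT_IDs

for sku in list_of_skus:
    m = RE.match(sku)
    if m is None:
        continue  # assumption: ignore IDs that do not start with a digit
    group = m.group()
    rows.append((sku, group))
    by_group[group].append(sku)

# Print the table with a left-aligned PRODUCT_ID column
for product_id, group in rows:
    print("%-13s %s" % (product_id, group))
```

The rows list could then be written back to a new table with whatever database library is already in use.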
