1

I am trying to create a regex for an ID with the following rules:

  1. Starts with A-Z, one or more times. (Main ID, mi)
  2. Followed with an optional dash. (delimiter)
  3. Followed with 0-9, one or more times. (Sub ID, si)
  4. Followed with an optional dash or dot. (delimiter)
  5. Followed with an optional a-z or 0-9, one or more times. (Main category, mc)
  6. Followed with an optional dash or dot. (delimiter)
  7. Followed with an optional a-z or 0-9, one or more times. (Sub category, sc)

The delimiters can be omitted if the ID alternates between alpha and numeric parts (A-01a1, A1.a.1). Delimiters are required if adjacent parts are both alpha or both numeric (A-1.1a, A1.2.3, A1a.a).

Here is what I have:

(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[\-\.]?(?P<mc>[a-z0-9])*[\-\.]?(?P<sc>[a-z0-9])*

Here is the result when I tried it:

ID      mi  si  mc  sc
A1      A   1
A001    A   001
AB-01   AB  01
A1aa    A   1   a      <<<<< mc=aa
A-01a1  A   01  1      <<<<< mc=a sc=1
A-1.1a  A   1   a      <<<<< mc=1 sc=a
A1.a1   A   1   1      <<<<< mc=a sc=1
A1.a.1  A   1   a   1
A1.2.3  A   1   2   3
A1a.a   A   1   a   a
4
  • Which language are you using? Regex facilities are different in different languages (though this definitely looks like a PCRE variant). Commented Jun 29, 2016 at 5:00
  • I just use RegEx101.com Commented Jun 29, 2016 at 6:28
  • It wants to know too. Out of the box the PCRE variant seems to be selected, but you can click along the left edge to make it use JS or Python dialects. There are obviously many more which it doesn't support. Commented Jun 29, 2016 at 7:46
  • PCRE (the default, I guess). I did not know that it has JS and Python variants on the left side. I just open that site for testing. Commented Jul 19, 2016 at 2:00

3 Answers

2

Description

^(?P<MainID>[a-z]+)-?(?<SubID>[0-9]+)(?:[-.]?(?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))?(?:[-.]?(?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))?

The regex, used with the case-insensitive and multiline flags, does the following:

  • Starts with A-Z, one or more times. (Main ID, mi)
  • Followed with an optional dash. (delimiter)
  • Followed with 0-9, one or more times. (Sub ID, si)
  • Followed with an optional dash or dot. (delimiter)
  • Followed with an optional a-z or 0-9, one or more times. (Main category, mc)
  • Followed with an optional dash or dot. (delimiter)
  • Followed with an optional a-z or 0-9, one or more times. (Sub category, sc)

  • If a category is preceded by a delimiter and followed by a delimiter or whitespace, it may mix letters and digits within the same capture group

  • If it is not bounded that way, the capture group takes either letters only or digits only
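
As a rough check of the behaviour listed above, here is a minimal Python sketch. It assumes the case-insensitive and multiline flags, and it spells the SubID group as (?P<SubID>...) because Python's re module does not accept (?<name>...). CATEGORY, ID_PATTERN and parse_ids are names made up for this sketch, not part of the answer.

```python
import re

# One category: mixed letters/digits only when bounded by delimiters
# (or trailing whitespace), otherwise letters only or digits only.
# CATEGORY is a helper name introduced for this sketch.
CATEGORY = r"(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+"

ID_PATTERN = re.compile(
    r"^(?P<MainID>[a-z]+)-?(?P<SubID>[0-9]+)"
    r"(?:[-.]?(?P<MainCategory>" + CATEGORY + r"))?"
    r"(?:[-.]?(?P<SubCategory>" + CATEGORY + r"))?",
    re.IGNORECASE | re.MULTILINE,
)


def parse_ids(text):
    """Return one (MainID, SubID, MainCategory, SubCategory) tuple per match."""
    return [
        m.group("MainID", "SubID", "MainCategory", "SubCategory")
        for m in ID_PATTERN.finditer(text)
    ]


# Each ID sits on its own line, terminated by a newline.
sample = "A1\nA1aa\nA-01a1\nA-1.1a\nA1.a.1\n"
for row in parse_ids(sample):
    print(row)
```

Groups that do not take part in a match come back as None. Note that A-1.1a yields a main category of 1a here, because the part after the dot is followed by a newline and so the mixed alternative applies.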

Example

Live Demo

https://regex101.com/r/uH7zF3/1

Sample text

ID      mi  si  mc  sc
A1      A   1
A001    A   001
AB-01   AB  01
A1aa    A   1   a      <<<<< mc=aa
A-01a1  A   01  1      <<<<< mc=a sc=1
A-1.1a  A   1   a      <<<<< mc=1 sc=a
A1.a1   A   1   1      <<<<< mc=a sc=1
A1.a.1  A   1   a   1
A1.2.3  A   1   2   3
A1a.a   A   1   a   a

Sample Matches

MATCH 1
MainID  [24-25] `A`
SubID   [25-26] `1`

MATCH 2
MainID  [38-39] `A`
SubID   [39-42] `001`

MATCH 3
MainID  [54-56] `AB`
SubID   [57-59] `01`

MATCH 4
MainID  [69-70] `A`
SubID   [70-71] `1`
MainCategory    [71-73] `aa`

MATCH 5
MainID  [104-105]   `A`
SubID   [106-108]   `01`
MainCategory    [108-109]   `a`
SubCategory [109-110]   `1`

MATCH 6
MainID  [143-144]   `A`
SubID   [145-146]   `1`
MainCategory    [147-149]   `1a`

MATCH 7
MainID  [182-183]   `A`
SubID   [183-184]   `1`
MainCategory    [185-187]   `a1`

MATCH 8
MainID  [221-222]   `A`
SubID   [222-223]   `1`
MainCategory    [224-225]   `a`
SubCategory [226-227]   `1`

MATCH 9
MainID  [243-244]   `A`
SubID   [244-245]   `1`
MainCategory    [246-247]   `2`
SubCategory [248-249]   `3`

MATCH 10
MainID  [265-266]   `A`
SubID   [266-267]   `1`
MainCategory    [267-268]   `a`
SubCategory [269-270]   `a`

Explanation

^ assert position at start of a line
(?P<MainID>[a-z]+) Named capturing group MainID
[a-z]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
-? matches the character - literally
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
(?<SubID>[0-9]+) Named capturing group SubID
[0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
(?:[-.]?(?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))? Non-capturing group
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[-.]? match a single character present in the list below
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
-. a single character in the list -. literally
(?P<MainCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+) Named capturing group MainCategory
1st Alternative: (?<=[-.])[a-z0-9]+(?=[-.\s])
(?<=[-.]) Positive Lookbehind - Assert that the regex below can be matched
[-.] match a single character present in the list below
-. a single character in the list -. literally
[a-z0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
0-9 a single character in the range between 0 and 9
(?=[-.\s]) Positive Lookahead - Assert that the regex below can be matched
[-.\s] match a single character present in the list below
-. a single character in the list -. literally
\s match any white space character [\r\n\t\f ]
2nd Alternative: [a-z]+
[a-z]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
3rd Alternative: [0-9]+
[0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
(?:[-.]?(?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+))? Non-capturing group
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[-.]? match a single character present in the list below
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
-. a single character in the list -. literally
(?P<SubCategory>(?<=[-.])[a-z0-9]+(?=[-.\s])|[a-z]+|[0-9]+) Named capturing group SubCategory
1st Alternative: (?<=[-.])[a-z0-9]+(?=[-.\s])
(?<=[-.]) Positive Lookbehind - Assert that the regex below can be matched
[-.] match a single character present in the list below
-. a single character in the list -. literally
[a-z0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
0-9 a single character in the range between 0 and 9
(?=[-.\s]) Positive Lookahead - Assert that the regex below can be matched
[-.\s] match a single character present in the list below
-. a single character in the list -. literally
\s match any white space character [\r\n\t\f ]
2nd Alternative: [a-z]+
[a-z]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
3rd Alternative: [0-9]+
[0-9]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9

2 Comments

It seems to fail only when the ID is in a list, not when it is tested by itself.
For A-1-aa1, should the aa1 part be the main category? In this regex101.com/r/uH7zF3/3 I modified the (?=[-.\s]) lookaheads to also test for the end of a string, so now they look like (?=[-.\s]|$).
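
To see what that lookahead change does to an ID that stands alone with no trailing whitespace, here is a small Python sketch. It assumes the same case-insensitive and multiline flags and uses Python's (?P<name>...) group syntax; build_pattern and categories are hypothetical helper names.

```python
import re


def build_pattern(lookahead):
    """Build the ID regex with the given lookahead after a mixed category."""
    category = r"(?<=[-.])[a-z0-9]+" + lookahead + r"|[a-z]+|[0-9]+"
    return re.compile(
        r"^(?P<MainID>[a-z]+)-?(?P<SubID>[0-9]+)"
        r"(?:[-.]?(?P<MainCategory>" + category + r"))?"
        r"(?:[-.]?(?P<SubCategory>" + category + r"))?",
        re.IGNORECASE | re.MULTILINE,
    )


original = build_pattern(r"(?=[-.\s])")     # delimiter or whitespace must follow
with_end = build_pattern(r"(?=[-.\s]|$)")   # end of line is accepted as well


def categories(pattern, text):
    """Return (MainCategory, SubCategory) for a single ID."""
    return pattern.match(text).group("MainCategory", "SubCategory")


print(categories(original, "A-1-aa1"))  # ('aa', '1')
print(categories(with_end, "A-1-aa1"))  # ('aa1', None)
```

With the original lookahead, a lone A-1-aa1 falls back to the letters-only and digits-only alternatives and splits into aa and 1. With the end-of-line alternative, aa1 stays together as the main category.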
0

I would use this one:

(?<mi>[A-Z]+)-?(?<si>[0-9]+)[-.]?(?<mc>[a-z0-9]*)[-.]?(?<sc>[a-z0-9]*)

0

The * quantifiers in your expression should be moved inside your capture groups. When a quantifier sits outside a group, the group keeps only the last repetition.

Also, you can remove the backslashes inside the character classes.

(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[\-\.]?(?P<mc>[a-z0-9])*[\-\.]?(?P<sc>[a-z0-9])*
                               ^ ^                   ^ ^ ^                   ^ 

Should look like:

(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[-.]?(?P<mc>[a-z0-9]*)[-.]?(?P<sc>[a-z0-9]*)
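
The difference between the two quantifier placements can be checked with a short Python sketch. The names outside, inside and fields are made up for this illustration.

```python
import re

# Quantifier outside the group: each repetition overwrites the capture,
# so only the last matched character is kept.
outside = re.compile(
    r"(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[-.]?(?P<mc>[a-z0-9])*[-.]?(?P<sc>[a-z0-9])*"
)

# Quantifier inside the group: the whole run is captured.
inside = re.compile(
    r"(?P<mi>[A-Z]+)-?(?P<si>[0-9]+)[-.]?(?P<mc>[a-z0-9]*)[-.]?(?P<sc>[a-z0-9]*)"
)


def fields(pattern, text):
    """Return (mi, si, mc, sc) for a single ID."""
    return pattern.match(text).group("mi", "si", "mc", "sc")


print(fields(outside, "A1aa"))    # ('A', '1', 'a', None)
print(fields(inside, "A1aa"))     # ('A', '1', 'aa', '')
print(fields(inside, "A-01a1"))   # ('A', '01', 'a1', '')
```

Moving the quantifier fixes A1aa, but mc is still greedy over both letters and digits, so A-01a1 ends up with mc=a1 and an empty sc rather than mc=a, sc=1.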

1 Comment

fails at A-01a1, A-1-aa1
