Regular expression to skip character in capture group

Question

Is it possible to skip a couple of characters in a capture group in regular expressions? I am using .NET regexes but that shouldn't matter.

Basically, what I am looking for is:

[random text]AB-123[random text]

and I need to capture 'AB123', without the hyphen.

I know that AB is 2 or 3 uppercase characters and 123 is 2 or 3 digits, but that's not the hard part. The hard part (at least for me) is skipping the hyphen.

I guess I could capture both separately and then concatenate them in code, but I wish I had a more elegant, regex-only solution.

Any suggestions?

In JavaScript you could use /(AB)-(123)/.exec("[random text]AB-123[random text]"); it returns an array with the two parts at indexes [1] and [2].
– hanshenrik, Commented Mar 26, 2015 at 11:01
What about using positive lookahead (?=) and positive lookbehind (?<=)? Basically this: (?<=\')([A-Z]{2}-[0-9]{3})(?=\') should work.
– It's me ... Alex, Commented Jun 1, 2015 at 7:26

Tomalak · Accepted Answer · 2008-11-10 10:38:49Z

57

In short: you can't. A match is always consecutive. Even when it contains things like zero-width assertions, there is no way around matching the next character if you want to get to the one after it.
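A minimal sketch of this point, written in Python rather than .NET purely for illustration (the sample string and variable names are assumptions, not part of the question). It shows that the matched text is always one contiguous slice of the input, and that a lookahead only tests a position without consuming anything, so a single group cannot step over the hyphen.

```python
import re

text = "[random text]AB-123[random text]"

# The whole match is always one contiguous slice of the input.
m = re.search(r"[A-Z]{2,3}-[0-9]{2,3}", text)
whole = m.group(0)
is_contiguous = whole == text[m.start():m.end()]

# A lookahead is zero-width: it checks for "-123" but consumes nothing,
# so the capture group stops right before the hyphen.
m2 = re.search(r"([A-Z]{2,3})(?=-[0-9]{2,3})", text)
before_hyphen = m2.group(1)
next_char = text[m2.end()]  # the engine is still positioned at the hyphen

print(whole, is_contiguous, before_hyphen, next_char)
```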

answered Nov 10, 2008 at 10:38

Tomalak

2 Comments

It's me ... Alex Over a year ago

You can, using positive lookbehind and positive lookahead.

Tomalak Over a year ago

True. But lookaround does not match anything. The regex engine's position in the string does not change.

Jeff Hillman · Accepted Answer · 2008-11-10 10:45:50Z

21

There really isn't a way to create an expression such that the matched text differs from what is found in the source text. You will need to remove the hyphen in a separate step, either by matching the first and second parts individually and concatenating the two groups:

// Capture the letters and the digits as separate groups, then join them.
Match match = Regex.Match( text, "([A-Z]{2,3})-([0-9]{2,3})" );
string matchedText = string.Format( "{0}{1}",
    match.Groups[1].Value,
    match.Groups[2].Value );

Or by removing the hyphen in a step separate from the matching process:

// Match the whole token, then strip the hyphen from the matched text.
Match match = Regex.Match( text, "[A-Z]{2,3}-[0-9]{2,3}" );
string matchedText = match.Value.Replace( "-", "" );
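For anyone who needs this for every occurrence in a text rather than just the first, here is a rough sketch of the group-and-concatenate approach in Python (not .NET). The helper name extract_codes and the second sample code XYZ-45 are made up for illustration.

```python
import re

# Pattern from the question: 2-3 uppercase letters, a hyphen, 2-3 digits.
CODE = re.compile(r"([A-Z]{2,3})-([0-9]{2,3})")

def extract_codes(text):
    """Return every code found in text with the hyphen dropped."""
    return [m.group(1) + m.group(2) for m in CODE.finditer(text)]

codes = extract_codes("[random text]AB-123[random text]XYZ-45[more]")
print(codes)
```

The regex still matches the hyphen; it is only left out when the two groups are joined in code.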

answered Nov 10, 2008 at 10:45

Jeff Hillman

1 Comment

Alan Moore Over a year ago

There's also match.Result("$1$2")
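Other regex libraries have a comparable template-expansion call. As a hedged illustration in Python (not the .NET API mentioned above), Match.expand fills group references into a template string:

```python
import re

m = re.search(r"([A-Z]{2,3})-([0-9]{2,3})", "[random text]AB-123[random text]")

# expand() substitutes the captured groups into the template;
# \g<1> and \g<2> refer to the first and second group.
result = m.expand(r"\g<1>\g<2>")
print(result)
```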

Kent Fredric · Accepted Answer · 2008-11-10 10:58:36Z

4

Your assertion that it's not possible without sub-grouping and concatenating is correct.

You could also do as Jeff Hillman suggests and simply strip out the unwanted character(s) after the fact.

It's important to note here, though, that you "don't use regex for everything".

Regex is designed to give less complicated solutions to non-trivial problems. You shouldn't reach for "oh, we'll use a regex" every time, and you shouldn't get into the habit of thinking you can solve every problem with a one-step regex.

When there is a viable trivial method that works, by all means, use it.

An alternative idea, if you need to handle multiple matches in a body of code, is to look for your language's callback-based regex replace, which passes each match to a function that can do the substitution in-line. (This is especially handy for regex replaces.)

Not sure how it would work in .NET, but in PHP you would do something like this (not exact code):

  function strip_reverse( $matches )
  {
     // The callback receives an array; $matches[0] is the full match.
     $a = preg_replace( "/-/", "", $matches[0] );
     return strrev( $a );
  }
  $b = preg_replace_callback( "/(AB[-]?cde)/", 'strip_reverse', "Hello World AB-cde" );
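The same callback idea, sketched in Python for comparison (an illustration only; the function name strip_hyphen and the sample sentence are assumptions). re.sub accepts a function that is called once per match and whose return value replaces the matched text:

```python
import re

def strip_hyphen(match):
    # Called once per match; the return value replaces the matched text.
    return match.group(0).replace("-", "")

text = "first AB-123, then XYZ-45"
cleaned = re.sub(r"[A-Z]{2,3}-[0-9]{2,3}", strip_hyphen, text)
print(cleaned)
```

This keeps the surrounding text intact and only rewrites the matched codes.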

answered Nov 10, 2008 at 10:58

Kent Fredric

4 Comments

Tomalak Over a year ago

It is a common misunderstanding that regex is for "less complicated situations" only. Regex is immensely powerful and can solve really complex stuff. Regex is just not the right tool for things that are not regular. It's simple: There are things that work with regex, and there are those that don't.

Kent Fredric Over a year ago

Yes, but there's a prolific /overuse/ of regex in situations where the solution amounts to using a firearm to hole-punch paper. It'll work, but there are complications that don't exist in the simpler solution. The key is knowing when not to use regex ;)

Tomalak Over a year ago

Knowing when to use which tool is always the key. I would probably avoid using regex in a long loop when there was another way (say, "indexOf" plus a little math).

Kent Fredric Over a year ago

For those cases there is the "study regex" optimisation which makes a memory tree to boost regex matching ;)

Alan Moore · Accepted Answer · 2015-11-21 19:09:45Z

4

You can use nested capture groups, like this:

((AB)-(123))

The first capture group is AB-123, the second is AB, and the third is 123. Then all you have to do is concatenate the second and third groups.
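A small sketch of how the nested groups are numbered, in Python for illustration (the character classes from the question are substituted for the literal AB and 123, which is an assumption). Groups are numbered by the position of their opening parenthesis, so the outer group comes first:

```python
import re

m = re.search(r"(([A-Z]{2,3})-([0-9]{2,3}))", "[random text]AB-123[random text]")

# Group 1 is the outer group; groups 2 and 3 are the nested parts.
full, letters, digits = m.group(1), m.group(2), m.group(3)
combined = letters + digits
print(full, letters, digits, combined)
```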

edited Nov 21, 2015 at 19:09

Alan Moore

answered Nov 21, 2015 at 17:49

Steve

rky · Accepted Answer · 2020-06-21 07:32:22Z

1

I am kind of new to this, but you could use the vertical bar symbol |, which acts as an OR.

This could work for .NET:

((?<=[A-Z]{2}-)\d\d\d)|([A-Z]{2}(?=-\d\d\d))

This works for me in a VIM syntax file:

\(\([A-Z]\{2}-\)\@<=\d\d\d\)\|\([A-Z]\{2}\(-\d\d\d\)\@=\)
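To see what the alternation actually returns, here is a hedged check in Python (the pattern is copied as-is; the sample string is the one from the question). It produces two separate matches, one on each side of the hyphen, so the pieces still have to be joined in code:

```python
import re

pattern = re.compile(r"((?<=[A-Z]{2}-)\d\d\d)|([A-Z]{2}(?=-\d\d\d))")
text = "[random text]AB-123[random text]"

# Each side of the hyphen comes back as its own match.
pieces = [m.group(0) for m in pattern.finditer(text)]
joined = "".join(pieces)
print(pieces, joined)
```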

answered Jun 21, 2020 at 7:32

rky

user8395964 · Accepted Answer · 2024-01-02 10:16:34Z

0

Kind of late, but I think I figured this one out. At least one way to do it.

I used positive lookahead to stop at the # sign in my text. I didn't want the space or the # sign, so I had to figure out a way to "skip" over them. So when I was forced to match them again, I dumped them into a garbage group that I didn't plan on using (i.e., a bit bucket), which in the code is <garb1>. Now my place pointer is one character position beyond the # sign (where I want to be, skipping the space and the # sign), and I just match to the end of the file name at the . and ignore the file extension.

(?i)English\\(?<Series>[^ ]+) - (?<Title>.+(?= #))(?<garb1>..)(?<Number>[^.]+)(?-i)

The Filename this was used on is

F:\Downloads\Downloads\500 Comics CCC CBR English\Isukani - Great Girl #01.cbr
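A rough Python adaptation of the pattern above, run against that file name. Two changes are assumptions forced by Python's syntax: named groups are written (?P<name>...), and the inline (?i) ... (?-i) pair is replaced by the re.IGNORECASE flag.

```python
import re

# Same structure as the .NET pattern: the throwaway group garb1
# swallows the space and the # sign that the lookahead stopped at.
pattern = re.compile(
    r"English\\(?P<Series>[^ ]+) - (?P<Title>.+(?= #))(?P<garb1>..)(?P<Number>[^.]+)",
    re.IGNORECASE,
)

path = r"F:\Downloads\Downloads\500 Comics CCC CBR English\Isukani - Great Girl #01.cbr"
m = pattern.search(path)
fields = (m.group("Series"), m.group("Title"), m.group("Number"))
print(fields, repr(m.group("garb1")))
```

The unwanted characters are still matched; they just land in a group that is ignored afterwards, and the wanted parts come back as separate captures.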

edited Jan 2, 2024 at 10:16

user8395964

answered Jan 27, 2018 at 8:03

Logan9773

1 Comment

Hicsy Over a year ago

I feel like this one is returning 2 DIFFERENT captures: $Match.Title and $Match.Number, rather than just skipping the undesired character.
