53

Is it possible to skip a couple of characters in a capture group in regular expressions? I am using .NET regexes but that shouldn't matter.

Basically, what I am looking for is:

[random text]AB-123[random text]

and I need to capture 'AB123', without the hyphen.

I know that AB is 2 or 3 uppercase characters and 123 is 2 or 3 digits, but that's not the hard part. The hard part (at least for me) is skipping the hyphen.

I guess I could capture both separately and then concatenate them in code, but I wish I had a more elegant, regex-only solution.

Any suggestions?

3
  • 1
    In JavaScript you could use /(AB)-(123)/.exec("[random text]AB-123[random text]"); the returned array then holds the two parts at indexes [1] and [2]. Commented Mar 26, 2015 at 11:01
  • What about using positive lookahead (?=) and positive lookbehind (?<=)? Basically this: (?<=\')([A-Z]{2}-[0-9]{3})(?=\') should work. Commented Jun 1, 2015 at 7:26
  • ^ unfortunately, that captures the dash Commented Jul 21, 2023 at 5:24

6 Answers

57

In short: You can't. A match is always consecutive. Even when the pattern contains zero-width assertions, there is no way around matching the next character if you want to get to the one after it.
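
To illustrate the point above, here is a minimal sketch using Python's re module rather than .NET (the variable names are made up for illustration; the sample string is the one from the question). Whatever the pattern, the matched text is exactly the slice of the input between the match's start and end offsets, so the match itself cannot drop the hyphen.

```python
import re

text = "[random text]AB-123[random text]"  # sample input from the question

# The whole match is always text[start:end], i.e. one contiguous slice.
m = re.search(r"[A-Z]{2,3}-[0-9]{2,3}", text)
whole = m.group(0)
is_contiguous = whole == text[m.start():m.end()]

# A lookahead asserts what follows without consuming it, so the match
# stops before the hyphen instead of jumping over it.
before_hyphen = re.search(r"[A-Z]{2,3}(?=-[0-9]{2,3})", text).group(0)
```

Here whole still contains the hyphen, and the lookahead version only yields the letters in front of it.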

2 Comments

You can, using positive lookbehind and positive lookahead.
True. But lookaround does not match anything. The regex engine's position in the string does not change.
21

There really isn't a way to create an expression such that the matched text is different than what is found in the source text. You will need to remove the hyphen in a separate step either by matching the first and second parts individually and concatenating the two groups:

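// requires: using System.Text.RegularExpressions;
// Capture the letters and the digits separately, then join the two groups.
Match match = Regex.Match( text, "([A-Z]{2,3})-([0-9]{2,3})" );
string matchedText = string.Format( "{0}{1}", 
    match.Groups[1].Value, 
    match.Groups[2].Value );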

Or by removing the hyphen in a step separate from the matching process:

// Match the whole token, then strip the hyphen from the matched text.
Match match = Regex.Match( text, "[A-Z]{2,3}-[0-9]{2,3}" );
string matchedText = match.Value.Replace( "-", "" );

1 Comment

There's also match.Result("$1$2")
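
For comparison, a rough Python counterpart of the two ways of joining the groups (by hand, or through a replacement template). This is only a sketch with Python's re module, not .NET code; Match.expand plays a role similar to the Result call mentioned in the comment above.

```python
import re

text = "[random text]AB-123[random text]"

m = re.search(r"([A-Z]{2,3})-([0-9]{2,3})", text)

# Join the two groups by hand...
joined = m.group(1) + m.group(2)

# ...or let the match object fill in a template of backreferences.
expanded = m.expand(r"\1\2")
```

Both variables end up holding the text without the hyphen; the concatenation still happens outside the pattern itself.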
4

Your assertion that it's not possible to do this without sub-grouping and concatenating is correct.

You could also do as Jeff-Hillman suggests and merely strip out the bad character(s) after the fact.

It is important to note here, though, that you "don't use regex for everything".

Regex is designed to give less complicated solutions to non-trivial problems. You shouldn't reach for "oh, we'll use a regex" every time, and you shouldn't get into the habit of thinking every problem can be solved with a one-step regex.

When there is a viable trivial method that works, by all means, use it.

An alternative idea, if you need to process multiple matches in a body of text, is to look for your language's callback-based regex function, which passes each matched group to a function that can do in-line substitution (especially handy for regex replaces).

Not sure how it would work in .NET, but in PHP you would do something like this (not exact code):

  function strip_reverse( $m )
  {
     // $m is the array of matches; $m[0] is the whole matched text
     $a = preg_replace( "/-/", "", $m[0] );
     return strrev( $a );
  }
  $b = preg_replace_callback( "/(AB[-]?cde)/", 'strip_reverse', "Hello World AB-cde" );
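
The same callback idea, sketched in Python for comparison (the function name strip_hyphen and the sample text are made up for illustration). re.sub accepts a function as the replacement and calls it once per match; .NET offers something similar through the Regex.Replace overloads that take a MatchEvaluator.

```python
import re

def strip_hyphen(match):
    # Called once per match; the return value replaces the matched text.
    return match.group(0).replace("-", "")

text = "see AB-123 and XYZ-45"
result = re.sub(r"[A-Z]{2,3}-[0-9]{2,3}", strip_hyphen, text)
```

Only hyphens inside a matched token are removed; the rest of the text is left untouched.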

4 Comments

It is a common misunderstanding that regex is for "less complicated situations" only. Regex is immensely powerful and can solve really complex stuff. Regex is just not the right tool for things that are not regular. It's simple: There are things that work with regex, and there are those that don't.
Yes, but there's a prolific /overuse/ of regex in situations where the solution amounts to using a firearm to hole-punch paper. It'll work, but there are complications that don't exist in the simpler solution. The key is knowing when not to use regex ;)
Knowing when to use which tool is always the key. I would probably avoid using regex in a long loop when there was another way (say, "indexOf" plus a little math).
For those cases there is the "study regex" optimisation which makes a memory tree to boost regex matching ;)
4

You can use nested capture groups, like this:

((AB)-(123))

The first capture group is AB-123, the second is AB, and the third is 123. Then all you would have to do is concatenate the second and third groups.
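
A small sketch of how the nested groups are numbered, written in Python and with the literal AB and 123 swapped for the character classes from the question (that generalisation is an assumption, not part of the answer above):

```python
import re

text = "[random text]AB-123[random text]"
m = re.search(r"(([A-Z]{2,3})-([0-9]{2,3}))", text)

# Groups are numbered by the position of their opening parenthesis.
outer, letters, digits = m.group(1), m.group(2), m.group(3)
combined = letters + digits
```

The outer group still contains the hyphen, so the join of the two inner groups is what gives the wanted text.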

Comments

1

I am kind of new to this, but you could use the vertical bar symbol |, which acts as an OR.

This could work for .NET:

((?<=[A-Z]{2}-)\d\d\d)|([A-Z]{2}(?=-\d\d\d))

This works for me in a VIM syntax file:

\(\([A-Z]\{2}-\)\@<=\d\d\d\)\|\([A-Z]\{2}\(-\d\d\d\)\@=\)
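
To show what the alternation actually returns, here is a hedged Python sketch of the same pattern with the capturing parentheses dropped. Python only allows fixed-width lookbehind, which this pattern satisfies. The pattern produces two separate matches, one per alternative, which still have to be joined afterwards.

```python
import re

text = "[random text]AB-123[random text]"

# Digits preceded by "XX-", or two letters followed by "-ddd".
pattern = r"(?<=[A-Z]{2}-)\d\d\d|[A-Z]{2}(?=-\d\d\d)"
pieces = re.findall(pattern, text)
combined = "".join(pieces)
```

Joining all pieces like this only makes sense when the input contains a single such token; with several tokens the pieces would need to be paired up first.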

Comments

0

Kind of late, but I think I figured this one out. At least one way to do it.

I used positive lookahead to stop at the # sign in my text. I didn't want the space or the # sign, so I had to figure out a way to "skip" over them. So when I was forced to match them again, I dumped them into a garbage group that I didn't plan on using (i.e., a bit bucket), which in the code is <garb1>. Now my place pointer is one character position beyond the # sign (where I want to be, skipping the space and the # sign). And I now just match to the end of the file name at the . and ignore the file extension.

(?i)English\\(?<Series>[^ ]+) - (?<Title>.+(?= #))(?<garb1>..)(?<Number>[^.]+)(?-i)

The Filename this was used on is

F:\Downloads\Downloads\500 Comics CCC CBR English\Isukani - Great Girl #01.cbr
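
A sketch of the same pattern ported to Python, to make visible what each named group ends up holding for the file name above. Python spells named groups (?P<name>...) and takes the case-insensitive flag as an argument, so the pattern differs slightly from the .NET original; the fields dictionary is just an illustrative way to collect the results.

```python
import re

# Python spells named groups (?P<name>...) where .NET uses (?<name>...).
pattern = re.compile(
    r"English\\(?P<Series>[^ ]+) - (?P<Title>.+(?= #))(?P<garb1>..)(?P<Number>[^.]+)",
    re.IGNORECASE,
)

path = r"F:\Downloads\Downloads\500 Comics CCC CBR English\Isukani - Great Girl #01.cbr"
m = pattern.search(path)

# Collect every named group, including the throwaway one.
fields = {name: m.group(name) for name in ("Series", "Title", "garb1", "Number")}
```

The space and the # sign are still matched; they simply land in the garb1 group, and Title and Number remain two separate captures.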

1 Comment

I feel like this one is returning 2 DIFFERENT captures: $Match.Title and $Match.Number, rather than just skipping the undesired character.
