The following code...

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var r = new Regex("(.*)");
        var c = "XYZ";
        var uc = r.Replace(c, "A $1 B");

        Console.WriteLine(uc);
    }
}


produces the following output...

A XYZ BA  B

Do you think this is correct?

Shouldn't the output be...

A XYZ B

I think I am doing something stupid here. I would appreciate any help you can provide in helping me understand this issue.


Here is something interesting...

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var r = new Regex("(.*)");
        var c = "XYZ";
        var uc = r.Replace(c, "$1");

        Console.WriteLine(uc);
    }
}


Output...

XYZ

  • Your regex has two matches and Replace will replace both of them. The first is "XYZ" and the second is an empty string. What I'm not sure of is why it has two matches in the first place. You can fix it with ^(.*)$ to force it to consider the beginning and end of the string. Commented Jan 24, 2014 at 14:11
  • You'll always get an extra match at the end of a string if your pattern matches an empty string with no further restrictions. Commented Jan 24, 2014 at 14:12
  • The First Rule of Programming: It's Always Your Fault. Commented Jan 24, 2014 at 14:13
  • See stackoverflow.com/questions/16103346/… for an explanation of the extra match at the end of the string. Commented Jan 24, 2014 at 14:13
  • @w0lf: It probably differs depending on the regex engine. Ruby's engine seems to interpret it the same way (see matches: rubular.com/r/cRaG0rPowZ). Commented Jan 24, 2014 at 14:20

5 Answers

As for why the engine returns 2 matches, it is due to the way .NET (also Perl and Java) handles global matching, i.e. finding all matches of the given pattern in an input string.

The process can be described as follows (the current index is usually set to 0 at the beginning of a search, unless specified otherwise):

  1. From the current index, perform a search.
  2. If there is no match:
    1. If current index already points at the end of the string (current index >= string.length), return the result so far.
    2. Increment current index by 1, go to step 1.
  3. If the main match ($0) is non-empty (at least one character is consumed), add the result and set current index to the end of main match ($0). Then go to step 1.
  4. If the main match ($0) is empty:
    1. If the previous match is non-empty, add the result and go to step 1.
    2. If the previous match is empty, backtrack and continue searching.
    3. If the backtracking attempt finds a non-empty match, add the result, set current index to the end of the match and go to step 1.
    4. Otherwise, increment current index by 1. Go to step 1.

The engine needs to check for empty matches; otherwise, it would end up in an infinite loop. The designers recognize that empty matches are useful (when splitting a string into characters, for example), so the engine must be designed to avoid getting stuck at one position forever.

This process explains why there is an empty match at the end: a search is conducted at the end of the string (index 3) after (.*) matches abc, and since (.*) can match an empty string, an empty match is found. The engine does not produce an infinite number of empty matches, because an empty match has already been found at the end.

 a b c
^ ^ ^ ^
0 1 2 3

First match:

 a b c
^     ^
0-----3

Second match:

 a b c
      ^
      3

With the global matching algorithm above, there can be at most 2 matches starting at the same index, and this can only happen when the first one is an empty match.
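To see the two matches from the diagrams directly, here is a minimal sketch that lists the index and length of every match of (.*) in "abc". DescribeMatches is a hypothetical helper name introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: returns one "index:length:'value'" entry
    // per match of (.*) in the input.
    public static List<string> DescribeMatches(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, "(.*)"))
        {
            result.Add(m.Index + ":" + m.Length + ":'" + m.Value + "'");
        }
        return result;
    }

    public static void Main()
    {
        foreach (var line in DescribeMatches("abc"))
        {
            Console.WriteLine(line);
        }
        // Prints:
        // 0:3:'abc'
        // 3:0:''
    }
}
```

The second entry is the zero-length match at index 3, the position just past the last character.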

Note that JavaScript simply increments the current index by 1 if the main match is empty, so there is at most 1 match per index. However, for (.*) in this case, if you use the global flag g to do global matching, you get the same result:

(Result below is from Firefox, note the g flag)

> "XYZ".replace(/(.*)/g, "A $1 B")
"A XYZ BA  B"
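The simpler strategy mentioned above (advance by one character after an empty match) can be sketched by hand in C#. This is only an illustration of that strategy, not the engine's actual implementation; FindAll is a hypothetical helper built on the instance method Regex.Match(string, int), which starts searching at a given index.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: collects all matches by hand, stepping one
    // character forward after every empty match so the loop cannot stall.
    public static List<Match> FindAll(Regex regex, string input)
    {
        var matches = new List<Match>();
        int index = 0;
        while (index <= input.Length)
        {
            Match m = regex.Match(input, index);
            if (!m.Success)
            {
                break;
            }
            matches.Add(m);
            // Non-empty match: resume where it ended.
            // Empty match: step past its position instead.
            index = m.Length > 0 ? m.Index + m.Length : m.Index + 1;
        }
        return matches;
    }

    public static void Main()
    {
        foreach (Match m in FindAll(new Regex("(.*)"), "XYZ"))
        {
            Console.WriteLine(m.Index + ": '" + m.Value + "'");
        }
        // Prints:
        // 0: 'XYZ'
        // 3: ''
    }
}
```

Without the special case for empty matches, the loop would keep finding the same zero-length match at index 3 forever.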

2 Comments

I understand what you are saying. Matching the empty string once at the end is an artifact of the way this implementation works. But I think this is a bug and not a feature. The .net implementation should do what the JavaScript regex engine does. Upvoted for the effort.
@SandeepDatta: I guess it is a matter of understanding and familiarity with the features and the quirks of a language.

I'll have to think about why this happens; I'm sure something is being missed. In any case, this fixes the problem: just anchor the regex.

var r = new Regex("^(.*)$");

Here's the .NetFiddle demo
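As a quick check of the anchored pattern, the sketch below runs it against both "XYZ" and an empty input. Wrap is a hypothetical helper name used only here.

```csharp
using System;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: wraps the whole input using the anchored pattern.
    public static string Wrap(string input)
    {
        var r = new Regex("^(.*)$");
        return r.Replace(input, "A $1 B");
    }

    public static void Main()
    {
        Console.WriteLine(Wrap("XYZ"));   // Prints: A XYZ B
        Console.WriteLine(Wrap(""));      // Prints: A  B (two spaces, $1 is empty)

        // With ^ in place, no second match can start at the end of "XYZ".
        Console.WriteLine(Regex.Matches("XYZ", "^(.*)$").Count);   // Prints: 1
    }
}
```

Because ^ only matches at index 0 here (RegexOptions.Multiline is not set), the search that restarts at the end of the string finds nothing.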

1 Comment

I like this regex better since it will work for empty input strings too.

The * quantifier matches 0 or more. This causes there to be 2 matches. XYZ and nothing.

Try the + quantifier instead which matches 1 or more.

A plain explanation would be to look at the string like this: XYZ<nothing>

  1. We have the matches XYZ and <nothing>
  2. For each match
    • Match 1: Replace XYZ with A $1 B ($1 is here XYZ) Result: "A XYZ B"
    • Match 2: Replace <nothing> with A $1 B ($1 is here <nothing>) Result: "A  B" (two spaces, because $1 is empty)

End result: "A XYZ BA  B"
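The same walk-through can be reproduced in code. This sketch uses the Replace overload that takes a MatchEvaluator so that each match and its replacement can be recorded; ReplaceWithLog and the log format are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: performs the replacement and records,
    // for every match, what was matched and what it was replaced with.
    public static string ReplaceWithLog(string input, List<string> log)
    {
        var r = new Regex("(.*)");
        return r.Replace(input, m =>
        {
            // Match.Result expands $1 exactly as Replace would.
            string replacement = m.Result("A $1 B");
            log.Add("index " + m.Index + ": '" + m.Value + "' -> '" + replacement + "'");
            return replacement;
        });
    }

    public static void Main()
    {
        var log = new List<string>();
        string result = ReplaceWithLog("XYZ", log);

        foreach (var entry in log)
        {
            Console.WriteLine(entry);
        }
        // Prints:
        // index 0: 'XYZ' -> 'A XYZ B'
        // index 3: '' -> 'A  B'

        Console.WriteLine(result);   // Prints: A XYZ BA  B
    }
}
```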

Why <nothing> is a match by itself is interesting and something I haven't really thought much about. (Why aren't there infinite <nothing> matches?)

10 Comments

Makes perfect sense! This also explains the second snippet.
* matches 0 or more, but by default the regex is greedy, isn't it? So it should match all the characters! Correct me if I'm wrong.
Yes, that is what has me puzzled.
@SandeepDatta It's not necessarily a bug. Maybe we are all missing something very basic!
I think it's a question of definition. Greedily, all of nothing is a match and all of something is a match. Nothing is no longer nothing if it is put together with something, so they have to be two separate matches...?

Your regex has two matches and Replace will replace both of them. The first is "XYZ" and the second is an empty string. What I'm not sure of is why it has two matches in the first place. You can fix it with ^(.*)$ to force it to consider the beginning and end of the string.

Or use + instead of * to force it to match at least one character.

.* matches an empty string because * allows zero characters.

.+ does not match an empty string because it requires at least one character.
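A small comparison of the two quantifiers, including the empty-input case where they differ, might look like this. Wrap is a hypothetical helper name used only for the comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: applies the "A $1 B" replacement with the given pattern.
    public static string Wrap(string pattern, string input)
    {
        return new Regex(pattern).Replace(input, "A $1 B");
    }

    public static void Main()
    {
        Console.WriteLine(Wrap("(.*)", "XYZ"));   // Prints: A XYZ BA  B
        Console.WriteLine(Wrap("(.+)", "XYZ"));   // Prints: A XYZ B

        // On empty input, (.+) finds nothing to replace,
        // while (.*) still matches the empty string once.
        Console.WriteLine("[" + Wrap("(.+)", "") + "]");   // Prints: []
        Console.WriteLine("[" + Wrap("(.*)", "") + "]");   // Prints: [A  B]
    }
}
```

So + removes the extra match, at the cost of leaving an empty input unchanged.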

Interestingly, in JavaScript (in Chrome):

var r = /(.*)/;
var s = "XYZ";
console.log(s.replace(r, "A $1 B"));

This outputs the expected A XYZ B without the spurious extra match.

Edit (thanks to @nhahtdh): adding the g flag to the JavaScript regex gives you the same result as in .NET:

var r = /(.*)/g;
var s = "XYZ";
console.log(s.replace(r, "A $1 B"));
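For comparison, the closest .NET counterpart to a JavaScript replace without the g flag is probably the Replace overload that takes a maximum number of replacements. A minimal sketch, with ReplaceFirst as a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: replaces only the first match, similar to a
    // JavaScript replace call without the g flag.
    public static string ReplaceFirst(string input)
    {
        var r = new Regex("(.*)");
        // The third argument caps the number of replacements at 1.
        return r.Replace(input, "A $1 B", 1);
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceFirst("XYZ"));   // Prints: A XYZ B
    }
}
```

The empty match at the end of the string still exists; it is simply never replaced because the count has been used up.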

7 Comments

* matches 0 or more, but by default the regex is greedy, isn't it? So it should match all the characters! Correct me if I'm wrong.
Accepted since you were the first to answer with a comment. I have added the comment to your answer.
@SriramSakthivel: Yep, I'll give you that. I tried the same in JavaScript (see my edit) and it doesn't match and replace the empty string.
@MattBurland Out of curiosity I tried RegexOptions.ECMAScript, but it didn't help :(
You forgot the g flag: "XYZ".replace(/(.*)/g, "A $1 B"). There is no reason JS would return a different result here. If you match once (without the g flag), then nothing interesting happens.

Regex is a peculiar language. You have to understand exactly what (.*) is going to match. You also need to understand greediness.

  • (.*) will greedily match 0 or more characters. So, in the string "XYZ", its first match is the entire string, which is placed in $1, giving you this:

    A XYZ B

    It will then try to match again and match the empty string at the end of the input, setting $1 to the empty string, giving you this:

    A  B

    Resulting in the string you are seeing:

    A XYZ BA  B

  • If you wanted to limit the greediness, you would use this expression:

    (.*?)

    This matches the empty string at each position (before X, before Y, before Z, and at the end) and leaves the characters themselves in place, resulting in this:

    A  BXA  BYA  BZA  B
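One way to check what the lazy pattern (.*?) actually matches is to list every match with its position. LazyMatches is a hypothetical helper name introduced for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Program
{
    // Hypothetical helper: lists "index:'value'" for every match of (.*?).
    public static List<string> LazyMatches(string input)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(input, "(.*?)"))
        {
            found.Add(m.Index + ":'" + m.Value + "'");
        }
        return found;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", LazyMatches("XYZ")));
        // Prints: 0:'', 1:'', 2:'', 3:''

        Console.WriteLine(Regex.Replace("XYZ", "(.*?)", "A $1 B"));
        // Prints: A  BXA  BYA  BZA  B
    }
}
```

Every match is empty, so X, Y, and Z are never replaced; the replacement text is only inserted around them.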

If you do not want your regex to match again at the end of your given string, anchor it with ^ and $.

To give you a better perspective of what is happening, consider this test and the resulting matching groups.

    [TestMethod()]
    public void TestMethod3()
    {
        var myText = "XYZ";
        var regex = new Regex("(.*)");
        var m = regex.Match(myText);
        var matchCount = 0;
        while (m.Success)
        {
            Console.WriteLine("Match" + (++matchCount));
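            // The pattern defines only one capture group. Groups[2] does not
            // exist and comes back as an empty, unsuccessful group.
            for (int i = 1; i <= 2; i++)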
            {
                Group g = m.Groups[i];
                Console.WriteLine("Group" + i + "='" + g + "'");
                CaptureCollection cc = g.Captures;
                for (int j = 0; j < cc.Count; j++)
                {
                    Capture c = cc[j];
                    Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
                }
            }
            m = m.NextMatch();
        }
    }

Output:

Match1
Group1='XYZ'
Capture0='XYZ', Position=0
Group2=''
Match2
Group1=''
Capture0='', Position=3
Group2=''

Notice that there are two matches. In the first, group 1 captured the entire string XYZ; in the second, group 1 captured an empty string at position 3. So $1 was swapped out for XYZ in the first case and for an empty string in the second. (Group2 is always empty because the pattern has no second group.)

Also note that the forward slash / is just another character to the .NET regex engine and has no special meaning. In JavaScript, / delimits a regex literal, so /(.*)/g is the pattern (.*) with the g flag, not a pattern containing slashes.

Finally, to get what you actually desire, consider this test:

    [TestMethod]
    public void TestMethod1()
    {
        var r = new Regex(@"^(.*)$");
        var c = "XYZ";
        var uc = r.Replace(c, "A $1 B");

        Assert.AreEqual("A XYZ B", uc);
    }

5 Comments

So...are you suggesting that the difference here between C# and Javascript behavior is because C# strings are null-terminated (internally at least) while Javascript strings are not (as far as I can tell)?
That truly is an interesting problem considering C# strings are not null terminated. But the .net regex engine appears to attempt a match past the bounds of the given string if not constrained with ^ and $.
From C# in Depth: Although strings aren't null-terminated as far as the API is concerned, the character array is null-terminated, as this means it can be passed directly to unmanaged functions without any copying being involved, assuming the inter-op specifies that the string should be marshalled as Unicode. So my thought is that internally the regex engine must be working with the character array directly and sees the null terminating character. But I truly don't know.
@AaronPalmer: I don't think it matches past the bounds. The engine simply sees a string of length 3, abc, as having 4 indices (0, 1, 2, 3). The search is made at index 3, which explains the empty string result. Of course, accessing index 3 of the string simply doesn't work, but we should think of the index as the space between the characters here.
@MattBurland, very possible, and that fits in nicely with nhahtdh's explanation as well.
