0

In PowerShell (5.1 or 7), I run:

PS R:\> "abcdef" -replace '.*','x'
xx
PS R:\> "abcdef" -replace '.+','x'
x
PS R:\> "abcdef" -replace '^.*','x'
x
PS R:\> "abcdef" -replace '^.+','x'
x
PS R:\>
PS R:\> "abcdef" -replace '^','x'
xabcdef
PS R:\>

As you can see, the first command returns xx, but I was expecting a single x. I tried the same patterns with sed in bash (executables from gitdir/usr/bin; MSYS, I think) and got what I expected.

2021-05-01 01:34:27 /r :
$ echo "abcdef" | sed -E s/.*/x/g
x

2021-05-01 01:35:03 /r :
$ echo "abcdef" | sed -E s/.+/x/g
x

2021-05-01 01:35:08 /r :
$ echo "abcdef" | sed -E s/^.*/x/g
x

2021-05-01 01:35:17 /r :
$ echo "abcdef" | sed -E s/^.+/x/g
x

2021-05-01 01:35:20 /r :
$ echo "abcdef" | sed -E s/^/x/g
xabcdef

2021-05-01 01:35:25 /r :
$

I have read the documentation and can't figure out what is happening.

6
  • This seems to be regex behavior, not something specific to PowerShell. 2 matches are returned. I cannot explain it though. regex101.com/r/TE7TcT/1 Commented May 1, 2021 at 8:52
  • Perhaps because the very first match is the zero-length match and the rest is the "or more" match? (In regex, the asterisk means zero or more matches.) Commented May 1, 2021 at 9:55
  • @Theo Nope, it's the other way around: the first match is abcdef, the second is the empty string between f and the end of the string. Commented May 1, 2021 at 10:10
  • @MathiasR.Jessen Good to know! I came up with that by anchoring to the end: "abcdef" -replace '.*$','x' --> xx, while anchoring to the beginning of the string, "abcdef" -replace '^.*','x', returned the single x. Commented May 1, 2021 at 10:15
  • This is what RegexBuddy makes of it i.sstatic.net/BZE5p.png - the same warning is shown when selecting .NET so looks like a general .NET thing Commented May 1, 2021 at 10:49

2 Answers

4

Let's find out!

The easiest way to find out what exactly was matched by a regex pattern in any version of PowerShell is by using Regex.Matches():

PS ~> [regex]::Matches('abcdef', '.*')
    
Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 0
Length   : 6
Value    : abcdef

Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 6
Length   : 0
Value    :

Aha! It's matching the substring abcdef, and then the empty string between f and the end of the string.


In PowerShell 7 we can also use a scriptblock with the replace operator to confirm:

PS ~> "abcdef" -replace '.*',{"['$($_.Value)' (length $($_.Length)) starting at $($_.Index)]"}
['abcdef' (length 6) starting at 0]['' (length 0) starting at 6]
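
For Windows PowerShell 5.1, where -replace does not accept a scriptblock, a rough equivalent is to pass a MatchEvaluator delegate to [regex]::Replace(). This is only a sketch; the variable name $evaluator is my own and not part of any API:

```powershell
# Cast a scriptblock to a MatchEvaluator delegate; it is invoked once per match
# and receives the [System.Text.RegularExpressions.Match] object as its argument.
[System.Text.RegularExpressions.MatchEvaluator] $evaluator = {
    param($m)
    "['$($m.Value)' (length $($m.Length)) starting at $($m.Index)]"
}

# Replace every match of '.*' with a description of that match.
[regex]::Replace('abcdef', '.*', $evaluator)
# ['abcdef' (length 6) starting at 0]['' (length 0) starting at 6]
```

This should work in PowerShell 7 as well, since it only relies on the .NET Regex class.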

I'm afraid I don't know why the regex engine implementors decided that this behavior was preferable to the behavior of sed, but at least we know what happens now.
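
If a single replacement is what you are after, it may help to compare how many matches each pattern produces. The snippet below is a small sketch that reuses the patterns from the question and the comments; the loop and the output format are purely illustrative:

```powershell
# For each pattern, count the matches the .NET regex engine reports
# and show what -replace produces for the same input.
foreach ($pattern in '.*', '.+', '^.*', '.*$') {
    $count  = [regex]::Matches('abcdef', $pattern).Count
    $result = 'abcdef' -replace $pattern, 'x'
    '{0,-4} -> {1} match(es), result: {2}' -f $pattern, $count, $result
}
# .*   -> 2 match(es), result: xx
# .+   -> 1 match(es), result: x
# ^.*  -> 1 match(es), result: x
# .*$  -> 2 match(es), result: xx
```

Patterns that cannot match the empty string at the end of the input ('.+' requires at least one character, '^.*' requires the start of the string) avoid the second, zero-length match.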


3 Comments

Nice demonstration; as for the why: long discussion here, but it still doesn't fully make sense to me.
@mklement0 There's a comment there about "posix leftmost longest match" for sed and awk. Regex101.com shows 2 matches.
@mklement0 Interesting observations. Having not really given it much thought previously, my initial instinct was actually "sed is being weird and 'friendly', .NET is acting how I would expect", along the same lines you point out halfway through (i.e. "position N is a perfectly valid offset for matching").
3

Select-String also shows 2 matches:

# Select-String highlights matches in PowerShell 7, but the empty 2nd match is invisible anyway
'abcdef' | Select-String '.*' -AllMatches | ForEach-Object Matches   # 2 matches

Looks like a .NET thing, even in PowerShell 7. regex101.com/r/VzxbOT/1 gives 2 matches as well, so maybe it's sed that's wrong, since /g means global, i.e. all matches? (Then again, there is the "posix leftmost longest match" rule; should .NET follow that standard?)

[regex]::Replace('abcdef','.*','x')

xx

To replace only once (see "Replacing only the first occurrence of a word in a string"), pass a count of 1:

$pattern = [regex]'.*'
$pattern.replace('abcdef','x',1)

x

Search and replace with awk on macOS works the same way as sed. At first the command only worked for me in bash; when calling awk from PowerShell, the double quotes inside the awk script have to be backslash-escaped:

echo 'abcdef' | awk '{ gsub(/.*/,\"x\"); print }'

x

3 Comments

Yes, sed and awk (both the BSD/macOS and the GNU/Linux implementations) globally match .* only once; ditto for mawk. Python 2.x and Python up to v3.6 also match only once, but from what I can tell the majority of engines match twice.
An aside re the unfortunate need to \ -escape the " chars. PowerShell Core 7.2.0-preview.5 introduced experimental feature PSNativeCommandArgumentPassing, which makes this no longer necessary; it works robustly on Unix platforms, but on Windows important accommodations are missing; plus, there are currently bugs - see GitHub issue #15143.
Nice, thanks for the regex101 link with explanations.
