
I discovered that I can use the =~ operator instead of the expr command in my 4.2.10(1) BASH. Because it runs inside the shell rather than as an external command, it is much faster than expr, which can matter inside a loop with many iterations.

I was able to use most of the meta characters of regular expression but not all.

For example, I can check that a string consists of exactly 3 repetitions of (one lowercase letter, one digit, one dot):

[[ "b3.f5.h3." =~ ^([a-z][0-9]\.){3}$  ]] && echo OK
OK

and I can select matched substrings:

[[ "whatis12345thetwo765nmbers" =~ ^[a-z]+([0-9]+)[a-z]+([0-9]+) ]] && \
echo "The two number fields are: ${BASH_REMATCH[1]}  ${BASH_REMATCH[2]}"
The two number fields are: 12345  765

But I would like to use more meta characters, such as the ones listed on this TLDP page.

I would especially like to match word boundaries: \b, \B, \<, \>.

I tried to find an answer in the Advanced Bash-Scripting Guide (in Chapters 18 and 37) but was unsuccessful.

Where can I find a detailed description of the =~ operator?

At the moment I am interested only in BASH and not in gawk, sed, perl or other tools.

  • In general, the TLDP's documentation is often inaccurate or out-of-date (or accurate but showing examples that showcase bad practices). The bash-hackers wiki and the Greg Wooledge wiki are much better sources for information on bash. Commented Feb 17, 2016 at 13:05
  • As for your specific question: Bash doesn't have its own regex implementation -- it depends on your platform's -- so it's important that you include that platform in your question, unless you're only interested in the portable subset. Commented Feb 17, 2016 at 13:17
  • BTW, insofar as your interest is in word boundaries, I often use (^|[[:space:]])WordToMatch($|[[:space:]]) or similar. It's not exactly identical semantics, but frequently good enough. Commented Jan 5, 2021 at 16:41
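
The whitespace-delimited pattern from the last comment can be sketched roughly as follows. This is only an illustration: has_word is a hypothetical helper name, and the word is interpolated into the regex unescaped, so it is assumed to contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# has_word STRING WORD: succeed if WORD occurs in STRING delimited by
# whitespace or by the start/end of the string.
# (has_word is a hypothetical helper name, not a bash builtin.)
has_word() {
  local string=$1 word=$2
  # Keep the regex in a variable and expand it unquoted, so that bash
  # treats it as a regex rather than as a literal string.
  local re="(^|[[:space:]])${word}(\$|[[:space:]])"
  [[ $string =~ $re ]]
}

has_word "the cat sat" "cat" && echo "match"       # whole word
has_word "concatenate" "cat" || echo "no match"    # only a substring
```

Unlike \b, this treats only whitespace as a delimiter, so "cat." or "(cat)" would not count as the word "cat"; the bracket expressions would need to be widened for that.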

2 Answers


=~ supports POSIX ERE, plus whatever extensions the local C library happens to add: bash has no regex engine of its own and literally calls the standard C library's regex functions (regcomp and regexec). Thus, the canonical documentation on the features it is guaranteed to support (as opposed to optional features your local C library may add) is the specification of ERE, IEEE 1003.1, section 9.4.


To amplify this: anything, such as \<, that is added by one particular libc (e.g. glibc) but is not present in the POSIX specification cannot be expected to work portably across all platforms bash supports.

The POSIX-specified special characters (as given in section 9.4.3 of the standard) do not include <, >, b or B, so the escapes \<, \>, \b and \B are all GNU extensions and nonportable.
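
If a script has to run on more than one platform, one option is to probe at runtime whether the local regex library accepts these extensions. The following is a minimal sketch, not an established idiom; supports_gnu_word_anchors is a hypothetical function name, and the result depends entirely on the libc bash was built against.

```shell
#!/usr/bin/env bash
# supports_gnu_word_anchors: succeed if the platform regex library
# treats \< and \> as word-boundary anchors.
# (Hypothetical helper name; the outcome is platform-dependent.)
supports_gnu_word_anchors() {
  local re='\<cat\>'
  # The pattern must match a whole word and must not match inside a
  # longer word; a library that reads \< as a literal "<" fails the
  # first check, and an invalid regex makes [[ ]] return non-zero.
  [[ "a cat here" =~ $re ]] && ! [[ "concatenate" =~ $re ]]
}

if supports_gnu_word_anchors; then
  printf '%s\n' 'word anchors \< and \> are available here'
else
  printf '%s\n' 'word anchors \< and \> are not available; use a portable pattern'
fi
```

On a glibc system the first branch would be expected; elsewhere the script can fall back to a pattern built only from POSIX ERE constructs.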


5 Comments

I don't want to be pedantic, but I think it is worth mentioning that the question was about "what is supported by the =~ operator" and not what the POSIX standard for extended regular expressions is. Actually, if the code runs on a GNU system it might support more features. And these additional features are there to be used! Some implementations and their distributors might come with the POSIX excuse, but in fact they simply did not develop those tools any further. GNU did. I think we should honour that.
@hek2mgl, if you know that your code will never be run on any non-GNU platform that's one thing, but not everyone has that assurance -- and the question was about what bash would support [and the only reasonable answer is that it's guaranteed to support only what the library calls it uses are guaranteed to support]. Heck, even if using the Linux kernel, glibc still isn't the only libc out there -- when I'm building an embedded appliance, I'm just as likely to reach for musl libc instead of glibc, and for very good reasons: etalabs.net/compare_libcs.html
@hek2mgl, anyhow, as I pointed out in a comment on the question, "what is supported by the =~ operator?" is a question that can only be answered in a platform-specific way with a platform being specified, and the OP didn't provide any such specification.
The latest trend I observe is to use Alpine (musl and BusyBox based) as the base image for containers! Welcome back to the shell-script 80s! (Actually I never scripted at that time, but we had these guys in school.) All for the advantage of a few MB! But for embedded development it (still :) makes sense, of course.
At least busybox has ash these days. Back when its shell options were not even remotely POSIX-y... it wasn't a happy time.

The bash manual says:

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

The best resource for how extended (POSIX) regular expressions work on your system is IMHO man egrep.
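
One practical detail that the same section of the bash manual also covers is quoting: any quoted part of the pattern is matched as a literal string. A small sketch of the difference, assuming bash 3.2 or later with default compatibility settings (the variable name re is arbitrary):

```shell
#!/usr/bin/env bash
re='^a.c$'

# Unquoted expansion: the value is compiled as an extended regex,
# so "." matches any single character.
[[ "abc" =~ $re ]] && echo "unquoted: regex match"

# Quoted expansion: the value is matched as a literal string,
# so "abc" does not contain the text ^a.c$ ...
[[ "abc" =~ "$re" ]] || echo "quoted: no match"

# ... but a string containing those exact characters does.
[[ 'x^a.c$y' =~ "$re" ]] && echo "quoted: literal match"
```

This is why regexes containing backslashes or spaces are usually stored in a variable and expanded unquoted on the right-hand side of =~.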

2 Comments

Often, a better specification than that given in the grep man page may be found in man 7 re_format or man 7 regexp (depending on the OP's platform).
@CharlesDuffy The man 7 documents explain the standard rather than how it has been implemented on my individual system (I'm referring to GNU extensions), and I find them a bit "wall of text" in style. For me, man egrep is the page of choice. But again, if I want to know what is portable, I would have a look at the POSIX documents you've linked in your answer.
