I am trying to extract 22 chocolates from the following string:

   SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila.

using the regex \\d+\\s*(chocolates.|chocolate.). I used:

grep("\\d+\\s*(chocolates.|chocolate.)",s)

but it does not return the string 22 chocolates. How can I extract the part of the string that matches the regex?

  • "[0-9]+ chocolates" works for me in sublime Commented Feb 24, 2018 at 10:37
  • @iOSDeveloper It just returns a number, which is equal to 1 Commented Feb 24, 2018 at 10:39

2 Answers

Here is an option using sub from base R:

x <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
sub(".*?(\\d+ chocolates?).*", "\\1", x)

22 chocolates

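The pattern in parentheses, (\\d+ chocolates?), is a capture group, and its match is available as \\1 in the replacement string. Because the surrounding .*? and .* match everything before and after the group, sub replaces the whole input with just the captured text.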
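The ? after the leading .* matters here. A minimal sketch comparing the lazy and greedy prefixes on the same input follows; passing perl = TRUE is an assumption added for this illustration so that the lazy-quantifier semantics are unambiguous (the answer's own call uses the default engine).

```r
x <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."

# Lazy prefix: .*? consumes as little as possible,
# so \\d+ is free to take both digits of "22".
lazy <- sub(".*?(\\d+ chocolates?).*", "\\1", x, perl = TRUE)

# Greedy prefix: .* consumes as much as possible and gives back
# only the single digit that \\d+ needs in order to match.
greedy <- sub(".*(\\d+ chocolates?).*", "\\1", x, perl = TRUE)

print(lazy)    # "22 chocolates"
print(greedy)  # "2 chocolates"
```

With the greedy prefix the first digit of 22 is swallowed by .*, which is why the lazy form is used in the answer.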

Edit:

As you have seen, if sub cannot find a match, it returns the input string unchanged. This behavior usually makes sense: when there is nothing to substitute, you would not want the input to be altered.

If you need to find out whether or not the pattern matches, then calling grep is one option. It returns the index of each matching element, or integer(0) when nothing matches:

grep(".*(\\d+ chocolates?).*", x, value = FALSE)
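As a sketch of how the match check and the extraction could be combined, the helper below returns NA when the pattern does not match instead of echoing the input. The name extract_or_na is hypothetical (it is not a base R function), and grepl is swapped in for grep because it returns TRUE/FALSE directly; perl = TRUE is likewise an assumption of this sketch.

```r
# extract_or_na is a hypothetical helper, not part of base R.
extract_or_na <- function(text, pattern) {
  # grepl reports, per element, whether the pattern matches at all
  matched <- grepl(pattern, text, perl = TRUE)
  # keep the capture group where there is a match, NA elsewhere
  ifelse(matched, sub(pattern, "\\1", text, perl = TRUE), NA_character_)
}

x <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."

found   <- extract_or_na(x, ".*?(\\d+ chocolates?).*")
missing <- extract_or_na(x, ".*?(\\d+ chocos).*")

print(found)    # "22 chocolates"
print(missing)  # NA
```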

6 Comments

Could you explain why sub("\\d+\\s*(wins.|win.)","\\1",c("Nominated for 2 Oscars. Another 22 wins & 64 nominations.")) doesn't work? Also, what does \\1 mean?
Okay, and how do I check whether a match occurred? For example, if I change chocolates to chocos, it just returns the entire string.
sub substitutes one string (matched by the regex pattern) for another. If the pattern can't match, then there's nothing to be substituted, so the original string is returned.
If you look carefully at your "wins" example, a substitution does occur. The entire matched pattern "22 wins " is replaced with the part of the pattern within the capture group, "wins ", i.e. the "22 " is removed.
@rosscova Had a small query. What is the use of the ? in .*? in the complete regex .*?(\\d+ chocolates?).*. Why can't it just be .* instead?

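Your original pattern does not return 22 chocolates because grep is not an extraction function: by default it returns the indices of the elements of a character vector that contain a match anywhere inside (or, with value = TRUE, those whole elements). To get only the matched text, the pattern should be used with a matching function.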
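To make the difference concrete, here is a small sketch that runs the question's pattern through grep and through a matching function on the same string. The variable names idx, whole and part are only illustrative.

```r
s <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
p <- "\\d+\\s*(chocolates.|chocolate.)"

# grep reports WHICH elements match: here the first (and only) one
idx <- grep(p, s)

# value = TRUE returns the whole matching element, not the matched part
whole <- grep(p, s, value = TRUE)

# regexpr + regmatches return only the substring the pattern matched
part <- regmatches(s, regexpr(p, s))

print(idx)
print(whole)
print(part)
```

Note that part ends with a space, because the trailing . in the pattern matches the character after chocolates.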

Also, note that the (chocolates.|chocolate.) alternation group can be shortened to chocolates?. since the two alternatives differ only in the plural s, which a ? quantifier (1 or 0 occurrences) makes optional. The trailing . matches one extra character after the word, so it is dropped in the patterns below.
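As a quick check of that equivalence, the sketch below extracts with both forms on a plural and a singular input. The second input string y is made up for illustration.

```r
x <- "SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
y <- "Only 1 chocolate!"  # made-up singular example

long_form  <- "\\d+\\s*(chocolates.|chocolate.)"
short_form <- "\\d+\\s*chocolates?."

# Helper for this sketch: first match of a pattern in a string
first_match <- function(pattern, text) regmatches(text, regexpr(pattern, text))

plural_same   <- identical(first_match(long_form, x), first_match(short_form, x))
singular_same <- identical(first_match(long_form, y), first_match(short_form, y))

print(plural_same)    # TRUE
print(singular_same)  # TRUE
```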

One such matching function is stringr::str_extract (use str_extract_all to match all occurrences):

> library(stringr)
> x <- " SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
> p <- "\\d+\\s*chocolates?"
> str_extract(x, p)
[1] "22 chocolates"

Or a base R regmatches/regexpr (or gregexpr to extract multiple occurrences) approach:

> x <- " SOMETEXT for 2 FFXX. Another 22 chocolates & 45 chamkila."
> p <- "\\d+\\s*chocolates?"
> regmatches(x, regexpr(p, x))
[1] "22 chocolates"
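For multiple occurrences, a minimal base R sketch with gregexpr might look like the following. The input string x2 is invented for illustration.

```r
# x2 is a made-up input with several occurrences
x2 <- "3 chocolates, then 1 chocolate and 22 chocolates"
p  <- "\\d+\\s*chocolates?"

# gregexpr finds every match; regmatches returns a list with one
# character vector per input element, so take the first element
all_matches <- regmatches(x2, gregexpr(p, x2))[[1]]

print(all_matches)
# [1] "3 chocolates"  "1 chocolate"   "22 chocolates"
```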
