
I'm trying to add a new column based on another using pattern matching. I've read this post, but I'm not getting the desired output.

I want to create a new column (SubOrder) based on the GreatGroup column. I have tried the following:

SubOrder <- rep(NA_character_, length(myData))

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"
SubOrder[grepl("aquults", myData, ignore.case = TRUE)] <-  "Aquults"
SubOrder[grepl("aqualfs", myData, ignore.case = TRUE)] <-  "aqualfs"
SubOrder[grepl("humods", myData, ignore.case = TRUE)] <-  "humods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("orthods", myData, ignore.case = TRUE)] <-  "orthods"
SubOrder[grepl("udalfs", myData, ignore.case = TRUE)] <-  "udalfs"
SubOrder[grepl("psamments", myData, ignore.case = TRUE)] <-  "psamments"
SubOrder[grepl("udepts", myData, ignore.case = TRUE)] <-  "udepts"
SubOrder[grepl("fluvents", myData, ignore.case = TRUE)] <-  "fluvents"
SubOrder[grepl("aquods", myData, ignore.case = TRUE)] <-  "aquods"

For example, I'm looking for "udults" inside any word, such as Hapludults or Paleudults, and want to return just "udults".

EDIT: If anyone wants to take a shot at alistaire's comment, these are the search patterns I would use.

 subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents")

Example data below.

myData <- dput(head(test))
structure(list(1:6, SID = c(200502L, 200502L, 200502L, 200502L, 
200502L, 200502L), Groupdepth = c(11L, 12L, 13L, 14L, 21L, 22L
), AWC0to10 = c(0.12, 0.12, 0.12, 0.12, 0.12, 0.12), AWC10to20 = c(0.12, 
0.12, 0.12, 0.12, 0.12, 0.12), AWC20to50 = c(0.12, 0.12, 0.12, 
0.12, 0.12, 0.12), AWC50to100 = c(0.15, 0.15, 0.15, 0.15, 0.15, 
0.15), Db3rdbar0to10 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), 
    Db3rdbar10to20 = c(1.43, 1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar20to50 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), Db3rdbar50to100 = c(1.43, 
    1.43, 1.43, 1.43, 1.43, 1.43), HydrcRatngPP = c(0L, 0L, 0L, 
    0L, 0L, 0L), OrgMatter0to10 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter10to20 = c(1.25, 1.25, 1.25, 1.25, 1.25, 
    1.25), OrgMatter20to50 = c(1.02, 1.02, 1.02, 1.02, 1.02, 
    1.02), OrgMatter50to100 = c(0.12, 0.12, 0.12, 0.12, 0.12, 
    0.12), Clay0to10 = c(8, 8, 8, 8, 8, 8), Clay10to20 = c(8, 
    8, 8, 8, 8, 8), Clay20to50 = c(9.4, 9.4, 9.4, 9.4, 9.4, 9.4
    ), Clay50to100 = c(40, 40, 40, 40, 40, 40), Sand0to10 = c(85, 
    85, 85, 85, 85, 85), Sand10to20 = c(85, 85, 85, 85, 85, 85
    ), Sand20to50 = c(83, 83, 83, 83, 83, 83), Sand50to100 = c(45.8, 
    45.8, 45.8, 45.8, 45.8, 45.8), pHwater0to20 = c(6.3, 6.3, 
    6.3, 6.3, 6.3, 6.3), Ksat0to10 = c(23, 23, 23, 23, 23, 23
    ), Ksat10to20 = c(23, 23, 23, 23, 23, 23), Ksat20to50 = c(19.7333, 
    19.7333, 19.7333, 19.7333, 19.7333, 19.7333), Ksat50to100 = c(9, 
    9, 9, 9, 9, 9), TaxClName = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", "Fine, mixed, semiactive, mesic Oxyaquic Hapludults", 
    "Fine, mixed, semiactive, mesic Oxyaquic Hapludults"), GreatGroup = c("Hapludults", 
    "Hapludults", "Hapludults", "Hapludults", "Hapludults", "Hapludults"
    )), .Names = c("", "SID", "Groupdepth", "AWC0to10", "AWC10to20", 
"AWC20to50", "AWC50to100", "Db3rdbar0to10", "Db3rdbar10to20", 
"Db3rdbar20to50", "Db3rdbar50to100", "HydrcRatngPP", "OrgMatter0to10", 
"OrgMatter10to20", "OrgMatter20to50", "OrgMatter50to100", "Clay0to10", 
"Clay10to20", "Clay20to50", "Clay50to100", "Sand0to10", "Sand10to20", 
"Sand20to50", "Sand50to100", "pHwater0to20", "Ksat0to10", "Ksat10to20", 
"Ksat20to50", "Ksat50to100", "TaxClName", "GreatGroup"), class = c("tbl_df", 
"data.frame"), row.names = c(NA, -6L))
  • To make your code a lot DRYer, make a vector of your patterns (and replacements, if they're different), and use sapply to call grepl or gsub or whatever you like. Commented Feb 5, 2016 at 21:32
  • I attempted something similar with: subOrderNames <- c("Udults", "Aquults", "Aqualfs", "Humods", "Udalfs", "Orthods", "Psamments", "Udepts", "fluvents") but got lost in the machine 0o0o0 Commented Feb 5, 2016 at 21:51
  • With a for loop, pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods'); for(x in 1:length(pat)){SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <- pat[x]} Make a second vector for replacements and substitute it for the second pat[x] if you need. Commented Feb 5, 2016 at 21:54
  • Or more directly, myData$SubOrder <- myData$GreatGroup; for(x in pat){myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)}. If there's no match in this case, it's left with the value of GreatGroup instead of NA, though. Commented Feb 5, 2016 at 22:13

4 Answers

5

A few options, some of which I posted in the comments above.

Note: All options assume that the replacement for a string matching a pattern is just the pattern itself. If you want something else, they're all easily editable to include separate replacement values.

Option 1: for + grepl

Using the same code as the original, but looping to avoid repetitive code:

# make a vector of patterns
pat <- c('udults', 'aquults', 'aqualfs', 'humods', 'udalfs', 'orthods', 'psamments', 'udepts', 'fluvents', 'aquods')

# one element per row of myData (length(myData) would count columns instead)
SubOrder <- rep(NA_character_, nrow(myData))

for(x in seq_along(pat)){
  SubOrder[grepl(pat[x], myData$GreatGroup, ignore.case = TRUE)] <-  pat[x]
}
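If the replacement should differ from the pattern (for example, a capitalised display name), the same loop can index a second vector. This is only a sketch: great_group and repl are made-up stand-ins, not the asker's data.

```r
# Hypothetical input values standing in for myData$GreatGroup
great_group <- c("Hapludults", "Endoaquults", "Unknown")

pat  <- c("udults", "aquults")   # patterns to search for
repl <- c("Udults", "Aquults")   # assumed replacement labels, same order as pat

sub_order <- rep(NA_character_, length(great_group))
for (i in seq_along(pat)) {
  # rows whose value contains pat[i] get the matching label
  sub_order[grepl(pat[i], great_group, ignore.case = TRUE)] <- repl[i]
}
sub_order
```

Rows that match nothing stay NA because the result vector starts out as all NA.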

Option 2: for + gsub

Build the new column in place by copying myData$GreatGroup and then altering it with gsub. The '.*' pasted on either side of each pattern matches the rest of the string, so the whole value (not just the matched part) is replaced by the pattern.

myData$SubOrder <- myData$GreatGroup
for(x in pat){
  myData$SubOrder <- gsub(paste0('.*', x, '.*'), x, myData$SubOrder, ignore.case = TRUE)
}

Note that values not matched by one of the strings in pat will have the value from GreatGroup, not NA. If you want them to be NA, fix them with

myData$SubOrder[!(myData$SubOrder %in% pat)] <- NA
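To see Option 2 and the NA fix together, here is a small self-contained run. The demo data frame and its values are invented for illustration; only the GreatGroup column name comes from the question.

```r
# Hypothetical stand-in for myData
demo <- data.frame(
  GreatGroup = c("Hapludults", "Paleudults", "Endoaquults",
                 "Quartzipsamments", "Unknown"),
  stringsAsFactors = FALSE
)
pat <- c("udults", "aquults", "psamments")

# Copy the column, then collapse every matching value to its pattern
demo$SubOrder <- demo$GreatGroup
for (x in pat) {
  demo$SubOrder <- gsub(paste0(".*", x, ".*"), x, demo$SubOrder,
                        ignore.case = TRUE)
}

# Anything that is still not one of the patterns was never matched
demo$SubOrder[!(demo$SubOrder %in% pat)] <- NA
demo
```

The %in% check works here because the replacement is the lower-case pattern itself; with separate replacement labels, compare against those labels instead.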

Option 3: named list + stringr::str_replace_all

My favorite because it doesn't loop, although it requires the stringr package (which is pretty awesome, anyway).

Make a named list from pat, where each name is the regex you want to replace, and each item is the replacement string:

l <- as.list(pat)
names(l) <- paste0('.*', pat, '.*')

so it looks like

> l
$`.*udults.*`
[1] "udults"

$`.*aquults.*`
[1] "aquults"

$`.*aqualfs.*`
[1] "aqualfs"
......

Then use str_replace_all to DO IT ALL AT ONCE:

library(stringr)
myData$SubOrder <- str_replace_all(myData$GreatGroup, l)

Boom.

Note 1: str_replace_all doesn't have an ignore.case option, but you can wrap myData$GreatGroup in tolower (easy) or reconfigure the regex (hard).

Note 2: Like Option 2, it leaves unmatched entries as the value from GreatGroup, so use the line at the end of that option to go back to NAs, if you like.
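If stringr isn't available, a similar "all at once" effect can be had in base R. Note this is a different technique from str_replace_all: it joins the patterns into one alternation and extracts the match with regexpr and regmatches. The great_group vector is a made-up example.

```r
pat <- c("udults", "aquults", "aqualfs", "humods", "udalfs", "orthods",
         "psamments", "udepts", "fluvents", "aquods")

# Hypothetical input values standing in for myData$GreatGroup
great_group <- c("Hapludults", "Paleudults", "Endoaqualfs", "Unknown")

# One regex that matches any of the patterns
rx <- paste(pat, collapse = "|")
m  <- regexpr(rx, great_group, ignore.case = TRUE)

# regmatches() returns only the matched pieces, so fill them into an
# NA vector at the positions where a match was found
sub_order <- rep(NA_character_, length(great_group))
sub_order[m != -1] <- tolower(regmatches(great_group, m))
sub_order
```

Unmatched values come out as NA directly, so the clean-up line from Option 2 isn't needed with this variant.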


1 Comment

+1 for Option 3... though now it doesn't work with named lists, just named vectors: names(pat) <- paste0('.*', pat, '.*') and then myData$SubOrder <- str_replace_all(myData$GreatGroup, pat) does the trick.
1

I'm using dplyr, but you probably need to create a giant nested ifelse statement...

library(dplyr)

myData %>%
  mutate(SubOrder = ifelse(grepl('udults', GreatGroup, ignore.case = TRUE), 'Udults',
                           ifelse(grepl('aquults', GreatGroup, ignore.case = TRUE), 'Aquults',
                                  ###  All of the other ifelse statements
                                  ifelse(grepl('fluvents', GreatGroup, ignore.case = TRUE), 'fluvents', 'aquods')
                           )))


1

Try this:

myData$SubOrder[grepl("udults", myData$TaxClName, ignore.case = TRUE) | grepl("udults", myData$GreatGroup, ignore.case = TRUE)] <-  "Udults"

You can add as many columns to the filter as you want.
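With more than a couple of columns, chaining grepl calls with | gets long. One possible way to generalise it, sketched on a made-up data frame (the demo values and the cols vector are assumptions), is to run grepl over each column and OR the results together:

```r
# Hypothetical stand-in for myData with two text columns
demo <- data.frame(
  TaxClName  = c("Fine, mixed, semiactive, mesic Oxyaquic Hapludults",
                 "no match here", "no match here"),
  GreatGroup = c("Hapludults", "Alaquods", "Paleudults"),
  stringsAsFactors = FALSE
)

cols <- c("TaxClName", "GreatGroup")  # columns to search

# One logical vector per column, combined element-wise with OR
hit <- Reduce(`|`, lapply(demo[cols], grepl,
                          pattern = "udults", ignore.case = TRUE))

demo$SubOrder <- NA_character_
demo$SubOrder[hit] <- "Udults"
demo$SubOrder
```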


1

You could do this with a function that successively substitutes each pattern, which avoids repeating your code over and over. Note that with this approach, if a given string matches more than one pattern, the first pattern in the substitution sequence will be the one that gets used.

# multi-grepl function adapted from http://stackoverflow.com/a/15254254/496488
mgrepl <- function(pattern, replacement, x, ...) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in seq_along(pattern)) {
    result[grepl(pattern[i], result, ...)] <- replacement[i]
  }
  result
}

# Patterns and replacements
pat = c("udults","aquults","humods","fluvents")
repl = c("Udults","Aquults","humods","fluvents")

SubOrder =  mgrepl(pat, repl, myData$GreatGroup)

SubOrder

[1] "Udults" "Udults" "Udults" "Udults" "Udults" "Udults"

# Or, if you want to add this as a new column to the data:
myData$SubOrder = mgrepl(pat, repl, myData$GreatGroup)

One additional note: the code in your question passes the whole data frame to grepl, rather than the column you want to search:

SubOrder[grepl("udults", myData, ignore.case = TRUE)] <-  "Udults"

should be changed to

SubOrder[grepl("udults", myData$GreatGroup, ignore.case = TRUE)] <-  "Udults"
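The reason this matters: when grepl is given something that is not a character vector, it coerces it with as.character, and for a data frame that yields one string per column. A quick check on a made-up two-column data frame (demo is an assumption, not your data):

```r
# Hypothetical stand-in for myData
demo <- data.frame(
  SID        = c(1L, 2L, 3L),
  GreatGroup = c("Hapludults", "Paleudults", "Alaquods"),
  stringsAsFactors = FALSE
)

# Whole data frame: one result per column
by_frame <- grepl("udults", demo, ignore.case = TRUE)
length(by_frame) == ncol(demo)

# Single column: one result per row, which is what the assignment needs
by_column <- grepl("udults", demo$GreatGroup, ignore.case = TRUE)
length(by_column) == nrow(demo)
```

So indexing SubOrder with the data-frame version selects positions by column, not by row.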

UPDATE: Regarding your comment, see the code below. The function does replace both values with "Udults".

myData$GreatGroup[1] = "Paleudults"

myData$GreatGroup

[1] "Paleudults" "Hapludults" "Hapludults" "Hapludults" "Hapludults" "Hapludults"

mgrepl(pat, repl, myData$GreatGroup)

[1] "Udults" "Udults" "Udults" "Udults" "Udults" "Udults"

2 Comments

Thanks for pointing out the error in my code. Also, it appears this code returns the pattern as the replacement. I would need to be able to search for 'udults' in multiple words such as Hapludults, Paleudults, and just return udults for both.
See the update to my answer. The code in my answer seems to be doing the correct thing as far as I can tell.
