Extract variable names using stringr in R

Question

I am trying to extract some variable names and numbers from the following vector and store them into two new variables:

unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg", 
  "PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg", 
  "PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg", 
  "PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg", 
  "PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg", 
  "PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg", 
  "PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg", 
  "PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)

I would like to extract every character before the PMS for the first variable. This includes the strings that begin with PM or PNC, as well as the underscores and digits. I would like to store these results in a variable called pollutant.

Desired output:

unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10"

I would like to extract everything after the PMS for the second variable.

For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string. However, it would be useful to include the A_Avg or S_Avg in the extraction as well.

Here's my first attempt:

library(stringr)
model_id <- str_extract(unique_strings, "[0-9]{4,}")

unique(model_id)
[1] "5003" "7003"

I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!

www · Accepted Answer · 2017-12-22 03:06:19Z

2

We can use str_split to split each string on "PMS". After that, use str_replace to remove the trailing "_" from the first column. The result is a character matrix m: the first variable is in the first column, and the second variable is in the second column.

library(stringr)
m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
m[, 1] <- str_replace(m[, 1], "_$", "")
m
#       [,1]       [,2]        
#  [1,] "PM_1"     "5003_S_Avg"
#  [2,] "PM_2_5"   "5003_S_Avg"
#  [3,] "PM_10"    "5003_S_Avg"
#  [4,] "PM_1"     "5003_A_Avg"
#  [5,] "PM_2_5"   "5003_A_Avg"
#  [6,] "PM_10"    "5003_A_Avg"
#  [7,] "PNC_0_3"  "5003_Avg"  
#  [8,] "PNC_0_5"  "5003_Avg"  
#  [9,] "PNC_1_0"  "5003_Avg"  
# [10,] "PNC_2_5"  "5003_Avg"  
# [11,] "PNC_5_0"  "5003_Avg"  
# [12,] "PNC_10_0" "5003_Avg"  
# [13,] "PM_1"     "7003_S_Avg"
# [14,] "PM_2_5"   "7003_S_Avg"
# [15,] "PM_10"    "7003_S_Avg"
# [16,] "PM_1"     "7003_A_Avg"
# [17,] "PM_2_5"   "7003_A_Avg"
# [18,] "PM_10"    "7003_A_Avg"
# [19,] "PNC_0_3"  "7003_Avg"  
# [20,] "PNC_0_5"  "7003_Avg"  
# [21,] "PNC_1_0"  "7003_Avg"  
# [22,] "PNC_2_5"  "7003_Avg"  
# [23,] "PNC_5_0"  "7003_Avg"  
# [24,] "PNC_10_0" "7003_Avg"
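
As a minimal sketch of how the two columns could be stored as separate variables, the example below uses a shortened copy of the question's vector. The name pollutant comes from the question; sensor is only an illustrative name and does not appear in the original post.

```r
library(stringr)

# Shortened copy of the question's vector, for illustration only
unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg",
                    "PNC_0_3_PMS7003_Avg", "PNC_10_0_PMS7003_Avg")

# Split on "PMS"; simplify = TRUE returns a character matrix
m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)

# First column: text before "PMS", with the trailing underscore removed
pollutant <- str_replace(m[, 1], "_$", "")

# Second column: text after "PMS" (hypothetical variable name)
sensor <- m[, 2]

unique(pollutant)
unique(sensor)
```

Note that splitting on "PMS" drops the "PMS" prefix itself, so sensor holds values such as "5003_S_Avg" rather than "PMS5003_S_Avg".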

answered Dec 22, 2017 at 3:06

www


1 Comment

philiporlando Over a year ago

Thanks for the help. This did the trick and I'm feeling more comfortable with regex now!

akrun · Accepted Answer · 2017-12-22 03:07:32Z

1

We can use str_extract to match either 'PM' or 'PNC' at the start (^) of the string (^(PM|PNC)), followed by a _ and one or more digits (_\\d+), followed by zero or more further groups of a _ and a single digit ((_\\d)*).

library(stringr)
out <- str_extract(unique_strings, "^(PM|PNC)_\\d+(_\\d)*")

This will give NA for those elements that don't have a match. If we need to remove those:

na.omit(out)
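
To illustrate the NA behaviour, here is a small sketch with a made-up input: "Temp_Avg" is a hypothetical name, not one from the question, chosen because it does not start with PM or PNC. Logical indexing is shown as an alternative to na.omit that returns a plain character vector without the extra na.action attribute.

```r
library(stringr)

# "Temp_Avg" is a made-up name that does not follow the PM/PNC pattern
x <- c("PM_2_5_PMS5003_S_Avg", "Temp_Avg", "PNC_10_0_PMS7003_Avg")

# The non-matching element becomes NA
out <- str_extract(x, "^(PM|PNC)_\\d+(_\\d)*")

# Keep only the elements that matched
kept <- out[!is.na(out)]
kept
```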

For the second case, the desired output is not clear. If we need to extract everything after the PMS, we can use a regex lookbehind ((?<=PMS)) and match all the characters that follow (.*):

str_extract(unique_strings, "(?<=PMS).*")
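
If the model number and the averaging suffix are wanted as separate pieces, one possible approach (an assumption about what the asker needs, not something stated in the question) is str_match with two capture groups. The names parts and avg_type below are illustrative only, and the sketch uses a shortened copy of the question's vector.

```r
library(stringr)

# Shortened copy of the question's vector, for illustration only
unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg",
                    "PNC_0_3_PMS7003_Avg", "PNC_10_0_PMS7003_Avg")

# str_match returns a matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups
parts <- str_match(unique_strings, "PMS(\\d{4})_(.*)$")

model_id <- parts[, 2]  # four digits after "PMS"
avg_type <- parts[, 3]  # everything after the following underscore

unique(model_id)
unique(avg_type)
```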

edited Dec 22, 2017 at 3:07

answered Dec 22, 2017 at 3:02

akrun
