I am trying to extract some variable names and numbers from the following vector and store them into two new variables:

unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg", 
  "PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg", 
  "PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg", 
  "PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg", 
  "PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg", 
  "PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg", 
  "PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg", 
  "PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)

For the first variable, I would like to extract everything before the PMS. This includes the leading PM or PNC as well as the underscores and digits that follow it. I would like to store these results in a variable called pollutant.

Desired output:

unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10"

I would like to extract everything after the PMS for the second variable.

For this, I first tried extracting just the model numbers (four-digit numbers ending in 003) from each string. However, it would be useful to include the A_Avg or S_Avg in the extraction as well.

Here's my first attempt:

library(stringr)
model_id <- str_extract(unique_strings, "[0-9]{4,}")

unique(model_id)
[1] "5003" "7003"

I have not used regex before and am having a difficult time navigating existing docs / stack posts. Your input is appreciated!

2 Answers

We can use str_split to split each string on "PMS", then use str_replace to remove the trailing "_" from the first column. The result is the matrix m: the first variable is in the first column and the second variable is in the second column.

library(stringr)
m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
m[, 1] <- str_replace(m[, 1], "_$", "")
m
#       [,1]       [,2]        
#  [1,] "PM_1"     "5003_S_Avg"
#  [2,] "PM_2_5"   "5003_S_Avg"
#  [3,] "PM_10"    "5003_S_Avg"
#  [4,] "PM_1"     "5003_A_Avg"
#  [5,] "PM_2_5"   "5003_A_Avg"
#  [6,] "PM_10"    "5003_A_Avg"
#  [7,] "PNC_0_3"  "5003_Avg"  
#  [8,] "PNC_0_5"  "5003_Avg"  
#  [9,] "PNC_1_0"  "5003_Avg"  
# [10,] "PNC_2_5"  "5003_Avg"  
# [11,] "PNC_5_0"  "5003_Avg"  
# [12,] "PNC_10_0" "5003_Avg"  
# [13,] "PM_1"     "7003_S_Avg"
# [14,] "PM_2_5"   "7003_S_Avg"
# [15,] "PM_10"    "7003_S_Avg"
# [16,] "PM_1"     "7003_A_Avg"
# [17,] "PM_2_5"   "7003_A_Avg"
# [18,] "PM_10"    "7003_A_Avg"
# [19,] "PNC_0_3"  "7003_Avg"  
# [20,] "PNC_0_5"  "7003_Avg"  
# [21,] "PNC_1_0"  "7003_Avg"  
# [22,] "PNC_2_5"  "7003_Avg"  
# [23,] "PNC_5_0"  "7003_Avg"  
# [24,] "PNC_10_0" "7003_Avg"
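
To get the two separate variables the question asks for, the matrix columns can be assigned directly. This is a minimal sketch on a four-element subset of unique_strings (shortened only for brevity); the names pollutant and model_id come from the question.

```r
library(stringr)

# A four-element subset of unique_strings, used here only for brevity
x <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_A_Avg",
       "PNC_0_3_PMS7003_Avg", "PNC_10_0_PMS7003_Avg")

# Split on "PMS"; simplify = TRUE returns a character matrix, one row per string
m <- str_split(x, pattern = "PMS", simplify = TRUE)

# Column 1 is the pollutant (after dropping the trailing "_"), column 2 is the rest
pollutant <- str_replace(m[, 1], "_$", "")
model_id  <- m[, 2]

pollutant
# [1] "PM_1"     "PM_2_5"   "PNC_0_3"  "PNC_10_0"
model_id
# [1] "5003_S_Avg" "5003_A_Avg" "7003_Avg"   "7003_Avg"
```

Note that this returns "PNC_10_0" rather than the "PNC_10" shown in the question's desired output; dropping the final "_0" would need an extra step.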

1 Comment

Thanks for the help. This did the trick and I'm feeling more comfortable with regex now!

We can use str_extract to match either 'PM' or 'PNC' at the start (^) of the string (^(PM|PNC)), followed by a _ and one or more digits (_\\d+), and then zero or more further groups of a _ and a single digit ((_\\d)*).

library(stringr)
out <- str_extract(unique_strings, "^(PM|PNC)_\\d+(_\\d)*")

This will give NA for those elements that don't have a match. If we need to remove those:

na.omit(out)
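
As a small illustration of the NA behaviour, here is a sketch with one string that does not match. "Temp_Avg" is a made-up name, not part of the original data. Every string in unique_strings starts with PM or PNC, so for that vector no NA is expected.

```r
library(stringr)

# "Temp_Avg" is a hypothetical name that does not start with PM or PNC
x <- c("PM_2_5_PMS5003_S_Avg", "PNC_10_0_PMS7003_Avg", "Temp_Avg")

out <- str_extract(x, "^(PM|PNC)_\\d+(_\\d)*")
out
# [1] "PM_2_5"   "PNC_10_0" NA

# Logical indexing drops the NA and returns a plain character vector
kept <- out[!is.na(out)]
kept
# [1] "PM_2_5"   "PNC_10_0"
```

Unlike na.omit(), which attaches an "na.action" attribute to its result, indexing with !is.na() leaves an ordinary character vector.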

For the second case, the desired output is not clear. If we need to extract everything after the PMS, we can use a regex lookbehind ((?<=PMS)) and match all the characters that follow (.*):

str_extract(unique_strings, "(?<=PMS).*")
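
A short sketch of what the lookbehind returns, on two strings taken from unique_strings. The second pattern, which keeps only the four-digit model number, is an assumed variant and not something the question asked for.

```r
library(stringr)

# Two strings from unique_strings, one with and one without the S/A suffix
x <- c("PM_1_PMS5003_S_Avg", "PNC_0_3_PMS7003_Avg")

# The lookbehind requires "PMS" just before the match but does not include it
model_id <- str_extract(x, "(?<=PMS).*")
model_id
# [1] "5003_S_Avg" "7003_Avg"

# Assumed variant: only the four digits that follow "PMS"
model_num <- str_extract(x, "(?<=PMS)\\d{4}")
model_num
# [1] "5003" "7003"
```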
