0

I have searched high and low and nobody seems to have asked this exact question, so I'm at a loss.

I have a data frame with a couple of columns. One of these columns contains sentences that don't follow any specific format or pattern, which limits how I can extract words from it because I can't rely on position. My goal is to search this column and extract species names from the sentences. I need to be able to extract multiple species at once, because sometimes a sample has tested positive for more than one species and I need that information. My method works fairly well, and the output is fine when there is a single species. The problem is when it identifies more than one species. I would want the output to be something like sp1,sp2,sp3, but instead I get c("sp1","sp2"). I have no clue how to change that. I tried using toString, to no avail. Also, everything is written in lower case, so there is no case issue.

I tried

df = df %>% mutate(Species=str_extract_all(df$RESULT,"sp1|sp2|sp3|sp4"))

where RESULT is the column that I said contains different sentences. This is what I get as output:

Result                     Species
Bla-bla-bla-sp1-bla-sp2    c("sp1","sp2")
Bla-bla-bla-sp3-bla-bla    sp3

But I would want:

Result                     Species
Bla-bla-bla-sp1-bla-sp2    sp1, sp2
Bla-bla-bla-sp3-bla-bla    sp3

I tried:

df = df %>% mutate(Species=str_extract_all(toString(df$RESULT,"sp1|sp2|sp3|sp4")))

but the output ended up being the same

Thanks in advance for your help! I know this is not the clearest example, but I can't use my real data as it contains sensitive info. Also, just so you know, I wrote sp1, sp2 just to mimic species names, but my real data doesn't have species all starting with the same letters, which really limits the extraction methods I can use. For example, it's cat, dog, bird, so methods with sp\d+ won't work, because it's not really species1, species2.

4
  • Please share a reproducible example of your data. Commented Jul 20, 2024 at 17:07
  • In general your approach works, but don't use the data.frame name within mutate, rather try something like this df %>% mutate(Species = str_extract_all(Result, "sp\\d+")). "sp1|sp2|sp3|sp4" should also work. Commented Jul 20, 2024 at 18:00
  • If you want it to be a string try df %>% rowwise() %>% mutate(Species = toString(unlist(str_extract_all(Result, "sp\\d+")))) Commented Jul 20, 2024 at 18:06
  • Seems like paste with collapse=',' should have been the first attempt. And you don't want a list. You want a character value. Commented Jul 20, 2024 at 20:04

2 Answers

1

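Try this. str_extract_all() returns a list with one character vector per row, which is why the column prints as c("sp1","sp2"); collapsing each element with map_chr() gives a single string per row.

library(dplyr)
library(stringr)

df %>%
  mutate(Species = str_extract_all(Result, "sp1|sp2|sp3|sp4") %>%
           purrr::map_chr(~str_c(.x, collapse = ", ")))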
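
Since the real species names reportedly don't share a prefix, the alternation could also be built from a vector of names. The following is only a sketch under assumptions: the names cat, dog and bird come from the question, while the sample sentences, the `species` vector and the `\\b` word boundaries are made up for illustration.

```r
library(dplyr)
library(stringr)
library(purrr)

# Hypothetical data: the real sentences are not shown in the question
df <- tibble(Result = c("tested positive for cat and dog",
                        "only bird detected"))

# Build "\\b(cat|dog|bird)\\b" from a vector of names;
# \\b keeps the pattern from matching inside longer words
species <- c("cat", "dog", "bird")
pattern <- str_c("\\b(", str_c(species, collapse = "|"), ")\\b")

out <- df %>%
  mutate(Species = str_extract_all(Result, pattern) %>%
           map_chr(~ str_c(.x, collapse = ", ")))

out$Species
```

Keeping the names in a vector means new species only need to be added in one place.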

0

For a pure base R approach, you can extract all species abbreviations using gregexpr and then regmatches (str_extract_all from stringr simplifies this approach). To collapse multiple extracted strings into one, use paste with collapse=",". Lastly, we wrap this all into lapply to return the results for each row of the data frame.

df$Species <- lapply(regmatches(df$Result, 
                         gregexpr("sp\\d+", df$Result)),
                     paste, collapse=",")
df

  id                  Result      Species
1  1 Bla-bla-bla-sp1-bla-sp2      sp1,sp2
2  2 Bla-bla-bla-sp3-bla-bla          sp3
3  3        sp7-bla-sp1-sp11 sp7,sp1,sp11
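
One thing to be aware of: lapply() leaves Species as a list column. If a plain character column is preferred, a variant along the same lines could use vapply() instead. This is a sketch, not the answer's original code; the sample sentences are invented and the pattern uses the cat/dog/bird names mentioned in the question.

```r
# Hypothetical sample data
df <- data.frame(id = 1:3,
                 Result = c("bla cat bla dog", "bla bird bla", "bla bla"),
                 stringsAsFactors = FALSE)

species <- c("cat", "dog", "bird")
pattern <- paste(species, collapse = "|")

# One character vector of matches per row
matches <- regmatches(df$Result, gregexpr(pattern, df$Result))

# vapply() enforces exactly one string per row;
# rows without any match collapse to ""
df$Species <- vapply(matches, paste, character(1), collapse = ",")

df$Species
```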

To sort the vectors in the result, we need to remove the "sp" using gsub, convert to numeric using as.numeric, and then sort the elements using sort.

df$Species <- lapply(
       regmatches(df$Result, 
             gregexpr("sp\\d+", df$Result)),
  \(x) paste0("sp", sort(as.numeric(
                    gsub(x, pattern='sp', replacement=''))),
              collapse=","))
df

  id                  Result      Species
1  1 Bla-bla-bla-sp1-bla-sp2      sp1,sp2
2  2 Bla-bla-bla-sp3-bla-bla          sp3
3  3        sp7-bla-sp1-sp11 sp1,sp7,sp11
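
The numeric sort relies on every name being "sp" followed by a number. For names like cat, dog, bird (as in the question), a plain alphabetical sort() may be enough. The sketch below assumes invented sample sentences, and it also drops repeated matches with unique(), which is an addition not present in the code above.

```r
# Hypothetical sample data; "dog" appears twice in the first row
df <- data.frame(id = 1:2,
                 Result = c("dog bla cat bla dog", "bird bla"),
                 stringsAsFactors = FALSE)

pattern <- "cat|dog|bird"

df$Species <- vapply(
  regmatches(df$Result, gregexpr(pattern, df$Result)),
  # de-duplicate, sort alphabetically, then collapse to one string
  function(x) paste(sort(unique(x)), collapse = ","),
  character(1)
)

df$Species
```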

Data:

df <- structure(list(id = 1:3,
                     Result = c("Bla-bla-bla-sp1-bla-sp2",
                                "Bla-bla-bla-sp3-bla-bla",
                                "sp7-bla-sp1-sp11")),
                class = "data.frame", row.names = c(NA, -3L))
