1

I have a dataframe with two columns for two kinds of rates. But the data inconsistency there has made it difficult to differentiate thousand rates from the normal rates. So, I need to find a pattern of {dot}{3digits} and replace the first dot with blank. So I have created a function to take care of that. But the function is not working as expected and I'm getting an error/warning as below. What could be the issue here?

DF <- data.frame(
  PRODUCT_ID = c("A125", "A444", "A744", "B12", "C7", "D1"),
  LOC_1 = c("2.000.00 usd pr. kg", "37.61 usd pr. piece", "46.58 usd pr. item", "327%", "0", NA),
  LOC_2 = c("45.000.00 usd pr. kg", "3.000.00 usd pr. piece", "780.00 usd pr. item", "505%", "0%", NA)
)
> DF
  PRODUCT_ID               LOC_1                  LOC_2
1       A125 2.000.00 usd pr. kg   45.000.00 usd pr. kg
2       A444 37.61 usd pr. piece 3.000.00 usd pr. piece
3       A744  46.58 usd pr. item    780.00 usd pr. item
4        B12                327%                   505%
5         C7                   0                     0%
6         D1                <NA>                   <NA>

# function to remove the thousand dots (2.000.00 => 2000.00)
dutyFormat <- function(x) {
  if(str_detect(x, '\\.\\d{3}')) {
    return(sub('.', '', x, fixed = TRUE))
  } else {
    return(x)
  }
}

DF[,2:3] <- lapply(DF[,2:3], dutyFormat)
Warning messages:
1: In if (str_detect(x, "\\.\\d{3}")) { :
  the condition has length > 1 and only the first element will be used
2: In if (str_detect(x, "\\.\\d{3}")) { :
  the condition has length > 1 and only the first element will be used

> DF
  PRODUCT_ID              LOC_1                 LOC_2
1       A125 2000.00 usd pr. kg   45000.00 usd pr. kg
2       A444 3761 usd pr. piece 3000.00 usd pr. piece
3       A744  4658 usd pr. item    78000 usd pr. item
4        B12               327%                  505%
5         C7                  0                    0%
6         D1               <NA>                  <NA>

My expected output is this:

enter image description here

1
  • Your dutyFormat is not vectorized because of the if statement. Your code passes in the whole column of values at once, but the if statement can only check one value at a time. Consider using something like ifelse() instead which is vectorized. Commented Jun 16, 2022 at 12:45

2 Answers 2

1

Although your if statement did found patterns of '\\.\\d{3}', the subsequent sub operation did not take consideration of that. It will just remove the first dot.

You could use a regex to find out pattern of '\\.\\d{3}', then capture the digits, and replace the whole thing with the capture group (i.e. the digits).

library(dplyr)

DF %>% mutate(across(starts_with("LOC"), ~gsub("\\.(\\d{3})", "\\1", .x)))

  PRODUCT_ID               LOC_1                 LOC_2
1       A125  2000.00 usd pr. kg   45000.00 usd pr. kg
2       A444 37.61 usd pr. piece 3000.00 usd pr. piece
3       A744  46.58 usd pr. item   780.00 usd pr. item
4        B12                327%                  505%
5         C7                   0                    0%
6         D1                <NA>                  <NA>
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you. This helps me. What is the meaning of "\\1" here?
@Frodo In this regex "\\.(\\d{3})", you can see that (\\d{3}) is surround by curly bracket, which means "capturing" these strings into a capture group. \\1 is the syntax to backreference the first capture group in the regex, and thus the digits. Learn more here regular-expressions.info/refcapture.html
0

You can use stringr::str_replace_all to replace all '.' which are followed by three digits with ''. I used a lookahead for this (see http://www.rexegg.com/regex-lookarounds.html).

DF <- data.frame(
  PRODUCT_ID = c("A125", "A444", "A744", "B12", "C7", "D1"),
  LOC_1 = c("2.000.00 usd pr. kg", "37.61 usd pr. piece", "46.58 usd pr. item", "327%", "0", NA),
  LOC_2 = c("45.000.00 usd pr. kg", "3.000.00 usd pr. piece", "780.00 usd pr. item", "505%", "0%", NA)
)



for (col in colnames(DF)[2:3]) {
  DF[[col]] <- stringr::str_replace_all(DF[[col]], '\\.(?=\\d{3})', '')
}

DF

Which gives

  PRODUCT_ID               LOC_1                 LOC_2
1       A125  2000.00 usd pr. kg   45000.00 usd pr. kg
2       A444 37.61 usd pr. piece 3000.00 usd pr. piece
3       A744  46.58 usd pr. item   780.00 usd pr. item
4        B12                327%                  505%
5         C7                   0                    0%
6         D1                <NA>                  <NA>

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.