Is there an R function that detects a specific string and replaces it with the value of another observation, based on a number within the string?

Question

I am using constituency data from the 1994 German election. Some observations contain a string indicating that the value is given in a different row, following the scheme "siehe Wkr xxx" ("see constituency xxx"). For example, the non-employment rate for Hamburg-Altona is only collected for Hamburg as a whole, so the constituency Hamburg-Altona should take the value of the observation Hamburg-Mitte.

example_data <- data.frame(
  constituency_no    = c("001", "002", "003", "004", "005"),
  constituency_name  = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
  nonemployementrate = c(0.04, "siehe Wkr 001", "siehe Wkr 001", 0.03, 0.02)
)

I want a function that automatically detects any string beginning with "siehe Wkr " and replaces it with the value from the constituency number it refers to. In the example, the function should replace the nonemployementrate of Hamburg-Altona and Hamburg-Nord with 0.04, because both strings refer to constituency_no "001".

result <- data.frame(
  constituency_no    = c("001", "002", "003", "004", "005"),
  constituency_name  = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
  nonemployementrate = c(0.04, 0.04, 0.04, 0.03, 0.02)
)

Friede · Accepted Answer · 2024-10-29 13:04:39Z

2

At the risk of overlooking something relevant, here is a base R approach:

within(example_data, {
  i = startsWith(nonemployementrate, "siehe")
  nonemployementrate[i] = nonemployementrate[
    match(sub("\\D+", "", nonemployementrate[i]), constituency_no)]
  rm(i)
})

giving

  constituency_no constituency_name nonemployementrate
1             001     Hamburg-Mitte               0.04
2             002    Hamburg-Altona               0.04
3             003      Hamburg-Nord               0.04
4             004            Lübeck               0.03
5             005         Pinneberg               0.02

Edit: a simple function, since you asked for one.

f = \(X) {
  stopifnot(is.data.frame(X), 
            c("nonemployementrate", "constituency_no") %in% names(X))
  i = startsWith(X$nonemployementrate, "siehe")
  r = match(sub("\\D+", "", X$nonemployementrate[i]), X$constituency_no)
  X$nonemployementrate[i] = X$nonemployementrate[r]
  X
}
f(example_data)
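
Note that the column stays character after the replacement, because the original data mixes numbers and strings. As a minimal sketch only, the same idea can be generalised so that the column names and the prefix are arguments and the result is converted to numeric. The name `resolve_refs` and its parameters are hypothetical, not part of the answer above, and the sketch assumes every reference points at a row that holds a real value.

```r
# Hypothetical generalisation of f(): column names and prefix are arguments.
resolve_refs <- function(X, value_col, key_col, prefix = "siehe Wkr") {
  stopifnot(is.data.frame(X), c(value_col, key_col) %in% names(X))
  v <- as.character(X[[value_col]])
  # TRUE where the cell is a cross-reference; NA cells count as non-references
  i <- !is.na(v) & startsWith(v, prefix)
  # Drop all non-digit characters to get the referenced key, e.g. "001"
  r <- match(gsub("\\D+", "", v[i]), X[[key_col]])
  # Copy the value from the referenced row
  v[i] <- v[r]
  # Convert to numeric; references that could not be resolved end up as NA
  X[[value_col]] <- as.numeric(v)
  X
}

example_data <- data.frame(
  constituency_no    = c("001", "002", "003", "004", "005"),
  constituency_name  = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
  nonemployementrate = c(0.04, "siehe Wkr 001", "siehe Wkr 001", 0.03, 0.02)
)

out <- resolve_refs(example_data, "nonemployementrate", "constituency_no")
str(out$nonemployementrate)
```

Passing the column names as strings keeps the function usable for other columns in the same data set that follow the "siehe Wkr xxx" scheme.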

edited Oct 29, 2024 at 13:04

answered Oct 29, 2024 at 12:17


1 Comment

Friede Over a year ago

To the downvoter: please explain the reason for the downvote. Thank you.

SamR · Accepted Answer · 2024-10-29 12:14:32Z

1

Here is an approach which does what you describe in base R, using joins. It checks whether the nonemployementrate starts with "siehe Wkr", and if so it uses the nonemployementrate of the constituency to which it refers. If not it uses the nonemployementrate that is already present.

example_data |>
    transform(
        join_on = ifelse(
            startsWith(nonemployementrate, "siehe Wkr"),
            gsub("\\D+", "", nonemployementrate),
            constituency_no
        ),
        nonemployementrate = NULL
    ) |>
    merge(
        subset(example_data, select = c("constituency_no", "nonemployementrate")),
        by.x = "join_on",
        by.y = "constituency_no"
    ) |>
    transform(join_on = NULL) # or subset(select = -join_on)
#   constituency_no constituency_name nonemployementrate
# 1             001     Hamburg-Mitte               0.04
# 2             002    Hamburg-Altona               0.04
# 3             003      Hamburg-Nord               0.04
# 4             004            Lübeck               0.03
# 5             005         Pinneberg               0.02

`dplyr` approach

Using dplyr you can do it by grouping rather than with a join. Essentially, we create a column holding the relevant constituency number (either the one to look up or the one in that row), then group by this column and use the nonemployementrate of the one row in each group that holds an actual value. This could be replicated in base R, but that feels less natural to me. Note that the `.by` argument of `mutate()` requires dplyr 1.1.0 or later.

library(dplyr)
example_data |>
    mutate(
        lookup_rate = startsWith(nonemployementrate, "siehe Wkr"),
        nonemployementrate_num = if_else(
            lookup_rate,
            gsub("\\D+", "", nonemployementrate),
            constituency_no
        )
    ) |>
    mutate(
        nonemployementrate = nonemployementrate[!lookup_rate],
        .by = nonemployementrate_num
    ) |>
    select(constituency_no, constituency_name, nonemployementrate)

#   constituency_no constituency_name nonemployementrate
# 1             001     Hamburg-Mitte               0.04
# 2             002    Hamburg-Altona               0.04
# 3             003      Hamburg-Nord               0.04
# 4             004            Lübeck               0.03
# 5             005         Pinneberg               0.02
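
For comparison, the grouping idea could be replicated in base R roughly as follows. This is only a sketch using `ave()`, not code from the answer; the helper vectors `is_ref`, `grp` and `rate` are names chosen for illustration, and it assumes each group contains exactly one row with a real value.

```r
example_data <- data.frame(
  constituency_no    = c("001", "002", "003", "004", "005"),
  constituency_name  = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
  nonemployementrate = c(0.04, "siehe Wkr 001", "siehe Wkr 001", 0.03, 0.02)
)

# TRUE for rows whose value is a cross-reference
is_ref <- startsWith(example_data$nonemployementrate, "siehe Wkr")

# Group key: the referenced constituency number, or the row's own number
grp <- ifelse(is_ref,
              gsub("\\D+", "", example_data$nonemployementrate),
              example_data$constituency_no)

# Blank out the references, then fill each group with its first non-missing value
rate <- ifelse(is_ref, NA, example_data$nonemployementrate)
example_data$nonemployementrate <- ave(rate, grp, FUN = function(x) x[!is.na(x)][1])

example_data
```

As with the other solutions, the column remains character; wrap it in `as.numeric()` if a numeric column is needed.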

edited Oct 29, 2024 at 12:14

answered Oct 29, 2024 at 11:57


5 Comments

SamR Over a year ago

@Friede yes that looks good too - I think I am used to assigning to null because of data.table syntax but perhaps that is more explicit.

Carl Witthoft Over a year ago

Nice! Now someone needs to do a speed test of these and the other person's solution code.

SamR Over a year ago

@CarlWitthoft I am pretty sure that @Friede's solution would be faster than both of mine, as it just uses match(), whereas I am creating new columns and either joining or grouping.

Friede Over a year ago

I do not see any reason to compare running times here...

Carl Witthoft Over a year ago

@Friede Scalability is always an important factor in algorithm selection. Runtime for a few thousand records may not matter, but for a few million, yes.
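
The speed question raised in these comments could be checked with a sketch like the one below. It is an assumption-laden illustration, not a benchmark from the thread: the synthetic data, the size `n`, the 30% share of references, and the wrapper names `via_match` and `via_merge` are all made up here, and the wrappers only paraphrase the two base R answers. No timings are claimed; they depend on the machine and on the data.

```r
set.seed(1)
n   <- 1e5
ids <- sprintf("%06d", seq_len(n))
rate <- sprintf("%.2f", runif(n, 0.01, 0.10))

# Turn roughly 30% of rows into references to rows that keep a real value
is_ref  <- runif(n) < 0.3
is_ref[1] <- FALSE                      # guarantee at least one real value
targets <- sample(which(!is_ref), sum(is_ref), replace = TRUE)
rate[is_ref] <- paste("siehe Wkr", ids[targets])

big <- data.frame(constituency_no = ids, nonemployementrate = rate)

# match()-based lookup, following the first answer
via_match <- function(X) {
  i <- startsWith(X$nonemployementrate, "siehe")
  r <- match(sub("\\D+", "", X$nonemployementrate[i]), X$constituency_no)
  X$nonemployementrate[i] <- X$nonemployementrate[r]
  X
}

# merge()-based lookup, following the second answer
via_merge <- function(X) {
  ref <- startsWith(X$nonemployementrate, "siehe Wkr")
  X$join_on <- ifelse(ref, gsub("\\D+", "", X$nonemployementrate), X$constituency_no)
  lookup <- X[, c("constituency_no", "nonemployementrate")]
  X$nonemployementrate <- NULL
  out <- merge(X, lookup, by.x = "join_on", by.y = "constituency_no")
  out$join_on <- NULL
  out[order(out$constituency_no), ]     # merge() sorts by key, so restore row order
}

system.time(res_match <- via_match(big))
system.time(res_merge <- via_merge(big))

# Both approaches should agree before their timings are compared
identical(res_match$nonemployementrate, res_merge$nonemployementrate)
```

For more reliable numbers, each call would need to be repeated several times, for example with a dedicated benchmarking package.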
