I am using constituency data from the German election of 1994. Some observations contain strings indicating that the value is given in a different row, following the scheme "siehe Wkr xxx" ("see constituency xxx"). For example, the non-employment rate for Hamburg-Altona is only collected for Hamburg as a whole, so the constituency Hamburg-Altona should take the value of the observation Hamburg-Mitte.

example_data <- data.frame(constituency_no = c("001", "002", "003", "004", "005"),
                           constituency_name = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
                           nonemployementrate = c(0.04, "siehe Wkr 001", "siehe Wkr 001", 0.03, 0.02))

I want a function that automatically detects strings beginning with "siehe Wkr " and replaces each one with the value from the constituency number it refers to. In the example, the function should replace the nonemployementrate of Hamburg-Altona and Hamburg-Nord with 0.04, because their strings refer to constituency_no "001".

result <- data.frame(constituency_no = c("001", "002", "003", "004", "005"),
                     constituency_name = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
                     nonemployementrate = c(0.04, 0.04, 0.04, 0.03, 0.02))

2 Answers

At the risk of overlooking something relevant, here is a base R approach using match():

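within(example_data, {
  # rows whose value is a reference such as "siehe Wkr 001"
  i = startsWith(nonemployementrate, "siehe")
  # strip the leading non-digits to get the referenced number, then look up its row
  nonemployementrate[i] = nonemployementrate[
    match(sub("\\D+", "", nonemployementrate[i]), constituency_no)]
  rm(i)
})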

giving

  constituency_no constituency_name nonemployementrate
1             001     Hamburg-Mitte               0.04
2             002    Hamburg-Altona               0.04
3             003      Hamburg-Nord               0.04
4             004            Lübeck               0.03
5             005         Pinneberg               0.02

Edit: a simple function, since you asked for one.

f = \(X) {
  stopifnot(is.data.frame(X), 
            c("nonemployementrate", "constituency_no") %in% names(X))
  i = startsWith(X$nonemployementrate, "siehe")
  r = match(sub("\\D+", "", X$nonemployementrate[i]), X$constituency_no)
  X$nonemployementrate[i] = X$nonemployementrate[r]
  X
}
f(example_data)
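
Note that the column is still character after the replacement, whereas the desired result in the question is numeric. A minimal sketch of a variant that converts at the end and fails early on references to missing constituency numbers; the name `f_numeric` and the extra check are illustrative assumptions, not part of the function above. It needs R >= 4.1 for the `\(X)` syntax, and it does not resolve references that point to another reference.

```r
example_data <- data.frame(
  constituency_no = c("001", "002", "003", "004", "005"),
  constituency_name = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
  nonemployementrate = c(0.04, "siehe Wkr 001", "siehe Wkr 001", 0.03, 0.02)
)

# Hypothetical variant of f(): same lookup, plus validation and numeric conversion
f_numeric = \(X) {
  stopifnot(is.data.frame(X),
            c("nonemployementrate", "constituency_no") %in% names(X))
  rate = as.character(X$nonemployementrate)
  i = startsWith(rate, "siehe")
  r = match(sub("\\D+", "", rate[i]), X$constituency_no)
  # stop if a reference points to a constituency number that is not in the data
  stopifnot(!anyNA(r))
  rate[i] = rate[r]
  X$nonemployementrate = as.numeric(rate)
  X
}

result = f_numeric(example_data)
str(result$nonemployementrate)
```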

1 Comment

To the downvoter. Please explain why the downvote. Thank you.

Here is an approach which does what you describe in base R, using joins. It checks whether the nonemployementrate starts with "siehe Wkr", and if so, it uses the nonemployementrate of the constituency to which it refers. If not, it uses the nonemployementrate that is already present.

example_data |>
    transform(
        join_on = ifelse(
            startsWith(nonemployementrate, "siehe Wkr"),
            gsub("\\D+", "", nonemployementrate),
            constituency_no
        ),
        nonemployementrate = NULL
    ) |>
    merge(
        subset(example_data, select = c("constituency_no", "nonemployementrate")),
        by.x = "join_on",
        by.y = "constituency_no"
    ) |>
    transform(join_on = NULL) # or subset(select = -join_on)
#   constituency_no constituency_name nonemployementrate
# 1             001     Hamburg-Mitte               0.04
# 2             002    Hamburg-Altona               0.04
# 3             003      Hamburg-Nord               0.04
# 4             004            Lübeck               0.03
# 5             005         Pinneberg               0.02

dplyr approach

Using dplyr you can do it by grouping rather than with a join. Essentially we create a column holding the relevant constituency number (either the one to look up or the one in that row), then group by this column and use the nonemployementrate of the row in each group that holds a real value. This could be replicated in base R but feels less natural to me.

library(dplyr)
example_data |>
    mutate(
        lookup_rate = startsWith(nonemployementrate, "siehe Wkr"),
        nonemployementrate_num = if_else(
            lookup_rate,
            gsub("\\D+", "", nonemployementrate),
            constituency_no
        )
    ) |>
    mutate(
        nonemployementrate = nonemployementrate[!lookup_rate],
        .by = nonemployementrate_num
    ) |>
    select(constituency_no, constituency_name, nonemployementrate)

#   constituency_no constituency_name nonemployementrate
# 1             001     Hamburg-Mitte               0.04
# 2             002    Hamburg-Altona               0.04
# 3             003      Hamburg-Nord               0.04
# 4             004            Lübeck               0.03
# 5             005         Pinneberg               0.02
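
For comparison, the grouping idea can be replicated in base R with ave(). This is a minimal sketch, assuming every group contains exactly one row with a real value; `is_ref`, `group_key` and `rate` are illustrative names that do not appear in the code above.

```r
example_data <- data.frame(
  constituency_no = c("001", "002", "003", "004", "005"),
  constituency_name = c("Hamburg-Mitte", "Hamburg-Altona", "Hamburg-Nord", "Lübeck", "Pinneberg"),
  nonemployementrate = c(0.04, "siehe Wkr 001", "siehe Wkr 001", 0.03, 0.02)
)

# TRUE for rows that only hold a reference such as "siehe Wkr 001"
is_ref <- startsWith(example_data$nonemployementrate, "siehe Wkr")

# grouping key: the referenced number for reference rows, the row's own number otherwise
group_key <- ifelse(is_ref,
                    gsub("\\D+", "", example_data$nonemployementrate),
                    example_data$constituency_no)

# blank out the references so that each group has a single non-missing value
rate <- ifelse(is_ref, NA_character_, example_data$nonemployementrate)

# spread that single value over all rows of its group
example_data$nonemployementrate <- ave(rate, group_key,
                                       FUN = function(x) x[!is.na(x)][1])
example_data
```

The column is still character at this point; as.numeric() converts it once all references are resolved.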

5 Comments

@Friede yes that looks good too - I think I am used to assigning to null because of data.table syntax but perhaps that is more explicit.
Nice! Now someone needs to do a speed test of these and the other person's solution code.
@CarlWitthoft I am pretty sure that @Friede's solution would be faster than both of mine as it just uses match() whereas I am creating new columns and either joining or grouping.
I do not see any reason to compare running times here...
@Friede Scalability is always an important factor in algorithm selection. Runtime for a few thousand records may not matter, but for a few million, yes.
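
For anyone who wants to check this, here is a rough sketch of how the match() and merge() approaches could be timed on synthetic data. The data size, the share of reference rows, and the wrapper names `via_match` and `via_merge` are arbitrary assumptions for illustration; timings depend on the machine, so none are quoted here.

```r
set.seed(1)
n <- 1e5  # synthetic size, chosen arbitrarily

# Every row starts with a real rate ...
big <- data.frame(
  constituency_no = sprintf("%06d", seq_len(n)),
  nonemployementrate = as.character(round(runif(n, 0.01, 0.10), 3))
)
# ... then 30000 rows beyond the first 1000 become references to one of
# the first 1000 rows, which always keep a real value.
ref_rows <- sample(1001:n, size = 30000)
targets <- sample(big$constituency_no[1:1000], length(ref_rows), replace = TRUE)
big$nonemployementrate[ref_rows] <- paste("siehe Wkr", targets)

# Approach 1: direct lookup with match()
via_match <- function(X) {
  i <- startsWith(X$nonemployementrate, "siehe")
  r <- match(sub("\\D+", "", X$nonemployementrate[i]), X$constituency_no)
  X$nonemployementrate[i] <- X$nonemployementrate[r]
  X
}

# Approach 2: self-join with merge()
via_merge <- function(X) {
  lookup <- X[c("constituency_no", "nonemployementrate")]
  X$join_on <- ifelse(startsWith(X$nonemployementrate, "siehe Wkr"),
                      gsub("\\D+", "", X$nonemployementrate),
                      X$constituency_no)
  X$nonemployementrate <- NULL
  out <- merge(X, lookup, by.x = "join_on", by.y = "constituency_no")
  out$join_on <- NULL
  # merge() sorts by the join key, so restore the original row order
  out[order(out$constituency_no), ]
}

system.time(res_match <- via_match(big))
system.time(res_merge <- via_merge(big))

# both approaches should yield the same values
identical(res_match$nonemployementrate, res_merge$nonemployementrate)
```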
