
I have a dataset with a population variable, as well as a few races ("white", "black", "hispanic"), and I want to be able to loop through the races so that for each race, a "percent_race" variable is created ("percent_white", etc.), and the race variable is then dropped.

I am most familiar with Stata, where you can reference the loop variable inside the loop by wrapping its name in a backtick and an apostrophe (`race'). This lets me name the new variables using a string from my loop, and the same string also indicates which variables go into the formula for those new variables. Here is what I mean:

loc races white black hispanic

foreach race of local races {
   generate `race'_percentage = (population/`race')*100
   drop `race'
}

In R, I want something to the same effect:

races <- list("white", "black", "hispanic")

df %>%
   for (race in races) {
      mutate(percent_"race" = (population/race)*100) %>%
      select(df, -c(race)) %>%
      }

I threw the quotes around race when naming the variable as a filler; I know that doesn't work, but you see how I want the variables to be named.

There may be other things wrong with my approach in R. I've always done data transformation and analysis in Stata and moved to R for visualization, but I'm trying to learn to do it all in R. I'm not even sure whether a for loop inside a pipe is proper here, but it makes sense to me for this little problem I have created for myself.

  • Can you post the format of your data with dput(head(df))? I think what you are asking should be quite straightforward but it's not clear what your data looks like - i.e. what is being divided by what. See here for more. Commented Jul 20, 2022 at 12:09

2 Answers

1

Your Stata code implies a certain structure for df, namely separate columns for white, black, and hispanic. If so, the data should look something like the sample data I have constructed below, and you can use mutate(across()) to transform the three variables. Note that a percentage is the race count divided by population, not the other way around as in your code, so that is what I compute here.

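library(dplyr)

races <- c("white", "black", "hispanic")
df %>%
  # for each race column, compute its share of population and name it percent_<race>
  mutate(across(all_of(races), ~ .x * 100 / population, .names = "percent_{.col}")) %>%
  select(-all_of(races))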

Output:

   population percent_white percent_black percent_hispanic
1       71662     96.303480     0.5288716         3.167648
2       77869     90.231029     4.0503923         5.718579
3       22985     69.071133    12.7996519        18.129215
4       49924     79.546911     7.5454691        12.907620
5       88292      2.462284    14.8699769        82.667739
6       82554     47.779635     7.2485888        44.971776
7       65403     75.846674     5.6297112        18.523615
8       85160     21.641616    36.5124472        41.845937
9       66434     31.819550    18.1352922        50.045158
10      29641     23.163861    65.9154549        10.920684

Input:

set.seed(123)
df = data.frame(population=sample(20000:100000, size = 10)) %>% 
  mutate(
    white = ceiling(population*runif(10)),
    black = ceiling((population-white)*runif(10)),
    hispanic = population-white-black
)

   population white black hispanic
1       71662 69013   379     2270
2       77869 70262  3154     4453
3       22985 15876  2942     4167
4       49924 39713  3767     6444
5       88292  2174 13129    72989
6       82554 39444  5984    37126
7       65403 49606  3682    12115
8       85160 18430 31094    35636
9       66434 21139 12048    33247
10      29641  6866 19538     3237
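
If you would rather keep something that reads like the Stata loop, the sketch below shows one way to do it. It assumes the same wide layout as above; the two-row data frame and the name `result` are made up for illustration. The pieces that replace Stata's macro quoting are `.data[[race]]`, which looks up a column by a name stored in a string, and a glue-style name on the left of `:=`, which builds the new column name.

```r
library(dplyr)

# Hypothetical toy data; the column names mirror the question.
df <- data.frame(
  population = c(100, 200),
  white      = c(50, 80),
  black      = c(30, 70),
  hispanic   = c(20, 50)
)

races <- c("white", "black", "hispanic")

result <- df
for (race in races) {
  result <- result %>%
    # "percent_{race}" is filled in with the current value of race;
    # .data[[race]] fetches the column whose name is stored in race
    mutate("percent_{race}" := .data[[race]] / population * 100) %>%
    # drop the original count column, as the Stata loop does
    select(-all_of(race))
}

result
```

The loop reassigns `result` on every pass instead of piping a data frame into `for`. The across() version is shorter, but this form may be easier to follow when coming from Stata.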

3 Comments

That is exactly what I needed and makes a lot of sense for the most part. Where can I read up on the exact syntax you used there? I'm a little confused by the "~.x" and the "{.col}". I've practiced regexes with grepl and gsub a bit, but I don't exactly get how you knew what to put there.
The second argument of across() is the function you want to apply to each of the columns selected by the first argument. The ~ is the tidy (purrr) shorthand for a lambda function, and .x is a stand-in for the current column (similar to Python's lambda x: x.upper(), for example). The .names argument is a glue-style specification for naming the new columns: here I combined the literal string "percent_" with the special {.col}, which refers to the name of the current column.
That makes a lot of sense. Thank you so much!
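
To make the lambda point concrete, here is a small sketch comparing the `~ .x` shorthand with an ordinary anonymous function. The data frame `toy` and the names `a` and `b` are made up for illustration; the two calls should give the same result.

```r
library(dplyr)

# Hypothetical two-row data frame for illustration.
toy <- data.frame(population = c(100, 200), white = c(50, 80))

# purrr-style lambda: .x stands for the current column
a <- toy %>%
  mutate(across(white, ~ .x * 100 / population, .names = "percent_{.col}"))

# the same computation written as a regular anonymous function
b <- toy %>%
  mutate(across(white, function(x) x * 100 / population, .names = "percent_{.col}"))

# compare the two results
all.equal(a, b)
```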
0

Piping a data frame into a for loop like that is atypical, if not outright disallowed. A more typical, tidy approach is to reshape the data, compute the percentages, and reshape back:

df <- data.frame(
  id = c('1', '2', '3'),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)

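library(dplyr)  # provides %>%

df %>%
  tidyr::pivot_longer(!c(id, population)) %>%
  dplyr::mutate(percent = value / population * 100) %>%
  # values_from must point at percent; the default (value) would spread the raw counts
  tidyr::pivot_wider(id_cols = c(id, population), names_from = name,
                     values_from = percent, names_prefix = "percent_")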

This code takes the wide data, reshapes it to long (so each 'id/race' combination is unique), calculates the percent, and then goes back to a wide format with the names percent_'race'.
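
To see what the intermediate long format looks like, the sketch below stops after the percent calculation. The small data frame and the column names `race` and `count` are chosen here for illustration (the code above keeps the default names `name` and `value`), and the result is multiplied by 100 so it reads as a percentage.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with two race columns.
df <- data.frame(
  id         = c("1", "2"),
  population = c(100, 200),
  white      = c(50, 80),
  black      = c(50, 120)
)

long <- df %>%
  # one row per id/race combination
  pivot_longer(!c(id, population), names_to = "race", values_to = "count") %>%
  # share of the population, expressed as a percentage
  mutate(percent = count / population * 100)

long
```

Each id now appears once per race, so a single mutate() covers every race without a loop.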
