This seems simple enough, but I can't figure it out. I'd like to create a new column in df2 (impute_id) that identifies whether each value (measurement) was imputed or is the raw, observed value from df1. If a row in df2 matches a row in df1, impute_id should be the string observed; if it does not match, it should be the string imputed. I'd like to do this using dplyr if possible. Note that the rows in the two data frames may not be in the same order, even though they are in the example.


Example

Raw data

df1
   time protocol     measurement_type sample measurement
1     0     HPLC cis,cis-Muconic acid      a     0.57561
2     0     HPLC            D-Glucose      a          NA
3     0     HPLC cis,cis-Muconic acid      a          NA
4     0     HPLC            D-Glucose      b          NA
5     0    OD600      Optical Density      b     0.14430
6    22     HPLC cis,cis-Muconic acid      b          NA
7    22     HPLC            D-Glucose      a          NA
8    22    OD600      Optical Density      a          NA
9    24     HPLC cis,cis-Muconic acid      a          NA
10   24     HPLC            D-Glucose      b    33.95529

Imputed Data

df2
   time protocol     measurement_type sample measurement
1     0     HPLC cis,cis-Muconic acid      a     0.57561
2     0     HPLC            D-Glucose      a    33.95529
3     0     HPLC cis,cis-Muconic acid      a     0.57561
4     0     HPLC            D-Glucose      b    33.95529
5     0    OD600      Optical Density      b     0.14430
6    22     HPLC cis,cis-Muconic acid      b     0.57561
7    22     HPLC            D-Glucose      a    33.95529
8    22    OD600      Optical Density      a     0.14430
9    24     HPLC cis,cis-Muconic acid      a     0.57561
10   24     HPLC            D-Glucose      b    33.95529

Desired Output

df2
   time protocol     measurement_type sample measurement  impute_id
1     0     HPLC cis,cis-Muconic acid      a     0.57561   observed
2     0     HPLC            D-Glucose      a    33.95529    imputed
3     0     HPLC cis,cis-Muconic acid      a     0.57561    imputed
4     0     HPLC            D-Glucose      b    33.95529    imputed
5     0    OD600      Optical Density      b     0.14430   observed
6    22     HPLC cis,cis-Muconic acid      b     0.57561    imputed
7    22     HPLC            D-Glucose      a    33.95529    imputed
8    22    OD600      Optical Density      a     0.14430    imputed
9    24     HPLC cis,cis-Muconic acid      a     0.57561    imputed
10   24     HPLC            D-Glucose      b    33.95529   observed

Reproducible Data

Raw Data

df1 <- structure(list(time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 
24L), protocol = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L), .Label = c("HPLC", "OD600"), class = "factor"), measurement_type = structure(c(1L, 
2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("cis,cis-Muconic acid", 
"D-Glucose", "Optical Density"), class = "factor"), sample = c("a", 
"a", "a", "b", "b", "b", "a", "a", "a", "b"), measurement = c(0.57561, 
NA, NA, NA, 0.1443, NA, NA, NA, NA, 33.95529)), row.names = c(NA, 
-10L), class = "data.frame")

Imputed Data

df2 <- structure(list(time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 
24L), protocol = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
1L, 1L), .Label = c("HPLC", "OD600"), class = "factor"), measurement_type = structure(c(1L, 
2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), .Label = c("cis,cis-Muconic acid", 
"D-Glucose", "Optical Density"), class = "factor"), sample = c("a", 
"a", "a", "b", "b", "b", "a", "a", "a", "b"), measurement = c(0.57561, 
33.95529, 0.57561, 33.95529, 0.1443, 0.57561, 33.95529, 0.1443, 
0.57561, 33.95529)), row.names = c(NA, -10L), class = "data.frame")

1 Answer

Maybe something like

library(dplyr)

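# Work from the raw data: the NA pattern in df1 tells us which rows get imputed.
df1 %>%
  group_by(measurement_type) %>%
  # Flag each row first, while measurement still holds the raw NAs ...
  mutate(impute_id = ifelse(is.na(measurement), "imputed", "observed"),
         # ... then fill every row with its group's only observed value
         # (min() works here because each group has one non-NA value).
         measurement = min(measurement, na.rm = TRUE))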

   time protocol     measurement_type sample measurement  impute_id
1     0     HPLC cis,cis-Muconic acid      a     0.57561 observed
2     0     HPLC            D-Glucose      a    33.95529  imputed
3     0     HPLC cis,cis-Muconic acid      a     0.57561  imputed
4     0     HPLC            D-Glucose      b    33.95529  imputed
5     0    OD600      Optical Density      b     0.14430 observed
6    22     HPLC cis,cis-Muconic acid      b     0.57561  imputed
7    22     HPLC            D-Glucose      a    33.95529  imputed
8    22    OD600      Optical Density      a     0.14430  imputed
9    24     HPLC cis,cis-Muconic acid      a     0.57561  imputed
10   24     HPLC            D-Glucose      b    33.95529 observed
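
If df2 already exists and you would rather compare it against df1 directly, regardless of row order, a join may also work. The following is only a sketch under some assumptions: it needs dplyr >= 1.0.0 for across(), it assumes the imputation left observed values numerically unchanged so they can be matched exactly, and it uses a helper column dup (a name made up here) to cope with the duplicated key combination in rows 1 and 3. The data is rebuilt with character columns instead of factors so the block runs on its own.

```r
library(dplyr)

# Same values as df1/df2 in the question, rebuilt compactly
# (character columns instead of factors, purely for brevity).
keys <- data.frame(
  time = c(0L, 0L, 0L, 0L, 0L, 22L, 22L, 22L, 24L, 24L),
  protocol = c("HPLC", "HPLC", "HPLC", "HPLC", "OD600",
               "HPLC", "HPLC", "OD600", "HPLC", "HPLC"),
  measurement_type = c("cis,cis-Muconic acid", "D-Glucose",
                       "cis,cis-Muconic acid", "D-Glucose", "Optical Density",
                       "cis,cis-Muconic acid", "D-Glucose", "Optical Density",
                       "cis,cis-Muconic acid", "D-Glucose"),
  sample = c("a", "a", "a", "b", "b", "b", "a", "a", "a", "b")
)
df1 <- mutate(keys, measurement = c(0.57561, NA, NA, NA, 0.1443,
                                    NA, NA, NA, NA, 33.95529))
df2 <- mutate(keys, measurement = c(0.57561, 33.95529, 0.57561, 33.95529, 0.1443,
                                    0.57561, 33.95529, 0.1443, 0.57561, 33.95529))

# A row counts as "matching" when every column, including the value, agrees.
match_cols <- c("time", "protocol", "measurement_type", "sample", "measurement")

# Observed rows of df1, numbered within each set of identical rows.
observed <- df1 %>%
  filter(!is.na(measurement)) %>%
  group_by(across(all_of(match_cols))) %>%
  mutate(dup = row_number()) %>%
  ungroup() %>%
  mutate(impute_id = "observed")

# Number identical rows in df2 the same way, so that the n-th copy in df2
# can only match the n-th observed copy in df1.
result <- df2 %>%
  group_by(across(all_of(match_cols))) %>%
  mutate(dup = row_number()) %>%
  ungroup() %>%
  left_join(observed, by = c(match_cols, "dup")) %>%
  mutate(impute_id = coalesce(impute_id, "imputed")) %>%  # no match: imputed
  select(-dup)

result
```

Because left_join() keeps the row order of df2, the order of df1 does not matter. One caveat: when df2 contains several identical rows but only some of them were observed (rows 1 and 3 here), the first copies in df2 are the ones labelled observed, since the rows cannot be told apart otherwise.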