Select values based on other columns

Question

I have a data frame (df, a sample of which is shown below). For each row, I want to pick the values in columns a1, b1 and c1 whose corresponding values in a2, b2 and c2 are positive, and average them. For example, in the first row all values in a2, b2 and c2 are positive, so I pick the corresponding values in a1, b1 and c1 and average them, which gives 0.4933. In the second row only the value in c2 is positive, so I pick just the value in c1 (0.3).

a1      b1      c1      a2      b2      c2      desired outcome
0.51    0.49    0.48    0.05    0.03    0.09    0.493333
0.33    0.31    0.3     -0.03   -0.05   0.01    0.3
0.22    0.2     0.19    0.04    0.02    0.08    0.203333
0.54    0.52    0.51    -0.05   0.08    -0.01   0.52
0.45    0.43    0.42    -0.03   -0.05   0.01    0.42

Below is my code, in which I tried to list every scenario. I am looking for more efficient code that can handle more columns.

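library(dplyr)

# Enumerate the sign patterns of a2, b2, c2 and average the matching a1, b1, c1.
# mean(a1, b1, c1) would not average three columns, so plain arithmetic is used.
df2 <- df1 %>%
  mutate(outcome = ifelse(a2 > 0 & b2 > 0 & c2 > 0, (a1 + b1 + c1) / 3,
                   ifelse(a2 > 0 & b2 > 0 & c2 < 0, (a1 + b1) / 2,
                   ifelse(a2 > 0 & b2 < 0 & c2 < 0, a1,
                   ifelse(a2 < 0 & b2 > 0 & c2 > 0, (b1 + c1) / 2,
                   ifelse(a2 < 0 & b2 < 0 & c2 > 0, c1,
                          b1))))))  # fallback, intended for "only b2 positive"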

df1$`desired outcome` <- rowMeans(df1[ , grepl("1", names(df1))] * (df1[ , grepl("2", names(df1))] > 0))
– M--, Commented Dec 10, 2018 at 17:43

G. Grothendieck · Accepted Answer · 2018-12-10 17:14:06Z

1) Here Mean does the calculation for one row and we apply it to each row separately. We are assuming that you want to zero the elements of the first 3 columns whose corresponding element among the last 3 columns is not positive, and then take the mean of all 3 resulting values.

Mean <- function(x) mean(x[1:3] * (x[4:6] > 0))
transform(df2, desired = apply(df2, 1, Mean))

giving:

    a1   b1   c1    a2    b2    c2   desired
1 0.51 0.49 0.48  0.05  0.03  0.09 0.4933333
2 0.33 0.31 0.30 -0.03 -0.05  0.01 0.1000000
3 0.22 0.20 0.19  0.04  0.02  0.08 0.2033333
4 0.54 0.52 0.51 -0.05  0.08 -0.01 0.1733333
5 0.45 0.43 0.42 -0.03 -0.05  0.01 0.1400000

2) or without apply:

transform(df2, desired = rowMeans(df2[1:3] * (df2[4:6] > 0)))

giving:

    a1   b1   c1    a2    b2    c2   desired
1 0.51 0.49 0.48  0.05  0.03  0.09 0.4933333
2 0.33 0.31 0.30 -0.03 -0.05  0.01 0.1000000
3 0.22 0.20 0.19  0.04  0.02  0.08 0.2033333
4 0.54 0.52 0.51 -0.05  0.08 -0.01 0.1733333
5 0.45 0.43 0.42 -0.03 -0.05  0.01 0.1400000
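
Since the question asks for something that copes with more columns, here is a minimal sketch that pairs the columns by name instead of by position. It assumes every value column ends in "1" and has a flag column with the same stem ending in "2"; the names value_cols and flag_cols are illustrative only and are not part of the original answer.

```r
# Same sample data as the df2 used in this answer.
df2 <- read.table(header = TRUE, text = "
a1   b1   c1   a2    b2    c2
0.51 0.49 0.48 0.05  0.03  0.09
0.33 0.31 0.30 -0.03 -0.05 0.01
0.22 0.20 0.19 0.04  0.02  0.08
0.54 0.52 0.51 -0.05 0.08  -0.01
0.45 0.43 0.42 -0.03 -0.05 0.01")

# Assumed naming convention: value columns end in "1", flag columns in "2".
value_cols <- grep("1$", names(df2), value = TRUE)  # a1, b1, c1
flag_cols  <- sub("1$", "2", value_cols)            # a2, b2, c2, in matching order

# Zero every value whose flag is not positive, then average across all value columns.
desired <- rowMeans(df2[value_cols] * (df2[flag_cols] > 0))
transform(df2, desired = desired)
```

Deriving flag_cols from value_cols keeps each value aligned with its own flag even if the columns are not stored in the same order.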

Note

The input df2 in reproducible form:

Lines <- "
a1       b1      c1      a2      b2      c2 
0.51    0.49    0.48    0.05    0.03    0.09
0.33    0.31    0.3    -0.03    -0.05   0.01
0.22    0.2     0.19    0.04    0.02    0.08
0.54    0.52    0.51    -0.05   0.08    -0.01
0.45    0.43    0.42    -0.03   -0.05   0.01"
df2 <- read.table(text = Lines, header = TRUE)

edited Dec 10, 2018 at 17:14

answered Dec 10, 2018 at 17:06


1 Comment

Roger, over a year ago

Thanks for the answer. I changed your second method to transform(df2, desired = rowSums(df2[1:3] * (df2[4:6] > 0)) / rowSums(df2[4:6] > 0)). This takes the average of only those values in a1, b1 and c1 whose corresponding value in a2, b2 or c2 is larger than 0.
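
The idea in this comment, averaging only the kept values rather than all three, can be sketched as the row sum of the kept values divided by the number of positive flags. The names values, positive and selected_mean are illustrative, and the data is the sample from the question.

```r
# Same sample data as the df2 in the answer above.
df2 <- read.table(header = TRUE, text = "
a1   b1   c1   a2    b2    c2
0.51 0.49 0.48 0.05  0.03  0.09
0.33 0.31 0.30 -0.03 -0.05 0.01
0.22 0.20 0.19 0.04  0.02  0.08
0.54 0.52 0.51 -0.05 0.08  -0.01
0.45 0.43 0.42 -0.03 -0.05 0.01")

values   <- df2[1:3]      # a1, b1, c1
positive <- df2[4:6] > 0  # logical matrix: TRUE where a2, b2, c2 are positive

# Sum of the kept values divided by how many values were kept in that row.
selected_mean <- rowSums(values * positive) / rowSums(positive)
transform(df2, desired = selected_mean)
```

A row with no positive flag would divide 0 by 0 here and give NaN, so such rows may need separate handling.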

Emil Bode · Accepted Answer · 2018-12-10 19:02:03Z

Subsetting means choosing values based on a condition, and that condition does not have to be about the values themselves.
That may sound abstract, but an example makes it clear:

 df[1,1:3][df[1,4:6]>0]

From the first row we take the first three columns, but only those whose corresponding values are TRUE. The corresponding values are the answers to the question "are you positive?" asked of columns 4 to 6 in the first row.
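
To look at the mask and the selection separately, here is a small sketch using the sample data from the question. The names row2_mask, row2_values and row1_mean are illustrative only.

```r
# Sample data from the question, stored as df.
df <- read.table(header = TRUE, text = "
a1   b1   c1   a2    b2    c2
0.51 0.49 0.48 0.05  0.03  0.09
0.33 0.31 0.30 -0.03 -0.05 0.01
0.22 0.20 0.19 0.04  0.02  0.08
0.54 0.52 0.51 -0.05 0.08  -0.01
0.45 0.43 0.42 -0.03 -0.05 0.01")

# The condition: which of a2, b2, c2 are positive in row 2? Only c2 is.
row2_mask <- df[2, 4:6] > 0
print(row2_mask)

# The selection: values of a1, b1, c1 in row 2 at the positions where the mask is TRUE.
row2_values <- df[2, 1:3][row2_mask]
print(row2_values)

# Row 1 keeps all three values, so its mean is the plain average of a1, b1, c1.
row1_mean <- mean(df[1, 1:3][df[1, 4:6] > 0])
print(row1_mean)
```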

For this first row all three are TRUE, but for the 2nd one we only get one value: .3. And now we can just take the mean, and if we want to do it for all rows, we can use sapply:

outcome <- sapply(1:nrow(df), function(i) {mean(df[i,1:3][df[i,4:6]>0])})

The only catch is a row in which none of a2, b2 and c2 is positive: the selection is then empty, and mean returns NaN ("Not a Number").
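
The NaN case can be reproduced and handled with a vectorized variant that hides unselected values as NA. This is a sketch, not part of the original answer: the sixth row is a hypothetical all-negative row added only for illustration, and turning NaN into NA at the end is an optional choice.

```r
# Sample data from the question plus one hypothetical row (the last one)
# in which a2, b2 and c2 are all negative.
df <- read.table(header = TRUE, text = "
a1   b1   c1   a2    b2    c2
0.51 0.49 0.48 0.05  0.03  0.09
0.33 0.31 0.30 -0.03 -0.05 0.01
0.22 0.20 0.19 0.04  0.02  0.08
0.54 0.52 0.51 -0.05 0.08  -0.01
0.45 0.43 0.42 -0.03 -0.05 0.01
0.10 0.20 0.30 -0.01 -0.02 -0.03")

values <- as.matrix(df[1:3])
values[!(as.matrix(df[4:6]) > 0)] <- NA    # hide values whose flag is not positive

outcome <- rowMeans(values, na.rm = TRUE)  # NaN where a row has nothing left to average
outcome[is.nan(outcome)] <- NA             # optional: report such rows as NA instead
outcome
```

Because the masking works on whole matrices, the same lines apply unchanged when more value and flag columns are added, as long as df[1:3] and df[4:6] are widened to match.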
