I have a problem and would appreciate a hand. I have tried to come up with a solution, but I have no idea how to work it out.

Please use this to recreate my dataframe.

df1 <- structure(list(A1 = c(87L, 67L, 80L, 36L, 71L, 6L, 26L, 15L, 
14L, 46L, 19L, 93L, 5L, 94L), A2 = c(50L, NA, 73L, 58L, 47L, 
74L, 39L, NA, NA, NA, NA, NA, NA, NA), A3 = c(NA, 38L, 10L, 41L, 
NA, 66L, NA, 7L, 29L, NA, 70L, 23L, 46L, 55L)), .Names = c("A1", 
"A2", "A3"), class = "data.frame", row.names = c(NA, -14L))

I have this dataframe:

A1  A2  A3
87  50  NA
67  NA  38
80  73  10
36  58  41
71  47  NA
6   74  66
26  39  NA
15  NA  7
14  NA  29
46  NA  NA
19  NA  70
93  NA  23
5   NA  46
94  NA  55

How can I keep only the columns of the dataframe that have at least 7 non-missing observations? The desired output looks like this (each remaining column has >= 7 observations):

A1  A3
87  NA
67  38
80  10
36  41
71  NA
6   66
26  NA
15  7
14  29
46  NA
19  70
93  23
5   46
94  55

I welcome any solution that can generalize to more columns.

  • Read ?colSums. If you struggle with that, please share the first data frame using dput. Commented Dec 6, 2018 at 8:13
  • I did not understand your comment. What do you mean? Commented Dec 6, 2018 at 8:15
  • Sorry for not being clear here. The function you mainly need is colSums. Use dput to share your data; see "How to make a great R reproducible example". Commented Dec 6, 2018 at 8:17
  • I have amended my question to make it reproducible. I do not know how to use colSums. Could you show me how, please? Commented Dec 6, 2018 at 8:31

1 Answer

Try

df1[, colSums(!is.na(df1)) >= 7]
#   A1 A3
#1  87 NA
#2  67 38
#3  80 10
#4  36 41
#5  71 NA
#6   6 66
#7  26 NA
#8  15  7
#9  14 29
#10 46 NA
#11 19 70
#12 93 23
#13  5 46
#14 94 55

Step by step

What you need to do first is to find out which values of your data are not missing.

!is.na(df1)

This returns a logical matrix

#        A1    A2    A3
# [1,] TRUE  TRUE FALSE
# [2,] TRUE FALSE  TRUE
# [3,] TRUE  TRUE  TRUE
# [4,] TRUE  TRUE  TRUE
# [5,] TRUE  TRUE FALSE
# [6,] TRUE  TRUE  TRUE
# [7,] TRUE  TRUE FALSE
# [8,] TRUE FALSE  TRUE
# [9,] TRUE FALSE  TRUE
#[10,] TRUE FALSE FALSE
#[11,] TRUE FALSE  TRUE
#[12,] TRUE FALSE  TRUE
#[13,] TRUE FALSE  TRUE
#[14,] TRUE FALSE  TRUE

Use colSums to find out how many observations per column are not missing

colSums(!is.na(df1))
#A1 A2 A3 
#14  6 10

Apply your condition, "greater or equal of 7 observations(count) per columns":

colSums(!is.na(df1)) >= 7
#   A1    A2    A3 
# TRUE FALSE  TRUE
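
To check which columns the condition keeps or drops before subsetting, the logical vector can be used to index the column names. This is only a minimal sketch: df1 is rebuilt from the question's data so the snippet runs on its own, and keep is just an illustrative name.

```r
# Rebuild the question's data so this snippet is self-contained
df1 <- data.frame(
  A1 = c(87L, 67L, 80L, 36L, 71L, 6L, 26L, 15L, 14L, 46L, 19L, 93L, 5L, 94L),
  A2 = c(50L, NA, 73L, 58L, 47L, 74L, 39L, NA, NA, NA, NA, NA, NA, NA),
  A3 = c(NA, 38L, 10L, 41L, NA, 66L, NA, 7L, 29L, NA, 70L, 23L, 46L, 55L)
)

# keep is an illustrative name for the logical vector computed above
keep <- colSums(!is.na(df1)) >= 7

names(df1)[keep]   # columns that meet the threshold
# [1] "A1" "A3"
names(df1)[!keep]  # columns that would be dropped
# [1] "A2"
```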

Finally, you need to use this vector to subset your data

df1[, colSums(!is.na(df1)) >= 7]

Turn this into a function if you need it regularly

almost_complete_cols <- function(data, min_obs) {
  data[, colSums(!is.na(data)) >= min_obs, drop = FALSE]
}

almost_complete_cols(df1, 7)
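
The drop = FALSE in the function matters when only one column meets the threshold: without it, [ simplifies the result to a plain vector instead of a data frame. Here is a small sketch with the same data (rebuilt so the snippet runs on its own) and a threshold of 11, which only A1 satisfies; one_col is an illustrative name.

```r
# Rebuild the question's data so this snippet is self-contained
df1 <- data.frame(
  A1 = c(87L, 67L, 80L, 36L, 71L, 6L, 26L, 15L, 14L, 46L, 19L, 93L, 5L, 94L),
  A2 = c(50L, NA, 73L, 58L, 47L, 74L, 39L, NA, NA, NA, NA, NA, NA, NA),
  A3 = c(NA, 38L, 10L, 41L, NA, 66L, NA, 7L, 29L, NA, 70L, 23L, 46L, 55L)
)

almost_complete_cols <- function(data, min_obs) {
  data[, colSums(!is.na(data)) >= min_obs, drop = FALSE]
}

# Only A1 has at least 11 non-missing values (14 of them)
one_col <- colSums(!is.na(df1)) >= 11

is.data.frame(df1[, one_col])                 # default drop: result is a vector
# [1] FALSE
is.data.frame(almost_complete_cols(df1, 11))  # drop = FALSE keeps a data frame
# [1] TRUE
```

Adding drop = FALSE to the one-liner at the top of this answer has the same effect.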