Subset columns based on certain columns missing value

Question

My dataset is pretty big: about 2,000 variables and 1,000 observations. I want to run a model for each variable, using the other variables as predictors. To do so, for each dependent variable I need to drop every variable that has missing values in rows where the dependent variable is not missing.

For instance, for variable "A" I need to drop variables C and D, because they have missing values in rows where A is observed. For variable "C" I can keep variable "D".

data <- read.table(text="
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE, sep = "")

I think I need to make a loop to go through each variable.

Taking the first part of your "I meant" sentence: would you also be dropping the row in A that has A == NA?
– hrbrmstr, Commented Mar 27, 2014 at 1:16
Yes, I would like to do that too, but the main problem I want to solve is dropping variables based on missing values.
– user976856, Commented Mar 27, 2014 at 1:18
What will you choose when your dependent variable is "D"? Or does something like that never happen in your actual data?
– alexis_laz, Commented Mar 27, 2014 at 10:57
@alexis_laz That is a good point. Then there would be no data. This is an example I made up. In my actual data there are too many variables, so I am not really concerned about it. Let me change the example. Thank you!
– user976856, Commented Mar 27, 2014 at 13:31

hrbrmstr · Accepted Answer · 2014-03-27 01:37:48Z

1

I think this gets what you need:

for (i in seq_along(data)) {

  # keep only the rows where column 'i', the column
  # we currently care about, is not NA

  tmp <- data[!is.na(data[[i]]), ]

  # column 'i' now has no NA values, so drop every other
  # column that still contains an NA

  tmp <- tmp[sapply(tmp, function(x) !any(is.na(x)))]

  # run your model on 'tmp'

}

With the example data as currently posted in the question, str(tmp) for each iteration of i (columns A, B, C and D in turn) shows:

'data.frame':   5 obs. of  2 variables:
 $ A: int  1 2 4 2 1
 $ B: int  3 1 2 5 1

'data.frame':   5 obs. of  2 variables:
 $ A: int  1 2 4 2 1
 $ B: int  3 1 2 5 1

'data.frame':   5 obs. of  2 variables:
 $ C: int  9 3 3 4 1
 $ D: int  4 4 5 3 2

'data.frame':   5 obs. of  2 variables:
 $ C: int  9 3 3 4 1
 $ D: int  4 4 5 3 2
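
The "run your model" step is left open above. One possible way to fill it in is sketched below, assuming a plain linear model via lm() stands in for whatever model is actually wanted. The names models, dep and predictors are introduced here for illustration only and are not part of the original answer.

```r
# Example data from the question.
data <- read.table(text = "
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE)

models <- list()  # hypothetical container: one fitted model per dependent variable

for (dep in names(data)) {
  # keep rows where the dependent variable is observed
  tmp <- data[!is.na(data[[dep]]), , drop = FALSE]

  # keep only the columns that have no NA in those rows
  tmp <- tmp[vapply(tmp, function(x) !anyNA(x), logical(1))]

  # everything that survived, apart from the dependent variable itself
  predictors <- setdiff(names(tmp), dep)
  if (length(predictors) == 0) next  # nothing left to regress on

  # assumption: lm() is only a placeholder for the real model
  models[[dep]] <- lm(reformulate(predictors, response = dep), data = tmp)
}

names(models)
```

The length check covers the case raised in the comments, where a dependent variable ends up with no usable predictors. With roughly 2,000 variables and 1,000 observations, an unpenalized lm() on all surviving columns is unlikely to be appropriate, so treat the model call as the part to swap out.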

Michele · Accepted Answer · 2014-03-27 14:04:42Z

1

Here is a way to get the usable variables for whichever column you choose:

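getVars <- function(data, col) {
  # TRUE for columns with no NA in the rows where `col` is observed
  tmp <- !sapply(data[!is.na(data[[col]]), ], function(x) any(is.na(x)))
  # return those column names, excluding `col` itself
  names(data)[tmp & names(data) != col]
}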

PS: I'm on my phone, so I haven't tested the above or had the chance to style the code properly.

EDIT: Styling fixed!
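
As a usage sketch, getVars can be applied to every column at once to get the usable predictors for each dependent variable. The list name usable is introduced here for illustration; the data is the example from the question.

```r
# Example data from the question.
data <- read.table(text = "
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE)

# getVars as defined in this answer.
getVars <- function(data, col) {
  tmp <- !sapply(data[!is.na(data[[col]]), ], function(x) any(is.na(x)))
  names(data)[tmp & names(data) != col]
}

# One entry per dependent variable; setNames() makes the result a named list.
usable <- lapply(setNames(names(data), names(data)), getVars, data = data)
str(usable)
# A -> "B", B -> "A", C -> "D", D -> "C"
```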

edited Mar 27, 2014 at 14:04

answered Mar 27, 2014 at 1:37

2 Comments

alexis_laz Over a year ago

+1! I guess that's what the OP is looking for. There's just a ", " missing when subsetting data. You could also build tmp without sapply, like tmp <- colSums(is.na(data[!is.na(data[[col]]), ])) == 0, and perhaps modify the output so it doesn't return col again?

Michele Over a year ago

@alexis_laz You're right. I would've done it on a PC :). I'll fix it later.
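
The colSums variant suggested in the comment above could look roughly like this. The function name getVarsColSums and the helper names keep_rows and complete are made up here for illustration.

```r
# Example data from the question.
data <- read.table(text = "
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE)

# Hypothetical name; same idea as getVars, but counting NAs per column
# with colSums() instead of looping over columns with sapply().
getVarsColSums <- function(data, col) {
  keep_rows <- !is.na(data[[col]])
  # is.na() on a data frame gives a logical matrix; a column sum of 0
  # means the column has no NA in the retained rows
  complete <- colSums(is.na(data[keep_rows, , drop = FALSE])) == 0
  # drop `col` itself from the result
  setdiff(names(data)[complete], col)
}

getVarsColSums(data, "A")  # "B"
getVarsColSums(data, "C")  # "D"
```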
