My dataset is fairly large: about 2,000 variables and 1,000 observations. I want to fit a model for each variable, using the other variables as predictors. To do so, for each dependent variable I need to drop every variable that has missing values in rows where the dependent variable does not.

For instance, when "A" is the dependent variable I need to drop C and D, because each has a missing value in a row where A is observed. When "C" is the dependent variable I can keep D.

data <- read.table(text="
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2",header=TRUE,sep="")

I think I need to make a loop to go through each variable.

  • Taking the first part of your "I meant" sentence, would you also be dropping the row in A that has A==NA? Commented Mar 27, 2014 at 1:16
  • Yes, I would like to do that too, but the main problem I want to solve is dropping variables based on missing values. Commented Mar 27, 2014 at 1:18
  • What will you choose when your dependent variable is "D"? Or does something like that never happen in your actual data? Commented Mar 27, 2014 at 10:57
  • @alexis_laz That is a good point; then there would be no data. This is just an example I made. My actual data has so many variables that I am not really concerned about it. Let me change the example. Thank you! Commented Mar 27, 2014 at 13:31

2 Answers

I think this gets what you need:

for (i in seq_along(data)) {

  # keep only the rows where column 'i', the column we
  # currently care about, is not NA

  tmp <- data[!is.na(data[[i]]), , drop = FALSE]

  # column 'i' now has no NA values, so drop every other
  # column that still contains an NA in the remaining rows

  tmp <- tmp[sapply(tmp, function(x) !any(is.na(x)))]

  # run your model on 'tmp'

}

For each iteration of i, the tmp data frame looks like:

'data.frame':   5 obs. of  2 variables:
 $ A: int  1 2 4 2 1
 $ B: int  3 1 2 5 1

'data.frame':   5 obs. of  2 variables:
 $ A: int  1 2 4 2 1
 $ B: int  3 1 2 5 1

'data.frame':   5 obs. of  2 variables:
 $ C: int  9 3 3 4 1
 $ D: int  4 4 5 3 2

'data.frame':   5 obs. of  2 variables:
 $ C: int  9 3 3 4 1
 $ D: int  4 4 5 3 2
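
As a rough sketch of what the "run your model" step could look like, the same filtering can be wrapped in lapply so the fitted models end up in a named list. Using lm() as the model and reformulate() to build the formula are assumptions on my part, since the question doesn't say which model is intended, and the names fits and predictors are made up for illustration.

```r
data <- read.table(text = "
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE)

fits <- lapply(names(data), function(dep) {
  # keep the rows where the dependent variable is observed
  tmp <- data[!is.na(data[[dep]]), , drop = FALSE]
  # keep the columns that have no NA left in those rows
  tmp <- tmp[, colSums(is.na(tmp)) == 0, drop = FALSE]
  predictors <- setdiff(names(tmp), dep)
  # fall back to an intercept-only model if nothing else survives
  if (length(predictors) == 0) predictors <- "1"
  # lm() is only a stand-in for whatever model is actually wanted
  lm(reformulate(predictors, response = dep), data = tmp)
})
names(fits) <- names(data)

# which predictors each model ended up using
lapply(fits, function(m) names(coef(m)))
```

On the real data, with more candidate predictors than observations, a plain lm would be rank-deficient, so some variable selection or a penalized model would probably be needed at that step.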

I'll provide a way to get the usable variables for each column you choose:

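getVars <- function(data, col) {
  # among the rows where 'col' is not NA, flag the columns without any NA
  tmp <- !sapply(data[!is.na(data[[col]]), , drop = FALSE], function(x) any(is.na(x)))
  # return the names of those columns, leaving out 'col' itself (given by name)
  names(data)[tmp & names(data) != col]
}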

PS: I'm on my phone, so I couldn't test the above or style the code properly.

EDIT: Styling fixed!
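
For what it's worth, here is a small sketch of how getVars could be applied to every column of the example data from the question at once. The name usable is just something I picked for illustration, and col is assumed to be a column name rather than an index.

```r
data <- read.table(text = "
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE)

getVars <- function(data, col) {
  # among the rows where 'col' is not NA, flag the columns without any NA
  tmp <- !sapply(data[!is.na(data[[col]]), , drop = FALSE], function(x) any(is.na(x)))
  # return the names of those columns, leaving out 'col' itself
  names(data)[tmp & names(data) != col]
}

# one entry per dependent variable, named after that variable
usable <- sapply(names(data), function(v) getVars(data, v), simplify = FALSE)
usable
# $A is "B", $B is "A", $C is "D", $D is "C"
```

Each element of usable can then be used to build the model formula for that variable.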

2 Comments

+1! I guess that's what the OP is looking for. There's just a ", " missing when subsetting data. You could also build tmp without sapply, like tmp <- colSums(is.na(data[!is.na(data[[col]]), ])) == 0, and perhaps modify the output so it doesn't return col again?
@alexis_laz You're right. I would've done it with a PC :). I'll fix it later.
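
The colSums suggestion from the comment above might look roughly like this. It is only a sketch: the function name getVarsColSums is made up here, and it assumes col is passed as a column name.

```r
data <- read.table(text = "
A  B  C  D
1  3  9  4
2  1  3  4
NA NA 3  5
4  2  NA NA
2  5  4  3
1  1  1  2", header = TRUE)

getVarsColSums <- function(data, col) {
  # rows where the chosen column is observed
  rows <- !is.na(data[[col]])
  # a column is usable if it has zero NAs within those rows
  ok <- colSums(is.na(data[rows, , drop = FALSE])) == 0
  # drop the chosen column itself from the result
  setdiff(names(data)[ok], col)
}

getVarsColSums(data, "A")  # "B"
getVarsColSums(data, "C")  # "D"
```

Since is.na() on a data frame returns a logical matrix, colSums counts the missing values of every column in one call, which avoids the per-column sapply.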
