1

I have a data set with some null values in one field. When I try to run a linear regression, it treats the integers in the field as category indicators, not numbers.

E.g., for a field that contains no null values...

summary(lm(rank ~ num_ays, data=a))

Returns:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.607597   0.019927 532.317  < 2e-16 ***
num_ays      0.021955   0.007771   2.825  0.00473 ** 

But when I run the same model on a field with null values, I get:

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  1.225e+01  1.070e+00   11.446  < 2e-16 ***
num_azs0    -1.780e+00  1.071e+00   -1.663  0.09637 .  
num_azs1    -1.103e+00  1.071e+00   -1.030  0.30322    
num_azs10   -9.297e-01  1.080e+00   -0.861  0.38940    
num_azs100   1.750e+00  5.764e+00    0.304  0.76141    
num_azs101  -6.250e+00  4.145e+00   -1.508  0.13161    

What's the best and/or most efficient way to handle this, and what are the tradeoffs?

4
  • By null, do you mean NA? Is there a chance that num_azs is a factor? Looks like uncleaned data to me... Commented Oct 25, 2010 at 19:50
  • I don't think it's a factor. Both num_ays and num_azs came from a MySQL export. Field type for both is integer, but num_azs can contain null values. Commented Oct 25, 2010 at 19:56
  • What does summary(a) say your data columns are? I guess a non-numeric value is causing the conversion to factor. The solution is to convert to numeric using as.numeric(as.character(foo)). Commented Oct 25, 2010 at 20:52
  • Thanks, Marek et al—turns out it's listed as a factor. I'll seek my answers in a different question. Commented Oct 25, 2010 at 21:33
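The diagnosis in the comments above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the data frame and the literal "NULL" text are assumptions standing in for whatever the MySQL export wrote for missing values.

```r
# Hypothetical stand-in for the exported column: one non-numeric entry
# ("NULL") mixed in with integers, stored as a factor.
a <- data.frame(
  rank    = c(10.2, 11.1, 9.8, 12.4, 10.9),
  num_azs = factor(c("0", "1", "NULL", "10", "100"))
)

str(a$num_azs)  # reports a Factor, which is why lm() builds one dummy per level

# Go through character so the labels are parsed as numbers.
# "NULL" cannot be parsed and becomes NA; the coercion warning is expected here.
a$num_azs_num <- suppressWarnings(as.numeric(as.character(a$num_azs)))

# Calling as.numeric() directly on the factor returns internal level codes,
# not the original values, so it is kept here only for comparison.
codes <- as.numeric(a$num_azs)

# With a numeric predictor, lm() estimates a single slope again.
fit <- lm(rank ~ num_azs_num, data = a)
```

Converting via as.character() first matters because a factor stores integer codes internally; skipping that step silently produces the wrong numbers.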

2 Answers

3

You can ignore null values like so:

a[!is.null(a$num_ays),]

2 Comments

Thanks, Shane. I tried to apply that using: summary(lm(rank ~ num_ays, data=a[!is.null(a$num_ays)])). It gave me the same output, though.
is.null returns TRUE if the object is NULL and FALSE otherwise. So your construct returns either all rows of a or a 0-row data.frame. I'm pretty sure you were thinking of is.na ;)
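A small sketch of the difference described in the comment above, using a made-up data frame (the values are illustrative only):

```r
# Hypothetical data with one missing value in num_ays.
a <- data.frame(rank    = c(10.2, 11.1, 9.8, 12.4),
                num_ays = c(3, NA, 7, 1))

null_check <- is.null(a$num_ays)  # one logical for the whole column object
na_check   <- is.na(a$num_ays)    # one logical per element

# !is.null(...) is a single TRUE here, which is recycled over all rows,
# so nothing is filtered out.
all_rows <- a[!is.null(a$num_ays), ]

# !is.na(...) tests each element, so the row with the missing value is dropped.
kept_rows <- a[!is.na(a$num_ays), ]
```

is.null() asks whether the object itself is NULL, while is.na() tests each element for a missing value, which is what row filtering needs.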
2

And to build on Shane's answer: you can use that in the data= argument of lm():

summary(lm(rank ~ num_ays, data=a[!is.null(a$num_ays),]))

3 Comments

Thanks, Dirk. I tried that but it's still treating the numbers in the column as category labels... same result as before. Am I missing something else as well?
You are being tripped up by factors. That is a different issue. Try searching for "[r] factor" (i.e. the term factor within posts tagged [r] for R). You will need to read the data differently, and/or convert it.
Isn't it better to use the subset argument of lm?
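The two suggestions in these comments, reading the data differently and using the subset argument, might look roughly like this. It is a sketch under assumptions: the CSV text, the literal "NULL" marker, and the data frame names are invented for illustration, and the real export may mark missing values differently.

```r
# Reading: tell read.csv() which strings mean "missing" so the column is
# parsed as numbers with NA, rather than as text or a factor.
# The "NULL" marker is an assumption about how the export wrote nulls.
csv_text <- "rank,num_azs\n10.2,0\n11.1,NULL\n9.8,10\n12.4,3\n"
b <- read.csv(text = csv_text, na.strings = c("NA", "NULL"))

# Fitting: the subset argument filters rows inside lm() itself,
# so no filtered copy of the data frame is needed.
a <- data.frame(rank    = c(10.2, 11.1, 9.8, 12.4, 10.9, 11.7),
                num_ays = c(3, NA, 7, 1, 4, NA))

fit_subset <- lm(rank ~ num_ays, data = a, subset = !is.na(num_ays))

# In a standard session the default na.action is na.omit, so lm() already
# drops rows with NA in the model variables and gives the same fit.
fit_default <- lm(rank ~ num_ays, data = a)
```

Handling the missing-value marker at read time addresses the factor problem at its source; the subset argument is mainly useful for filters other than missingness, since rows with NA are normally dropped anyway.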
