Needing advice on linear regression and then replacing NA's with fitted values

Question

I am quite new to data analytics and R/RStudio, so I need some advice. I am doing a project and have been asked to do the following:

for every variable that has missing value to run a linear regression model using all the rows that don't have NAs. Then I need to replace the NA's with the fitted values of every model I ran.

Variables are: price, sqm, age, feats, ne, cor, tax.
The variables with missing values are age and tax.

Dna <- apply(is.na(Data), 2, which)
lmAGE <- lm(AGE ~ PRICE + SQM + FEATS, Data)
lmTAX <- lm(TAX ~ PRICE + SQM + FEATS, Data)
na <- apply(is.na(Data), 1, which)
for (i in na) {
  prAGE <- predict(lmAGE, interval="prediction")
  prTAX <- predict(lmTAX, new, interval="prediction")
}

My problem is that lm doesn't take the NAs into consideration, so predict does the same thing. I am currently struggling to think of a way to solve this.

If I use addNA, could this work?

Or if I use

new <- data.frame(years=c(10, 20))

Something like that, but then I cannot add all the other non-NA variables.

I've tried using multiple for loops and had an idea using if, but I think that was too advanced for me, as I am really new to this and ran into multiple errors along the way.

Edward · Accepted Answer · 2025-05-08 02:55:05Z

You didn't provide any data, so I'll use the mtcars dataset to illustrate how to impute the "mpg" variable.

cars <- mtcars 
set.seed(1)  # Make this reproducible
cars[sample(nrow(cars), 5), "mpg"] <- NA  # Generate 5 missing values in mpg at random
head(cars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           NA   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       NA   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      NA   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

From the question: "for every variable that has missing value to run a linear regression model using all the rows that don't have NAs."

# Run the regression on mpg (NA are excluded by default)
lm.mpg <- lm(mpg~., data=cars)  

# Create a data frame containing only records with NA in mpg
newdata <- cars[is.na(cars$mpg),]

# Predict the values of mpg which are NA using the values from all other variables
(p.mpg <- predict(lm.mpg, newdata))

       Mazda RX4    Mazda RX4 Wag   Hornet 4 Drive       Duster 360 Pontiac Firebird 
        24.47857         24.10666         20.25447         13.84317         15.90702

A better way to do this is to use the mice package. It uses Gibbs sampling to generate multivariate imputations using chained equations.

library(mice)
imp <- mice(cars) # The default is to generate m=5 imputations
imp$imp$mpg
                    1    2    3    4    5
Mazda RX4        21.4 22.8 22.8 27.3 24.4
Mazda RX4 Wag    21.5 26.0 21.5 30.4 19.7
Hornet 4 Drive   15.8 15.2 19.2 15.2 21.5
Duster 360       13.3 16.4 16.4 17.3 16.4
Pontiac Firebird 16.4 15.5 18.7 10.4 19.2

rowMeans(imp$imp$mpg)
       Mazda RX4    Mazda RX4 Wag   Hornet 4 Drive       Duster 360 Pontiac Firebird 
           23.74            23.82            17.38            15.96            16.04

These are fairly similar to the first values we obtained manually above.
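
To get a filled-in data frame out of the mice object, rather than just the table of candidate values, complete() returns the data with one set of imputations inserted. A minimal sketch, assuming the mice package is installed; the seed and printFlag arguments and the name cars_complete are my additions for reproducibility and illustration, and the mpg ~ wt + hp model is just a placeholder analysis:

```r
library(mice)

cars <- mtcars
set.seed(1)
cars[sample(nrow(cars), 5), "mpg"] <- NA

# m = 5 imputed data sets; seed makes the draws reproducible
imp <- mice(cars, m = 5, seed = 1, printFlag = FALSE)

# Data frame with the first set of imputations filled in
cars_complete <- complete(imp, 1)
anyNA(cars_complete$mpg)

# For inference, fit the model on every imputed data set and pool the results
fit <- with(imp, lm(mpg ~ wt + hp))
summary(pool(fit))
```

Pooling over all imputed data sets keeps the uncertainty about the missing values in the final estimates, which a single filled-in data frame does not.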

From the question: "Then I need to replace the NA's with the fitted values of every model I ran."

cars$mpg[is.na(cars$mpg)] <- p.mpg

You can use the same method as above for other variables that have missing values. The mice package is convenient because it handles all of them in one command, and multiple imputation is widely regarded as a gold-standard approach to missing data.
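
Repeating the manual method for several variables can be sketched as a loop. This is only an illustration under some assumptions: dat is a hypothetical stand-in for your data (mtcars with NAs injected into two columns), and each model uses only the fully observed columns as predictors so that every row with a missing value can be predicted. In your case the targets would be age and tax.

```r
# Hypothetical stand-in for the asker's data: two columns with missing values
dat <- mtcars
set.seed(1)
dat[sample(nrow(dat), 5), "mpg"]  <- NA
dat[sample(nrow(dat), 5), "qsec"] <- NA

# Columns that contain at least one NA
targets <- names(dat)[colSums(is.na(dat)) > 0]
# Use only fully observed columns as predictors, so every NA row can be predicted
predictors <- setdiff(names(dat), targets)

imputed <- dat
for (v in targets) {
  miss <- is.na(dat[[v]])
  # Rows with NA in the outcome are dropped from the fit by default
  fit <- lm(reformulate(predictors, response = v), data = dat)
  # Predict only for the rows where the outcome is missing
  imputed[[v]][miss] <- predict(fit, newdata = dat[miss, , drop = FALSE])
}

colSums(is.na(imputed))[targets]
```

Fitting on dat and writing into a copy (imputed) means the imputed values of one variable never leak into the model for another.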

jay.sf · Accepted Answer · 2025-05-08 07:49:05Z

You can use the indices stored in the "na.action" component. Example using modified mtcars:

> head(mtcars2)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90    NA 2.620  0  1    4    4  #
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 2.875  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 2.320  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215    NA  1  0    3    1  #
Hornet Sportabout 18.7   8  360 175 3.15    NA 3.440  0  0    3    2  #
Valiant           18.1   6  225 105 2.76 3.460 3.460  1  0    3    1

Fit the models:

> fit_wt <- lm(wt ~ mpg + cyl + disp + hp + drat + vs + am + gear + carb, mtcars2)
> fit_qsec <- lm(qsec ~ mpg + cyl + disp + hp + drat + vs + am + gear + carb, mtcars2)

Notice the na.action component:

> fit_wt$na.action
        Mazda RX4 Hornet Sportabout Chrysler Imperial 
                1                 5                17 
attr(,"class")
[1] "omit"

You can then replace the missings with the fitted values:

> mtcars2 |> 
+   transform(wt=replace(wt, fit_wt$na.action, fit_wt$fitted.values[fit_wt$na.action]),
+             qsec=replace(qsec, fit_qsec$na.action, fit_qsec$fitted.values[fit_qsec$na.action]))
                   mpg cyl disp  hp drat       wt     qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.983000 2.620000  0  1    4    4  #
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875000 2.875000  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320000 2.320000  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215000 3.825437  1  0    3    1  #
Hornet Sportabout 18.7   8  360 175 3.15 4.014875 3.440000  0  0    3    2  #
Valiant           18.1   6  225 105 2.76 3.460000 3.460000  1  0    3    1
...

Or simply

> mtcars2$wt[fit_wt$na.action] <- fit_wt$fitted.values[fit_wt$na.action]
> mtcars2$qsec[fit_qsec$na.action] <- fit_qsec$fitted.values[fit_qsec$na.action]
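
One thing I would double-check here: as far as I can tell, with the default na.action = na.omit, fitted.values only holds the rows that were used in the fit, so indexing it by the row positions stored in na.action may not line up with the omitted rows. A sketch of an alternative that calls predict() on exactly the rows where the outcome is missing; d, miss and fit are names made up for illustration, and the data is a simplified stand-in with NAs in wt only:

```r
set.seed(42)
d <- mtcars
d$wt[sample.int(nrow(d), 3)] <- NA

fit <- lm(wt ~ mpg + cyl + disp + hp, data = d)

# Fitted values exist only for the rows used in the fit
length(fit$fitted.values)   # number of complete rows
length(fit$na.action)       # number of dropped rows

# Predict for the rows whose outcome is missing and fill them in
miss <- is.na(d$wt)
d$wt[miss] <- predict(fit, newdata = d[miss, ])
```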

Additionally, we could round to the given accuracy.

> dlen <- \(x) {
+   ## returns max number of decimal digits
+   max(nchar(gsub(".*\\.", "", na.omit(x))))
+ }
> mtcars2 |> 
+   transform(wt=replace(wt, fit_wt$na.action, fit_wt$fitted.values[fit_wt$na.action]) |> 
+               round(dlen(mtcars2$wt)),  # round wt to the precision of wt
+             qsec=replace(qsec, fit_qsec$na.action, fit_qsec$fitted.values[fit_qsec$na.action]) |> 
+               round(dlen(mtcars2$qsec)))  # round qsec to the precision of qsec
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.983 2.620  0  1    4    4  #
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 2.875  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 2.320  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 3.825  1  0    3    1  #
Hornet Sportabout 18.7   8  360 175 3.15 4.015 3.440  0  0    3    2  #
Valiant           18.1   6  225 105 2.76 3.460 3.460  1  0    3    1
...

Note that model.frame, which lm uses internally, includes all variables referenced in the formula in the data frame, even those preceded by -. So when a model is abbreviated as wt ~ . - qsec, the qsec variable is not excluded prior to evaluation, and rows with NA in qsec are still dropped:

> lm(wt ~ . - qsec, mtcars2a)$na.action
        Mazda RX4    Hornet 4 Drive Hornet Sportabout          Merc 280 Chrysler Imperial  Pontiac Firebird 
                1                 4                 5                10                17                25 
attr(,"class")
[1] "omit"
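
If the goal is to keep the rows that are only missing in qsec, one option is to drop the column from the data before fitting instead of subtracting it in the formula. A small sketch; d, fit_formula and fit_subset are hypothetical names, and d is a stand-in data set with NAs injected into wt and qsec:

```r
set.seed(42)
d <- mtcars
d$wt[sample.int(nrow(d), 3)]   <- NA
d$qsec[sample.int(nrow(d), 3)] <- NA

# "- qsec" removes the term, but rows with NA in qsec are still dropped
fit_formula <- lm(wt ~ . - qsec, data = d)

# Dropping the column first keeps those rows in the fit
fit_subset <- lm(wt ~ ., data = d[, names(d) != "qsec"])

length(fit_formula$na.action)
length(fit_subset$na.action)
```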

Statistical note: Make sure that you use the appropriate model; in this example, the outcome variable is continuous, so OLS is perfectly fine. However, for binary or count outcomes, you'd need logistic or Poisson regression, respectively.
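
For a binary variable, the same idea could look roughly like this with glm(). This is a sketch, not a recommendation: the NAs in am are injected for illustration, wt is an arbitrary choice of predictor, and the 0.5 cutoff is my assumption.

```r
set.seed(1)
d <- mtcars
d$am[sample(nrow(d), 4)] <- NA   # binary variable with missing values

# Logistic regression on the complete rows (rows with NA in am are dropped)
fit <- glm(am ~ wt, data = d, family = binomial)

miss <- is.na(d$am)
# type = "response" returns predicted probabilities of am == 1
p <- predict(fit, newdata = d[miss, ], type = "response")

# Assumption: classify with a 0.5 cutoff;
# drawing from rbinom(length(p), 1, p) would be another option
d$am[miss] <- as.integer(p > 0.5)
```

For a count outcome, the analogous call would use family = poisson, where type = "response" returns the predicted mean count.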

Data:

set.seed(42)
# Note: qsec is built from wt here, which is why the qsec column in the output above mirrors wt
mtcars2 <- transform(mtcars, 
                     wt=replace(wt, sample.int(nrow(mtcars), nrow(mtcars)*.1), NA), 
                     qsec=replace(wt, sample.int(nrow(mtcars), nrow(mtcars)*.1), NA))
