
I would like to predict values from a linear regression from multiple groups in a single dataframe. I have found the following blogpost which ALMOST does everything I need: https://www.r-bloggers.com/2016/09/running-a-model-on-separate-groups/

However, I cannot combine this with the predict() function and its newdata argument. For one group, I use the following:

m <- lm(y ~ x, df)
new_df <- data.frame(x=c(5))
predict(m, new_df)

This gives me the predicted value for y at x=5.

How do I do this when I have multiple groups in my df? This is what I tried:

df %>%
    nest(-group) %>%
    mutate(fit = map(data, ~ lm(.$y ~ .$x)),
           results = map(fit, predict)) %>%
    unnest(results)

When I try to use results = map(fit, predict(new_df)), I only get an error. Is there a way to pass my value for x (in this case 5) into the code above?

Ideally, I would get a new data.frame with two columns, group and the predicted y-value.

This is a sample data.frame:

group   x   y
g1  1   2
g1  1.5 3
g1  2   4
g1  2.3 4.4
g1  3   6
g1  3.4 6.2
g1  4.11    7
g1  4.8 7.9
g1  5   8
g1  5.3 8.2
g2  2   5
g2  2.3 4
g2  4   2.2
g2  4.4 1.9
g2  7   0.3

EDIT:

Plotting the sample data using ggplot2, I get the following plot:

ggplot(df, aes(x,y,colour=group)) +
 geom_point() +
 stat_smooth(method="lm", se=FALSE)

[Plot: y against x, coloured by group, with one fitted lm line per group]

Using the following code, I get the sought after predicted y-values:

predict(lm(y ~ x, df[df$group =="g1", ]), new_df)
       1 
8.180285 

predict(lm(y ~ x, df[df$group =="g2", ]), new_df)
       1 
1.732136 

I would like to generate a new dataframe which should look something like this and contain the predicted y-value at x=5:

group   y_predict  
g1  8.180285  
g2  1.732136
  • As @Marcos Perez says below, this is a perfect case for splitting your dataframe into a list, and applying a lm function across the list elements. Commented Nov 30, 2020 at 17:01

3 Answers

The input is shown reproducibly in the Note at the end. Since we only need the fitted values, we don't need nest and can just use mutate:

library(dplyr)

df %>%
  group_by(group) %>%
  mutate(pred = fitted(lm(y ~ x))) %>%
  ungroup %>%
  select(group, pred)

giving:

# A tibble: 15 x 2
   group    pred
   <chr>   <dbl>
 1 g1     2.47  
 2 g1     3.19  
 3 g1     3.90  
 4 g1     4.33  
 5 g1     5.33  
 6 g1     5.90  
 7 g1     6.91  
 8 g1     7.89  
 9 g1     8.18  
10 g1     8.61  
11 g2     4.41  
12 g2     4.15  
13 g2     2.63  
14 g2     2.27  
15 g2    -0.0563

This could also be done like this:

library(dplyr)

df %>%
  mutate(pred = fitted(lm(y ~ x*group + 0, df))) %>%
  select(group, pred)

or like this using base R only:

transform(df, pred = fitted(lm(y ~ x*group + 0, df)))[c("group", "pred")]

or using lmList from nlme (which comes with R so it does not have to be installed):

library(dplyr)
library(nlme)

df %>%
  mutate(pred = fitted(lmList(y ~ x | group, df))) %>%
  select(group, pred)

or using lmList without dplyr:

library(nlme)

transform(df, pred = fitted(lmList(y ~ x | group, df)))[c("group", "pred")]
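The single-model variants above return the same fitted values as separate per-group fits because y ~ x*group + 0 gives every group its own intercept and slope. A minimal sketch that checks this on the sample data (the names fit_all and fit_sep are introduced here for illustration only):

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "
group x y
g1 1 2
g1 1.5 3
g1 2 4
g1 2.3 4.4
g1 3 6
g1 3.4 6.2
g1 4.11 7
g1 4.8 7.9
g1 5 8
g1 5.3 8.2
g2 2 5
g2 2.3 4
g2 4 2.2
g2 4.4 1.9
g2 7 0.3")

# One model with a separate intercept and slope for each group
fit_all <- lm(y ~ x * group + 0, data = df)

# Separate fits per group, put back into the original row order
fit_sep <- unsplit(
  lapply(split(df, df$group), function(d) fitted(lm(y ~ x, data = d))),
  df$group
)

# Compare the two sets of fitted values, ignoring names
all.equal(unname(fitted(fit_all)), unname(fit_sep))
```

If the comparison holds, either form can be used; the single model is shorter, while the per-group fits keep one lm object per group.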

Note

Lines <- "
group   x   y
g1  1   2
g1  1.5 3
g1  2   4
g1  2.3 4.4
g1  3   6
g1  3.4 6.2
g1  4.11    7
g1  4.8 7.9
g1  5   8
g1  5.3 8.2
g2  2   5
g2  2.3 4
g2  4   2.2
g2  4.4 1.9
g2  7   0.3"
df <- read.table(text = Lines, header = TRUE)

Added

In response to the comments, this code produces the prediction at x = 5 for each group:

df %>%
  group_by(group) %>%
  summarize(pred = predict(lm(y ~ x), list(x = 5)), .groups = "drop") %>%
  select(group, pred)
## # A tibble: 2 x 2
##   group  pred
##   <chr> <dbl>
## 1 g1     8.18
## 2 g2     1.73
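For readers who want to stay with the nest/map layout from the question, the following is a hedged sketch rather than part of the original answer. It assumes tidyr >= 1.0 (for the nest(data = -group) syntax) and purrr are installed. The key change is fitting with lm(y ~ x, data = .x) instead of lm(.$y ~ .$x), since predict() can only match newdata to the model when the formula uses plain column names:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample data from the question
df <- read.table(header = TRUE, text = "
group x y
g1 1 2
g1 1.5 3
g1 2 4
g1 2.3 4.4
g1 3 6
g1 3.4 6.2
g1 4.11 7
g1 4.8 7.9
g1 5 8
g1 5.3 8.2
g2 2 5
g2 2.3 4
g2 4 2.2
g2 4.4 1.9
g2 7 0.3")

new_df <- data.frame(x = 5)

res <- df %>%
  nest(data = -group) %>%
  mutate(
    # Fit one model per group, referring to columns by name
    fit = map(data, ~ lm(y ~ x, data = .x)),
    # One prediction per model at the x values in new_df
    y_predict = map_dbl(fit, ~ unname(predict(.x, newdata = new_df)))
  ) %>%
  select(group, y_predict)

res
```

map_dbl is used because new_df has a single row; with several rows, map plus unnest would be the closer fit.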

3 Comments

I don't really get the output that you create. What are the values in "predict"? I would just like to get one prediction at x=5; i.e.: predict(lm(y ~ x, df[df$group =="g1", ]), new_df) This gives 8.180285 which is the value I am looking for.
I guess the values in "predict" are the predicted values for the "x" values in my df? But how do I get just one prediction for x=5, even if I don't have a value at x=5?
See the section "Added", which has been appended to the end of the answer.

This is a perfect case for the lapply function. Try this:

# Fit y ~ x on one group's rows (d is that group's data frame)
linear_model <- function(d) lm(y ~ x, data = d)
m <- lapply(split(df, df$group), linear_model)

Now, you have a list of linear models. Let's use it to predict y-value of your new_df for all models:

new_df <- data.frame(x=c(5))
my_predict <- function(m) predict(m,new_df)
sapply(m,my_predict)

Output:

#     g1.1     g2.1 
# 8.180285 1.732136

The output is a named numeric vector.
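To get the two-column data frame the question asks for, the named vector can be wrapped in data.frame(). This is a sketch under the assumption that new_df has exactly one row; the names models, preds and result are illustrative and not from the original answer. Dropping the prediction's own name with unname() also avoids the ".1" suffix:

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "
group x y
g1 1 2
g1 1.5 3
g1 2 4
g1 2.3 4.4
g1 3 6
g1 3.4 6.2
g1 4.11 7
g1 4.8 7.9
g1 5 8
g1 5.3 8.2
g2 2 5
g2 2.3 4
g2 4 2.2
g2 4.4 1.9
g2 7 0.3")

new_df <- data.frame(x = 5)

# One lm per group, stored in a list named by group
models <- lapply(split(df, df$group), function(d) lm(y ~ x, data = d))

# One unnamed prediction per model; vapply keeps the group names
preds <- vapply(models, function(m) unname(predict(m, new_df)), numeric(1))

# Two-column result: group and predicted y at x = 5
result <- data.frame(group = names(preds), y_predict = preds, row.names = NULL)
result
```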

3 Comments

Thank you! This is exactly what I wanted to do. Is there a way to get rid of the ".1" from the output?
Sure, the ".1" comes from the row name of new_df. You could rename the row, e.g. new_df <- data.frame(x = c(5), row.names = c("")), but it is better not to: when new_df has more than one row, the output is a matrix whose column names are "g1" and "g2". Try: new_df <- data.frame(x = c(5:6)).
Is there also a way to get m in the form of a data.frame? as.data.frame(m) gives the following error: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class ‘"lm"’ to a data.frame

What you are describing is an estimation with varying intercepts and slopes.

With lm

You can do that directly using lm:

base = iris
names(base) = c("y", "x1", "x2", "x3", "species")

newdata = data.frame(x1 = 5, species = c("setosa", "versicolor", "virginica"))
res_1 = lm(y ~ species/x1, base)
newdata$y = predict(res_1, newdata)
newdata
#>   x1    species        y
#> 1  5     setosa 6.091450
#> 2  5 versicolor 7.865123
#> 3  5  virginica 8.414509

The shortcut species/x1 means species + species:x1, i.e. the factor variable and the interaction between the factor and the variable. Thus there will be one intercept and one coefficient associated to x1 for each group (here species).

The predict method can then be used as usual and gives the requested result, without any loops or lapply.
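Applied to the sample data from the question instead of iris, the same idea might look like the sketch below. The column name y_predict is taken from the desired output in the question; fit is an illustrative name:

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "
group x y
g1 1 2
g1 1.5 3
g1 2 4
g1 2.3 4.4
g1 3 6
g1 3.4 6.2
g1 4.11 7
g1 4.8 7.9
g1 5 8
g1 5.3 8.2
g2 2 5
g2 2.3 4
g2 4 2.2
g2 4.4 1.9
g2 7 0.3")

# group/x expands to group + group:x, i.e. one intercept and one slope per group
fit <- lm(y ~ group / x, data = df)

# One row per group, all evaluated at x = 5
newdata <- data.frame(group = c("g1", "g2"), x = 5)
newdata$y_predict <- predict(fit, newdata)

newdata[c("group", "y_predict")]
```

Because the model is fully interacted, these predictions should match the per-group values quoted in the question (8.180285 for g1 and 1.732136 for g2).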

Alternative method

An alternative is to use a package specialized in this kind of model, such as fixest. Since it is specialized in fixed-effects estimation, the run time will be substantially lower for large data sets.

library(fixest)

# Using variables with varying slopes
res_2 = feols(y ~ 1 | species[x1], base)
predict(res_2, newdata)
#>        1        2        3 
#> 6.091450 7.865123 8.414509

Some explanations:

  • Your group is the variable species here.
  • feols is the equivalent of lm, but you can define fixed effects after the pipe (|) in the formula.
  • species[x1] means species fixed-effects (i.e. one intercept per species) + x1 having one coefficient per species (varying slopes).
