Imputing Missing Values in R from reference data frame

Question

I have a data frame 'df1' of dimensions 17000 x 3 containing walking data. The interval column holds the 5-minute intervals of each 24-hour period, the date column is the date, and the steps column is the number of steps taken in that 5-minute interval on that date. NAs are present.

> head(df1)
  steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25

I've used dplyr to group my df by date and then created a new df 'df.1' by summarizing it as avg = mean(steps, na.rm = TRUE). This gives me a nice little df of the mean number of steps on each date:

         date      avg
1  2012-10-01      NaN
2  2012-10-02  0.43750
3  2012-10-03 39.41667
4  2012-10-04 42.06944
5  2012-10-05 46.15972
6  2012-10-06 53.54167

What I would like to do is update my original df's NA-values with the mean value from each date.

So wherever steps is NA for 2012-10-02 in the first table, I'd like to replace every such NA with the value 0.43750. I've tried indices, which, %in%, and the apply family, and just can't find anything that sticks.

Any help would be greatly appreciated.

Have you tried merge? Also, since you have used dplyr, mutate would be an option to add the column to the original dataset instead of summarise.
– akrun, Commented Aug 5, 2015 at 22:18
So maybe use rownames or index values and merge on matching index values?
– Zach, Commented Aug 5, 2015 at 22:19
Maybe library(dplyr); df1 %>% group_by(date) %>% mutate(avg = mean(steps, na.rm = TRUE))
– akrun, Commented Aug 5, 2015 at 22:21
I already have that bit (the mean calculated). What I need is to then update every NA value in the original df with the mean value for that given day.
– Zach, Commented Aug 5, 2015 at 22:23
What I meant is that you don't need to create a second dataset; instead you can do this in one step with mutate. If you do need the second dataset, then merge(df1, df.1, by='date', all=TRUE) and then replace the NA values in steps with the new column.
– akrun, Commented Aug 5, 2015 at 22:23
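
The one-step mutate idea from the comments might look roughly like the sketch below. The toy values are made up for illustration, and the result name `imputed` is hypothetical; only the column names come from the question.

```r
library(dplyr)

# Toy data (hypothetical values) with the same three columns as the question
df1 <- data.frame(
  steps    = c(NA, 2, 4, 10, NA, 20),
  date     = rep(c("2012-10-02", "2012-10-03"), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  stringsAsFactors = FALSE
)

# Within each date, replace NA steps with that date's mean of the observed steps
imputed <- df1 %>%
  group_by(date) %>%
  mutate(steps = ifelse(is.na(steps), mean(steps, na.rm = TRUE), steps)) %>%
  ungroup()
```

Because the mean is computed inside the grouped mutate, no second data frame and no merge are needed.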

ulfelder · Accepted Answer · 2015-08-05 22:54:13Z

2

This is a little clunky, but it works:

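library(dplyr)

# Compute each date's mean, join it back onto df1, then fill the NAs
df1.1 <- df1 %>%
    group_by(date) %>%
    summarise(avg = mean(steps, na.rm = TRUE)) %>%        # one row per date
    merge(df1, ., by = "date", all.x = TRUE) %>%          # attach avg to every row of df1
    mutate(steps = ifelse(is.na(steps), avg, steps)) %>%  # replace only the NAs
    select(-avg)                                          # drop the helper column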

Here's my toy data:

df1 <- data.frame(date = c(rep("2015-01-01", 12), rep("2015-01-02", 12)), interval = rep(seq(12), 2),
    steps = c(5, 7, NA, 12, 3, NA, 0, 4, 12, 10, 4, 0, 3, NA, 2, 1, NA, 15, 0, 4, 7, 2, NA, 2),
    stringsAsFactors = FALSE)

Which looks like:

> head(df1)
        date interval steps
1 2015-01-01        1     5
2 2015-01-01        2     7
3 2015-01-01        3    NA
4 2015-01-01        4    12
5 2015-01-01        5     3
6 2015-01-01        6    NA

And here's the head of the result, df1.1:

> head(df1.1)
        date interval steps
1 2015-01-01        1   5.0
2 2015-01-01        2   7.0
3 2015-01-01        3   5.7
4 2015-01-01        4  12.0
5 2015-01-01        5   3.0
6 2015-01-01        6   5.7

Here's a table of the group means to show where those 5.7s come from:

> df1 %>% group_by(date) %>% summarise(avg = mean(steps, na.rm = TRUE))
Source: local data frame [2 x 2]

        date avg
1 2015-01-01 5.7
2 2015-01-02 4.0
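
For comparison, the same per-date fill can be sketched in base R with ave(), which applies a function within each group and returns a vector in the original row order. The helper name `fill_with_group_mean` and the result name `df1.2` are hypothetical; the toy data is the same as above.

```r
# Same toy data as in the answer
df1 <- data.frame(date = c(rep("2015-01-01", 12), rep("2015-01-02", 12)),
    interval = rep(seq(12), 2),
    steps = c(5, 7, NA, 12, 3, NA, 0, 4, 12, 10, 4, 0, 3, NA, 2, 1, NA, 15, 0, 4, 7, 2, NA, 2),
    stringsAsFactors = FALSE)

# Hypothetical helper: fill the NAs in one group with that group's mean
fill_with_group_mean <- function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}

# ave() runs the helper once per date and keeps the rows where they were
df1.2 <- df1
df1.2$steps <- ave(df1$steps, df1$date, FUN = fill_with_group_mean)
```

One difference worth knowing: merge() sorts the result by the key column by default, whereas ave() leaves the row order untouched.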

edited Aug 5, 2015 at 22:54

answered Aug 5, 2015 at 22:39

ulfelder


3 Comments

Zach Over a year ago

I get 'NaN' when I try your approach.

ulfelder Over a year ago

Huh. It works as shown on the toy data I made to mimic the structure you describe. What happens if you do it step by step? At which step in the piping does it seem to fail?

Zach Over a year ago

It worked really well when going through it one step at a time. I think the pipes may have worked too but I was just focusing on an incorrect subset of the data. Sheesh, I have so much to learn. Your implementation isn't too complicated but I'm very annoyed that I couldn't see it as clearly as you did.
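
One way a NaN can still show up, judging from the summary table in the question: 2012-10-01 has no observed steps at all, so its mean with na.rm = TRUE is NaN, and that NaN is what gets filled in. The sketch below uses made-up data and an assumed fallback to the overall mean, which is not something proposed in this thread.

```r
# Hypothetical data: the first date has no observed steps at all
df1 <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 1, NA, 5),
  stringsAsFactors = FALSE
)

# The mean of an all-NA group is NaN once the NAs are dropped
day_avg <- ave(df1$steps, df1$date, FUN = function(x) mean(x, na.rm = TRUE))

# Assumed fallback (not from the thread): use the overall mean for such days
overall_avg <- mean(df1$steps, na.rm = TRUE)
day_avg[is.nan(day_avg)] <- overall_avg

# Fill only the missing values
df1$steps <- ifelse(is.na(df1$steps), day_avg, df1$steps)
```

Whether the overall mean is a sensible substitute depends on the data; a per-interval mean across days would be another option.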

nestor556 · Answer · 2015-08-06 14:57:30Z

0

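If df1 is your original data frame and df.1 is the data frame containing the averages by date, I think a simple for loop could solve it:

for (i in df.1$date) {
  # rows of df1 on date i whose steps are missing
  missing <- df1$date == i & is.na(df1$steps)
  # overwrite only those rows with that date's average
  df1$steps[missing] <- df.1$avg[df.1$date == i]
}

It works for the toy example I just created. I hope it helps.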
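
The lookup against df.1 can also be written without a loop by using match() to find each missing row's date in the averages table. This is only a sketch on made-up data; the names df1, df.1 and avg follow the question, and building df.1 with aggregate() is an assumption made to keep the example self-contained.

```r
# Toy data (hypothetical values): df1 is the original, df.1 holds the daily means
df1 <- data.frame(
  date  = rep(c("2015-01-01", "2015-01-02"), each = 3),
  steps = c(5, NA, 7, NA, 4, 2),
  stringsAsFactors = FALSE
)
df.1 <- aggregate(steps ~ date, data = df1, FUN = mean)  # NAs are dropped by the formula method
names(df.1)[2] <- "avg"

# Look up each missing row's date in df.1 and copy over the matching average
missing <- is.na(df1$steps)
df1$steps[missing] <- df.1$avg[match(df1$date[missing], df.1$date)]
```

Only the rows where steps is NA are touched, so the observed values stay as they were.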

answered Aug 6, 2015 at 14:57

nestor556

