I have a data frame 'df1' of dim 17000 x 3 containing walking data. The interval column holds the 5-minute intervals within each 24-hour period, the date column holds the date, and the steps column holds the number of steps taken in that 5-minute interval on that date. NAs are present.

> head(df1)
  steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25

I've used dplyr to group my df by date, then created a new df 'df.1' and summarized it as avg=mean(df.1$steps, na.rm = TRUE). This gives me a nice little df of the mean number of steps on each date:

         date      avg
1  2012-10-01      NaN
2  2012-10-02  0.43750
3  2012-10-03 39.41667
4  2012-10-04 42.06944
5  2012-10-05 46.15972
6  2012-10-06 53.54167

What I would like to do is update my original df's NA-values with the mean value from each date.

So wherever steps is NA for 2012-10-02 in the first table, I'd like to replace that NA with the value 0.43750. I've tried indices, which, %in%, and the apply family, and just can't find anything that sticks.

Any help would be greatly appreciated.

5 Comments

  • Have you tried merge? Also, since you have used dplyr, mutate would be an option to add the column to the original dataset instead of summarise. Commented Aug 5, 2015 at 22:18
  • So maybe use rownames or index values and merge on like index values? Commented Aug 5, 2015 at 22:19
  • Maybe library(dplyr); df1 %>% group_by(date) %>% mutate(avg= mean(steps, na.rm=TRUE)) Commented Aug 5, 2015 at 22:21
  • I already have that bit (the mean calculated). What I need is to then update every NA value in the original df with the mean value for that given day. Commented Aug 5, 2015 at 22:23
  • What I meant is that you don't need to create a second dataset; instead you can do this in one step with mutate. If you do need the second dataset, then merge(df1, df1.1, by='date', all=TRUE) and then replace the NA values in steps with the new column. Commented Aug 5, 2015 at 22:23

2 Answers

2

This is a little clunky, but it works:

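library(dplyr)
df1.1 <- df1 %>%
    group_by(date) %>%
    # one row per date with the mean of the non-missing steps
    summarise(avg = mean(steps, na.rm = TRUE)) %>%
    # attach each date's mean to every row of the original data
    merge(df1, ., all.x = TRUE) %>%
    # fill only the missing steps with that mean
    mutate(steps = ifelse(is.na(steps), avg, steps)) %>%
    select(-avg)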

Here's my toy data:

df1 <- data.frame(date = c(rep("2015-01-01", 12), rep("2015-01-02", 12)), interval = rep(seq(12), 2),
    steps = c(5, 7, NA, 12, 3, NA, 0, 4, 12, 10, 4, 0, 3, NA, 2, 1, NA, 15, 0, 4, 7, 2, NA, 2),
    stringsAsFactors = FALSE)

Which looks like:

> head(df1)
        date interval steps
1 2015-01-01        1     5
2 2015-01-01        2     7
3 2015-01-01        3    NA
4 2015-01-01        4    12
5 2015-01-01        5     3
6 2015-01-01        6    NA 

And here's the head of the result, df1.1:

> head(df1.1)
        date interval steps
1 2015-01-01        1   5.0
2 2015-01-01        2   7.0
3 2015-01-01        3   5.7
4 2015-01-01        4  12.0
5 2015-01-01        5   3.0
6 2015-01-01        6   5.7

Here's a table of the group means to show where those 5.7s come from:

> df1 %>% group_by(date) %>% summarise(avg = mean(steps, na.rm = TRUE))
Source: local data frame [2 x 2]

        date avg
1 2015-01-01 5.7
2 2015-01-02 4.0
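
As the comments on the question suggest, the merge can probably be skipped by computing the per-date mean inside a grouped mutate(). The following is only a sketch on the same toy data, assuming dplyr is installed; the name df1_filled is made up for illustration. One thing to keep in mind: a date whose steps are all NA (like 2012-10-01 in the question) has a mean of NaN under na.rm = TRUE, so its rows would come out as NaN rather than a number.

```r
library(dplyr)

# Toy data from this answer
df1 <- data.frame(
  date = c(rep("2015-01-01", 12), rep("2015-01-02", 12)),
  interval = rep(seq(12), 2),
  steps = c(5, 7, NA, 12, 3, NA, 0, 4, 12, 10, 4, 0,
            3, NA, 2, 1, NA, 15, 0, 4, 7, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Replace each NA with the mean of the non-missing steps for the same date.
# df1_filled is a hypothetical name used only for this sketch.
df1_filled <- df1 %>%
  group_by(date) %>%
  mutate(steps = ifelse(is.na(steps), mean(steps, na.rm = TRUE), steps)) %>%
  ungroup()
```

Unlike merge(), this keeps the rows in their original order, which may matter if later steps rely on the interval sequence.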

3 Comments

I get 'NaN' when I try your approach.
Huh. It works as shown on the toy data I made to mimic the structure you describe. What happens if you do it step by step? At which step in the piping does it seem to fail?
It worked really well when going through it one step at a time. I think the pipes may have worked too but I was just focusing on an incorrect subset of the data. Sheesh, I have so much to learn. Your implementation isn't too complicated but I'm very annoyed that I couldn't see it as clearly as you did.
0

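If df1 is your original dataframe and df.1 is the dataframe containing the averages by date, I think a simple for loop that fills only the NA rows could solve it:

for (d in as.character(df.1$date)) {
  # only overwrite the rows for this date whose steps are missing
  rows <- as.character(df1$date) == d & is.na(df1$steps)
  df1[rows, "steps"] <- df.1$avg[as.character(df.1$date) == d]
}

It works for the toy example I just created. I hope it helps.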
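
Filling only the NA rows can also be done without a loop or a separate table of averages, using base R's ave(), which applies a function within each group and returns a vector in the original row order. This is just a sketch: the small data frame below is made up for illustration and only mimics the column names from the question.

```r
# Made-up data with the same columns as in the question
df1 <- data.frame(
  date = rep(c("2015-01-01", "2015-01-02"), each = 3),
  interval = rep(c(0, 5, 10), 2),
  steps = c(4, NA, 8, NA, 1, 3),
  stringsAsFactors = FALSE
)

# Within each date, replace the missing steps with that date's mean
# of the non-missing steps; non-missing values are left untouched
df1$steps <- ave(df1$steps, df1$date, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
```

As with the other approaches, a date with no recorded steps at all would end up with NaN, since there is nothing to average.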
