
I have a data frame of ~12000 observations with two columns, "Code" (shown as "Station" in the example below) and "Date". Each code should have 4 observations and therefore 4 dates, but some are missing: the rows do not exist at all, rather than being present with NA in the "Date" column.

Here is an example of my data frame:

Station Date        
7002    17/12/1966  
7002    05/05/1968  
7002    30/10/1968  
7002    16/08/1970      
7003    02/12/1966  
7003    05/05/1968  
7003    31/10/1968  
8004    04/07/1968  
8004    15/11/1968  
8006    13/10/1966  
8006    23/09/1967  
8006    01/09/1968  

[....]

What I need to do is detect, for each code, which rows are missing.

I am using "water years", which start on 1st October and end on the following 30th September, e.g. 01/10/1998 - 30/09/1999. This is the difficult part, and it is what makes my question different from other similar ones.

The time period considered ranges from 01/10/1966 to 30/09/1970 (4 water years) and the observations in the column "Date" are already fixed for water years (i.e. one observation per water year).

My output should be like: e.g.

Station Date       
7002    17/12/1966  
7002    05/05/1968
7002    30/10/1968
7002    16/08/1970    
7003    02/12/1966
7003    05/05/1968  
7003    31/10/1968  
7003    NA
8004    NA
8004    04/07/1968  
8004    15/11/1968  
8004    NA
8006    13/10/1966  
8006    23/09/1967  
8006    01/09/1968  
8006    NA
[...]
  • You state you had "12000 observations with three columns "Code" and "Date" - there are only two columns here. Commented Jun 29, 2016 at 21:45
  • sorry, just fixed. Commented Jun 29, 2016 at 21:47
  • It looks like for station 8006 you have two observations in the same water year. Commented Jun 29, 2016 at 21:50
  • eipi10 went the distance and provided a fantastic solution. My only comment would be to get a quick idea of which stations are missing data, you could run table(unlist(dat$ID))[table(unlist(dat$ID)) < 4] - which will let you know which Stations have less than 4 entries, then just rbind() NA rows for those particular stations. Commented Jun 29, 2016 at 21:53

1 Answer

library(lubridate)
library(dplyr)

Set up sample data:

dat = read.table(text="Station Date        Day
7002    17/12/1966  77
7002    05/05/1968  582
7002    30/10/1968  760
7002    16/08/1970  1415
7003    02/12/1966  62
7003    05/05/1968  582
7003    31/10/1968  761
8004    04/07/1968  4294
8004    15/11/1968  4428
8006    13/10/1966  5856
8006    23/09/1967  6567
8006    01/09/1968  6910", header=TRUE, stringsAsFactors=FALSE)

dat$Date = as.Date(dat$Date, format=c("%d/%m/%Y"))

Add water year: I've assumed that the water year is named by the year of the start of the water year. For example, water year 01/10/1967 - 30/09/1968 is water year 1967.

dat$water.year = ifelse(month(dat$Date) %in% 1:9, year(dat$Date) - 1, year(dat$Date))
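
As a quick sanity check on the rule above, here is a minimal base R sketch that applies the same logic to dates on either side of the 1 October boundary. The helper name `water_year` is hypothetical (it is not part of the answer's code), and it assumes the same naming convention: a water year is labelled by the calendar year in which it starts.

```r
# Hypothetical helper, not from the original answer: label each date with the
# calendar year in which its water year starts (1 October).
water_year <- function(d) {
  yr <- as.integer(format(d, "%Y"))
  mo <- as.integer(format(d, "%m"))
  # October-December belong to the water year starting that calendar year;
  # January-September belong to the one that started the previous October.
  ifelse(mo >= 10, yr, yr - 1L)
}

# Dates straddling the water-year and calendar-year boundaries
boundary <- as.Date(c("30/09/1967", "01/10/1967", "31/12/1967", "01/01/1968"),
                    format = "%d/%m/%Y")
wy <- water_year(boundary)
print(data.frame(Date = boundary, water.year = wy))
```

Only the first date (30/09/1967) should fall in water year 1966; the other three fall in water year 1967.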

Add rows for missing years: I do this by merging with a new data frame that includes all combinations of Station and water.year.

full_join(expand.grid(Station=unique(dat$Station), water.year=1966:1969),
          dat,
          by=c("Station","water.year")) %>% arrange(Station, water.year)

   Station water.year       Date  Day
1     7002       1966 1966-12-17   77
2     7002       1967 1968-05-05  582
3     7002       1968 1968-10-30  760
4     7002       1969 1970-08-16 1415
5     7003       1966 1966-12-02   62
6     7003       1967 1968-05-05  582
7     7003       1968 1968-10-31  761
8     7003       1969       <NA>   NA
9     8004       1966       <NA>   NA
10    8004       1967 1968-07-04 4294
11    8004       1968 1968-11-15 4428
12    8004       1969       <NA>   NA
13    8006       1966 1966-10-13 5856
14    8006       1966 1967-09-23 6567
15    8006       1967 1968-09-01 6910
16    8006       1968       <NA>   NA
17    8006       1969       <NA>   NA
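
Note that the result has 17 rows rather than 16, because station 8006 has two observations in water year 1966. To list only the problem cases, one option is to count observations per Station and water year. The following is a base R sketch, not part of the original answer; the `obs` data frame simply retypes the Station and water.year values from the output above, and the names `counts`, `missing_rows` and `duplicate_rows` are made up for illustration.

```r
# Station / water-year pairs copied from the output above (observed rows only)
obs <- data.frame(
  Station    = c(7002, 7002, 7002, 7002, 7003, 7003, 7003,
                 8004, 8004, 8006, 8006, 8006),
  water.year = c(1966, 1967, 1968, 1969, 1966, 1967, 1968,
                 1967, 1968, 1966, 1966, 1967)
)

# Count observations per Station x water year. Fixing the factor levels to
# 1966:1969 makes water years with no observation show up as zero counts.
counts <- as.data.frame(
  table(Station = obs$Station,
        water.year = factor(obs$water.year, levels = 1966:1969)),
  stringsAsFactors = FALSE
)

# Combinations with no observation, and combinations with more than one
missing_rows   <- counts[counts$Freq == 0, c("Station", "water.year")]
duplicate_rows <- counts[counts$Freq > 1, ]

print(missing_rows)
print(duplicate_rows)
```

Because `table()` returns its labels as text, `Station` and `water.year` come back as character columns here; convert them with `as.integer()` if they need to be merged back with numeric data.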

2 Comments

Hi, thanks a lot! But the "water.year" column is shifted forward by one year, e.g. 7002 1966-12-17 should be water year 1966 (not 1967), etc. How can I fix this? I tried removing the +1, but that messes up the data frame. Thanks
See updated code. Two changes: One, fix the water.year calculation. Two, change 1967:1970 to 1966:1969 in expand.grid.
