0

This is a follow-up question from what I asked here Merging a dataset based on ID and date (thanks for the helpful advice!)

I have two datasets each that are monthly summaries of different variables for households. I want to merge the two based on the household ID and the month.

df1 one looks like this:

     hh_ids      date total
     <chr>     <chr> <dbl>
1  KELDK13  2013-8-1     1
2  KMOMB02  2013-2-1     1
3  KMOMB02  2013-5-1     2
4  KMOMB04  2013-7-1     2
5  KMOMB04  2013-9-1     1
6  KMOMB06  2013-6-1     1
7  KMOMB14  2013-8-1     1
8  KMOMB16  2013-6-1     1
9  KMOMB17 2012-10-1     1
10 KMOMB17 2012-11-1     2

and the first 10 rows of df2 looks like:

  hh_ids      date    income consumption alcohol cleaning_materials  clothing
1  KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
2  KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
3  KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
4  KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
5  KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
6  KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
7  KELDK01 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
8  KELDK01 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
9  KELDK01 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
10 KELDK01 2013-08-1 123.32940    86.50588       0           2.588235  2.941176

I want the "total" column from to be added on as a column in df2 with the matched hh_ids and date.

I have tried to do the following:

df3<-merge(df2,df1,by=c("hh_ids","date"),all=TRUE) 

but the resultant data.drame df3 now looks like this (i've chopped some columns to make it easier to display):

      hh_ids      date    income consumption total
1    KELDK01 2012-11-1  62.70588    40.52941     0
2    KELDK01 2012-12-1  17.64706    42.43530     0
3    KELDK01 2013-01-1  91.76471    48.23529     0
4    KELDK01 2013-02-1  91.76470   107.52940     0
5    KELDK01 2013-03-1 116.47060   114.47060     0
6    KELDK01 2013-04-1 124.41180   118.29410     0
7    KELDK01 2013-05-1 137.23530   105.00000     0
8    KELDK01 2013-06-1 131.52940   109.54120     0
9    KELDK01 2013-07-1 121.52940   113.47060     0
2595 KNBOM33     15918        NA          NA     1
2596 KNBOM33     15979        NA          NA     1
2597 KNBOM33 2012-10-1  32.94118    18.11765     0
2598 KNBOM33 2012-12-1  56.47059    44.23529     0
2599 KNBOM33 2013-01-1  10.58824    29.76471     0
2600 KNBOM33 2013-02-1  54.70588    70.21176     0

It seems that something has happened to df1's date in the merge, and therefore it didn't merge with the other values in df2. Does anyone have any suggestions as to what I should do? I have double checked that both df1 and df2 have the "date" column formatted as a date before the merge.

2
  • Are you sure you want an outer join? I mean, if there is not a match between the two data frames, what's your expected result? Do you want to show a row even if there is not a match? If that's not your case, I would remove the all=TRUE in your merge Commented Aug 4, 2016 at 16:43
  • There should be a match for all rows in df1 in df2. I would like to keep all rows in the case there are any that don't get matched then I know there's an issue Commented Aug 4, 2016 at 16:48

2 Answers 2

1

It seems the type of "date", where "2013-8-1" is different from "2013-08-1".

before merge them:

df1$date = as.Date(df1$date);
df2$date = as.Date(df2$date);
Sign up to request clarification or add additional context in comments.

1 Comment

That seems to have done it! I have no idea why as the directly previous line was mutate("date"=as.Date(paste(year,month,day,sep='-'))), but I won't ask questions and just be thankful it worked!
0

I'd start off by checknig the rows with hh_ids == "KNB0M33" to see if there is anything funny going on in the dates, income and consumption columns.

1 Comment

I think the names are ok because when I sort them by hh_ids, data from df1 and df2 gets sorted ok (eg rows 2596-2597)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.