
Below are the summary and structure of the two data sets I tried to merge, claimants and unemp. They can be found here: claims.csv and unemp.csv.

> tbl_df(claimants)
# A tibble: 6,960 × 5
       X       County  Month  Year Claimants
   <int>       <fctr> <fctr> <int>     <int>
1      1      ALAMEDA    Jan  2007     13034
2      2       ALPINE    Jan  2007        12
3      3       AMADOR    Jan  2007       487
4      4        BUTTE    Jan  2007      3496
5      5    CALAVERAS    Jan  2007       644
6      6       COLUSA    Jan  2007      1244
7      7 CONTRA COSTA    Jan  2007      8475
8      8    DEL NORTE    Jan  2007       328
9      9    EL DORADO    Jan  2007      2120
10    10       FRESNO    Jan  2007     19974
# ... with 6,950 more rows


> tbl_df(unemp)
# A tibble: 6,960 × 7
    County  Year Month laborforce emplab unemp unemprate
*    <chr> <int> <chr>      <int>  <int> <int>     <dbl>
1  Alameda  2007   Jan     743100 708300 34800       4.7
2  Alameda  2007   Feb     744800 711000 33800       4.5
3  Alameda  2007   Mar     746600 713200 33300       4.5
4  Alameda  2007   Apr     738200 705800 32400       4.4
5  Alameda  2007   May     739100 707300 31800       4.3
6  Alameda  2007   Jun     744900 709100 35800       4.8
7  Alameda  2007   Jul     749600 710900 38700       5.2
8  Alameda  2007   Aug     746700 709600 37000       5.0
9  Alameda  2007   Sep     748200 712100 36000       4.8
10 Alameda  2007   Oct     749000 713000 36100       4.8
# ... with 6,950 more rows

I thought first I should change all the factor columns to character columns.

unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)

claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)

m <- merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1]  0 10

The output of dim(m) shows 0 rows in the resulting data frame, although all 6,960 rows should match each other uniquely.

To verify that the two data frames have the same unique combinations of the three columns 'County', 'Month', and 'Year', I sorted the rows and rearranged these columns as below:

a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]

b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]

b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE

This last output seems to confirm that the 'County', 'Month', and 'Year' columns match each other in these two data frames.

I have looked through the documentation for merge and could not work out where I go wrong. I have also tried the inner_join function from dplyr:

> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8 

I am missing something and don't know what, so I would appreciate help understanding this. I know I should not have to rearrange the rows by the three columns to run merge; R should identify the matching rows and merge the non-matching columns.

    ALAMEDA != Alameda Commented Feb 27, 2017 at 21:28

1 Answer


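The claimants df has the counties in all uppercase (ALAMEDA), while the unemp df has them in title case (Alameda). String matching in merge and inner_join is case-sensitive, so no keys match and the result has 0 rows.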
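
To illustrate the point, here is a minimal sketch using base R only. The two small data frames (left, right) and their values are made up for the example and merely mimic the case mismatch in the real data; they are not taken from claims.csv or unemp.csv.

```r
# Toy data (hypothetical values) that mimics the case mismatch in County
left <- data.frame(County = c("Alameda", "Alpine"),
                   Year = 2007L,
                   unemprate = c(4.7, 6.1),
                   stringsAsFactors = FALSE)
right <- data.frame(County = c("ALAMEDA", "ALPINE"),
                    Year = 2007L,
                    Claimants = c(13034L, 12L),
                    stringsAsFactors = FALSE)

# String comparison is case-sensitive, so these keys never match
same_name <- "ALAMEDA" == "Alameda"
n_before <- nrow(merge(left, right, by = c("County", "Year")))

# Normalising the case on one side makes the keys line up
left$County <- toupper(left$County)
n_after <- nrow(merge(left, right, by = c("County", "Year")))

print(c(same_name = same_name, n_before = n_before, n_after = n_after))
```

With the case normalised, both toy rows find a partner; before that, the merge returns an empty data frame just as in the question.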

I used options(stringsAsFactors = FALSE) when reading in your data. One suggestion: drop the X column in both, as it doesn't seem useful.

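library(dplyr)  # provides inner_join()

options(stringsAsFactors = FALSE)  # read text columns as character (the default since R 4.0.0)
claims <- read.csv("claims.csv", header = TRUE)
claims$X <- NULL  # drop the row-index column
unemp <- read.csv("unemp.csv", header = TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)  # match the uppercase names in claims

m <- inner_join(unemp, claims, by = c("County", "Year", "Month"))
dim(m)

# [1] 6960    8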
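
When a join unexpectedly returns 0 rows, it can help to list the key combinations that have no partner on the other side. The sketch below uses base R only; unmatched_keys is a hypothetical helper written for this example (not part of base R or dplyr), and the toy data frames a and b are invented stand-ins for the real files.

```r
# Hypothetical helper: key combinations present in x but absent from y
unmatched_keys <- function(x, y, keys) {
  key_x <- do.call(paste, c(x[keys], sep = "|"))  # one string per row of x
  key_y <- do.call(paste, c(y[keys], sep = "|"))  # one string per row of y
  unique(x[!(key_x %in% key_y), keys, drop = FALSE])
}

# Toy data (made-up values) with the same kind of case mismatch
a <- data.frame(County = c("Alameda", "Alpine"), Month = "Jan", Year = 2007L,
                stringsAsFactors = FALSE)
b <- data.frame(County = c("ALAMEDA", "ALPINE"), Month = "Jan", Year = 2007L,
                stringsAsFactors = FALSE)
keys <- c("County", "Month", "Year")

missing_before <- unmatched_keys(a, b, keys)  # every row of a is unmatched
a$County <- toupper(a$County)
missing_after <- unmatched_keys(a, b, keys)   # nothing left unmatched

print(missing_before)
print(nrow(missing_after))
```

Printing the unmatched keys shows the offending values directly (here the title-case county names), which is usually quicker than comparing the two tables by eye.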