I have two data frames with longitude and latitude values, and I would like to extract values from data frame 2 (say df2$C, its third column) for the rows that match data frame 1. Data frame 1 has two columns (lon, lat), and data frame 2 has three columns (lon, lat, and some value "C"). I want to add a third column to data frame 1 holding the values of df2$C for rows where BOTH columns match exactly, i.e. df1$lon == df2$lon AND df1$lat == df2$lat. For (lon, lat) pairs that don't match, I would like to add NA, so that the new column in data frame 1 has length nrow(df1).
I tried the merge function, but I'm having trouble matching both columns of df1 to those of df2.
3 Answers
You could try data.table
library(data.table)
setDT(df1)
setkey(setDT(df2), lat, lon)
df2[df1]
# lat lon C
#1: 58 1 NA
#2: 52 10 NA
#3: 54 7 -0.9094088
#4: 60 2 NA
#5: 50 3 1.4541841
#6: 56 9 -1.7771135
#7: 59 5 NA
#8: 55 8 NA
#9: 53 4 NA
#10: 57 6 NA
data
df1 <- structure(list(lat = c(58L, 52L, 54L, 60L, 50L, 56L, 59L, 55L,
53L, 57L), lon = c(1L, 10L, 7L, 2L, 3L, 9L, 5L, 8L, 4L, 6L)), .Names = c("lat",
"lon"), row.names = c(NA, -10L), class = "data.frame")
df2 <- structure(list(lat = c(51L, 55L, 50L, 58L, 56L, 57L, 60L, 54L,
52L, 54L), lon = c(13L, 10L, 3L, 6L, 9L, 8L, 9L, 16L, 4L, 7L),
C = c(1.48642005012902, 1.53314455225747, 1.45418413640182,
-0.874122129771392, -1.77711353745745, 0.128866710402714,
-2.41118134931725, -1.78305563078752, -0.0173287724390305,
-0.909408846416724)), .Names = c("lat", "lon", "C"), row.names = c(NA,
-10L), class = "data.frame")
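If the goal is to add the column to df1 itself rather than build a new joined table, an update join may be a closer fit. The following is only a sketch on small made-up tables (a and b are hypothetical stand-ins for df1 and df2), and it assumes a data.table version that supports the on= argument:

```r
library(data.table)

# Hypothetical stand-ins for df1 and df2
a <- data.table(lat = c(58L, 50L, 54L), lon = c(1L, 3L, 7L))
b <- data.table(lat = c(54L, 50L), lon = c(7L, 3L), C = c(-0.9, 1.5))

# Update join: for rows of a whose (lat, lon) pair appears in b,
# copy b's C (referred to as i.C) into a new column of a.
# Rows without a match are left as NA, and a keeps its row order.
a[b, on = c("lat", "lon"), C := i.C]

print(a)
```

No key needs to be set for this form, and a still has the same number of rows afterwards.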
Since these are geocodes, one thing to watch out for is that the fields have to match exactly. So for instance if one dataset has lon/lat to 6 significant figures, and the other has lon/lat to 8 significant figures, you will get no matches (or very few). I wonder if this is why merge(...) isn't working for you. As shown below, it should work.
merge(...) should work, especially if both data frames have the same column names. Using the datasets from @akrun's answer:
merge(df1,df2, by=c("lon","lat"),all.x=TRUE)
# lon lat C
# 1 1 58 NA
# 2 2 60 NA
# 3 3 50 1.4541841
# 4 4 53 NA
# 5 5 59 NA
# 6 6 57 NA
# 7 7 54 -0.9094088
# 8 8 55 NA
# 9 9 56 -1.7771135
# 10 10 52 NA
If you don't specify the by=... argument, merge(...) will use all common columns, so in this case you could just write:
merge(df1,df2,all.x=TRUE)
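Because merge(...) sorts the result by the key columns, the C column it returns is not aligned with the original rows of df1. One way around that, sketched here on small made-up data (a, b and the helper column row_id are hypothetical names), is to tag the rows before merging and restore the order afterwards. This assumes df2 has no duplicated (lon, lat) pairs; with duplicates, merge would return more rows than df1 has.

```r
# Hypothetical stand-ins for df1 and df2
a <- data.frame(lat = c(58L, 50L, 54L), lon = c(1L, 3L, 7L))
b <- data.frame(lat = c(54L, 50L), lon = c(7L, 3L), C = c(-0.9, 1.5))

# Remember the original row positions of a
a$row_id <- seq_len(nrow(a))

# Left join on both columns, then put the rows back in a's order
m <- merge(a, b, by = c("lon", "lat"), all.x = TRUE)
m <- m[order(m$row_id), ]

# Attach the aligned column and drop the helper
a$C <- m$C
a$row_id <- NULL

print(a)
```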
You could also use join(...) in the plyr package.
library(plyr)
join(df1,df2)
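All of these options produce the same result, although the rows come back in a different order: merge(...) sorts by the key columns, while the data.table join and join(...) keep the row order of df1.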
The data.table approach will be fastest, although without a really large dataset (>1e5 rows) you might not notice the difference.
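To illustrate the precision caveat from the start of this answer, here is a small sketch with made-up coordinates (a, b and the choice of 4 decimal places are assumptions for the example, not values from the question). The same point stored at different precision fails an exact merge, and matches once both sides are rounded to a common number of decimals:

```r
# Hypothetical data: the first point is stored at different precision
a <- data.frame(lon = c(1.123456, 2.5), lat = c(50.654321, 51.25))
b <- data.frame(lon = c(1.12345612, 2.5), lat = c(50.65432098, 51.25),
                C = c(10, 20))

# Exact matching: only the second point finds a partner
exact <- merge(a, b, all.x = TRUE)

# Round both data frames to a common precision before merging
digits <- 4  # assumed precision, pick what suits your data
a_r <- transform(a, lon = round(lon, digits), lat = round(lat, digits))
b_r <- transform(b, lon = round(lon, digits), lat = round(lat, digits))
rounded <- merge(a_r, b_r, all.x = TRUE)

print(exact)
print(rounded)
```

Rounding is a blunt tool: two nearly identical coordinates can still land on either side of a rounding boundary, so it is worth checking how many rows remain unmatched afterwards.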
You can use ifelse for this. For example, with the data:
df1 <- structure(list(lat = c(58L, 52L, 54L, 60L, 50L, 56L, 59L, 55L,
53L, 57L), lon = c(1L, 10L, 7L, 2L, 3L, 9L, 5L, 8L, 4L, 6L)), .Names = c("lat",
"lon"), row.names = c(NA, -10L), class = "data.frame")
df2 <- structure(list(lat = c(51L, 55L, 50L, 58L, 56L, 57L, 60L, 54L,
52L, 54L), lon = c(13L, 10L, 3L, 6L, 9L, 8L, 9L, 16L, 4L, 7L),
C = c(1.48642005012902, 1.53314455225747, 1.45418413640182,
-0.874122129771392, -1.77711353745745, 0.128866710402714,
-2.41118134931725, -1.78305563078752, -0.0173287724390305,
-0.909408846416724)), .Names = c("lat", "lon", "C"), row.names = c(NA,
-10L), class = "data.frame")
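You can create column C for df1 by looking up each (lat, lon) pair of df1 in df2 and using ifelse to fill in NA where there is no match:
idx <- match(paste(df1$lat, df1$lon), paste(df2$lat, df2$lon))
df1$C <- ifelse(is.na(idx), NA, df2$C[idx])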
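One pitfall with this kind of lookup: testing lat and lon separately with %in% only checks that each value occurs somewhere in df2, not that the two occur together in the same row, and indexing df2$C by position in df1 picks up values from unrelated rows. A small sketch on made-up data (a and b are hypothetical stand-ins for df1 and df2) shows the difference between column-wise and pair-wise matching:

```r
# Hypothetical stand-ins for df1 and df2
a <- data.frame(lat = c(52L, 50L), lon = c(10L, 3L))
b <- data.frame(lat = c(55L, 52L, 50L), lon = c(10L, 4L, 3L),
                C = c(1.5, -0.2, 0.7))

# Column-wise test: lat 52 and lon 10 both occur in b, but in different rows
separate <- a$lat %in% b$lat & a$lon %in% b$lon

# Pair-wise test: build one key per row so both columns must match together
key_a <- paste(a$lat, a$lon)
key_b <- paste(b$lat, b$lon)
paired <- key_a %in% key_b

# Look up C by the position of the matching pair in b (NA when absent)
a$C <- b$C[match(key_a, key_b)]

print(separate)
print(paired)
print(a)
```

Here the pair (52, 10) passes the column-wise test even though it is not a row of b, while the pair-wise test rejects it.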
merge(...) should work. You should show your code.