2

I have the results of games played between multiple players at different points in time. I have this information from two different sources who assign different unique ids to each player. I would like to find an eloquent way to match up the two data sources by player id. The two data sources

sourcex <- structure(list(outcomedate = structure(c(12637, 12637, 12637, 
12637, 12637, 12637, 12637, 12637, 12637, 12637, 12637, 12638, 
12639, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 
12641, 12641, 12641, 12643, 12644, 12644, 12644, 12644, 12644, 
12644, 12644, 12644, 12644, 12644, 12645), class = "Date"), xid1 = c(206L, 
208L, 209L, 216L, 233L, 235L, 239L, 241L, 250L, 253L, 259L, 238L, 
236L, 211L, 221L, 234L, 249L, 254L, 255L, 257L, 258L, 207L, 230L, 
248L, 258L, 207L, 211L, 230L, 234L, 236L, 248L, 249L, 254L, 255L, 
257L, 221L), xid2 = c(211L, 207L, 221L, 249L, 248L, 257L, 234L, 
255L, 236L, 258L, 254L, 230L, 241L, 253L, 235L, 238L, 208L, 233L, 
239L, 259L, 206L, 209L, 250L, 216L, 259L, 216L, 241L, 208L, 235L, 
239L, 253L, 250L, 209L, 238L, 206L, 233L), outcome1 = c(2L, 1L, 
0L, 2L, 1L, 3L, 1L, 1L, 2L, 2L, 0L, 2L, 3L, 3L, 1L, 0L, 2L, 0L, 
0L, 0L, 2L, 1L, 2L, 1L, 0L, 3L, 2L, 0L, 0L, 0L, 2L, 2L, 2L, 1L, 
1L, 1L), outcome2 = c(0L, 0L, 0L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 
0L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 0L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 0L, 1L, 1L, 2L, 1L, 0L, 1L, 1L, 3L)), .Names = c("outcomedate", 
"xid1", "xid2", "outcome1", "outcome2"), row.names = c(NA, 36L
), class = "data.frame")

sourcey <- structure(list(outcomedate = structure(c(12637, 12637, 12637, 
12637, 12637, 12637, 12637, 12637, 12637, 12637, 12637, 12638, 
12639, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 
12641, 12641, 12641, 12643, 12644, 12644, 12644, 12644, 12644, 
12644, 12644, 12644, 12644, 12644, 12645), class = "Date"), yid1 = c(56, 
46, 67, 68, 59, 63, 55, 50, 66, 61, 57, 58, 53, 60, 64, 48, 69, 
54, 51, 65, 62, 47, 49, 52, 64, 60, 47, 48, 69, 49, 54, 51, 65, 
53, 52, 62), yid2 = c(47, 51, 64, 48, 62, 69, 53, 54, 60, 49, 
65, 52, 50, 63, 57, 56, 61, 46, 58, 67, 66, 59, 68, 55, 63, 57, 
68, 55, 59, 67, 58, 66, 50, 46, 56, 61), outcome1 = structure(c(1L, 
1L, 2L, 2L, 3L, 3L, 2L, 1L, 4L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 3L, 
3L, 3L, 4L, 1L, 1L, 1L, 2L, 3L, 1L, 4L, 2L, 2L, 2L, 1L, 3L, 2L, 
3L, 3L, 1L), .Label = c("1", "2", "0", "3", "4", "5", "6"), class = "factor"), 
    outcome2 = structure(c(1L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 3L, 
    2L, 1L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L, 3L, 
    2L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("0", 
    "1", "2", "3", "4"), class = "factor")), .Names = c("outcomedate", 
"yid1", "yid2", "outcome1", "outcome2"), row.names = c(NA, 36L
), class = "data.frame")

Both sources have an outcomedate, outcome1, outcome2 in common. They assign different ids to the individual players in the game. I have done the following to find the match between ids.

sourcex$ID <- with(sourcex, paste0(outcomedate, outcome1, outcome2))
sourcey$ID <- with(sourcey, paste0(outcomedate, outcome1, outcome2))

uPlayersx <- with(sourcex, unique(c(xid1, xid2)))
uPlayersy <- with(sourcey, unique(c(yid1, yid2)))

comparex <- sapply(uPlayersx, function(x){
  paste0(with(sourcex, ID[xid1 == x| xid2 == x]), collapse = '~')
})

comparey <- sapply(uPlayersy, function(x){
  paste0(with(sourcey, ID[yid1 == x| yid2 == x]), collapse = '~')
})

dumMatch <- data.frame(xid = uPlayersx, yid = uPlayersy[match(comparex, comparey)])

It works ok here on this test dataset however the real application is larger and this feels like a hack. Also the real datasets may have errors in reporting etc so partial matches might be needed. Any help would be appreciated.

2
  • Interesting problem. How big is your real data? I am thinking number of games, number of dates, number of teams. Commented Jul 6, 2013 at 2:15
  • @flodel I have maybe 15 years of data and several hundred players. Also there is upto 10 data sources and probably missing data and errors etc. I had looked for an R api to google refine (open refine). I may look into writing one if nothing similar is available. Commented Jul 6, 2013 at 3:07

1 Answer 1

1

This will (at least) help filtering out days that match perfectly:

match.day <- function(d)
{
    tempx <- sourcex[sourcex$outcomedate==d,]
    tempy <- sourcey[sourcey$outcomedate==d,]

    if(nrow(tempx)!=nrow(tempy)) stop("matching failed: number of rows differ.")

    P <- outer(tempx$outcome1, tempy$outcome1, `==`) &
         outer(tempx$outcome2, tempy$outcome2, `==`)

    if(any(rowSums(P)!=1)) stop("maching failed: ambiguous or impossible assignment.")

    map <- P %*% seq_len(nrow(tempy))

    cbind(tempx[,c("xid1","xid2")], tempy[map,c("yid1","yid2")])
}

days <- unique(c(sourcex$outcomedate, sourcey$outcomedate))

do.call(rbind, lapply(days[-c(1,4,7)], match.day))

Note that it failed for days 1, 4 and 7 (see days[c(1,4,7)]).

Result for other days:

   xid1 xid2 yid1 yid2
12  238  230   58   52
13  236  241   53   50
22  207  209   47   59
23  230  250   52   55
24  248  216   49   68
25  258  259   64   63
36  221  233   62   61
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.