Imagine the following data (the code to reproduce all the outputs is at the end):
df
           cars horsepower year safety
1        Toyota        140 2008      4
2      Chrysler        120 2009      4
3          Ford        140 2010      5
4           BMW        150 2008      3
5 Mercedes-Benz        150 2008      3
6       Hyundai        120 2009      4
7        Jaguar        150 2007      3
8         Tesla        120 2010      5
I'd like to swap the cars to get something like:
   cars_initial    cars_match horsepower year safety horsepowerMatch yearMatch safetyMatch
1        Toyota           BMW        140 2008      4             150      2008           3
2         Tesla      Chrysler        120 2010      5             120      2009           4
3 Mercedes-Benz          Ford        150 2008      3             140      2010           5
4        Jaguar       Hyundai        150 2007      3             120      2009           4
5       Hyundai        Jaguar        120 2009      4             150      2007           3
6          Ford Mercedes-Benz        140 2010      5             150      2008           3
7      Chrysler         Tesla        120 2009      4             120      2010           5
8           BMW        Toyota        150 2008      3             140      2008           4
This is a typical assignment problem. In the case above it was solved arbitrarily: every entry of the cost matrix was set to 0, so any valid assignment is optimal and the solver simply returns one of them.
What I'm interested in are the outcomes. In the above case, the solution yields the following stats:
stats
  horsepower year safety
1       0.25 0.25      0
That is to say, 1/4 of the swaps paired cars with equal horsepower, 1/4 paired cars from the same year, and none paired cars with the same safety rating.
Here is my question: how can I solve such an assignment by putting constraints directly on the outcome statistics, instead of adjusting the costs by trial and error?
For instance, what if I would like a solution where more than 0.20 of the pairs match on safety and at least 0.10 match on year, like the one below?
desiredOutput
   cars_initial    cars_match
1        Toyota      Chrysler
2         Tesla Mercedes-Benz
3 Mercedes-Benz           BMW
4        Jaguar        Toyota
5       Hyundai         Tesla
6          Ford       Hyundai
7      Chrysler        Jaguar
8           BMW          Ford
statsDesired
  horsepower year safety
1       0.25 0.12   0.25
Of course, I could just lower the cost for every pair of cars whose safety ratings are equal.
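To make that cost-based approach concrete, here is a minimal sketch. It assumes (this is not stated above) that a car must not be matched with itself, and the cost values 0, 1 and 100 are arbitrary choices for illustration.

```r
library(lpSolve)

df <- data.frame(
  cars   = c("Toyota", "Chrysler", "Ford", "BMW", "Mercedes-Benz", "Hyundai", "Jaguar", "Tesla"),
  safety = c(4, 4, 5, 3, 3, 4, 3, 5)
)
n <- nrow(df)

# Cost 0 for pairs with equal safety, 1 for all other pairs.
cost <- 1 - outer(df$safety, df$safety, "==")

# Assumption: a car may not be matched with itself, so the diagonal is made expensive.
diag(cost) <- 100

res <- lp.assign(cost)

# Row i of the solution matrix has a single 1 in the column of the matched car.
match_idx <- apply(round(res$solution), 1, which.max)

matches_by_cost <- data.frame(
  cars_initial = df$cars,
  cars_match   = df$cars[match_idx]
)

# Share of pairs with the same safety rating.
safety_share <- mean(df$safety == df$safety[match_idx])
```

On this data every safety rating is shared by at least two cars, so the cheapest assignment pairs each car with a different car of the same rating and `safety_share` comes out as 1. That overshoots a target such as "a bit more than 0.20", which is exactly why tuning the costs turns into trial and error.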
But is there a way to influence the result by directly constraining the outcome statistics?
Or perhaps there is a way to optimize the costs so that the solution arrives at the desired statistics?
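One possible direction, sketched here under assumptions rather than offered as a definitive answer, is to write the assignment as a binary program with `lpSolve::lp()` and add one extra linear constraint per statistic. The names `min_share`, `agree` and `stats_constrained` are hypothetical, the "no car is matched with itself" rule is an assumption, and the random objective only serves to pick one feasible assignment.

```r
library(lpSolve)

df <- data.frame(
  cars       = c("Toyota", "Chrysler", "Ford", "BMW", "Mercedes-Benz", "Hyundai", "Jaguar", "Tesla"),
  horsepower = c(140, 120, 140, 150, 150, 120, 150, 120),
  year       = c(2008, 2009, 2010, 2008, 2008, 2009, 2007, 2010),
  safety     = c(4, 4, 5, 3, 3, 4, 3, 5)
)
n <- nrow(df)

# Hypothetical parameter: minimum share of pairs that must agree on each attribute.
# With n = 8, 0.20 * 8 = 1.6, so at least 2 pairs must agree, i.e. strictly more than 0.20.
# If share * n were a whole number, a strict "more than" would need that number plus 1.
min_share <- c(year = 0.10, safety = 0.20)

# Decision variables x[i, j] (1 if car i is matched with car j), flattened column-major.
row_id <- rep(seq_len(n), times = n)
col_id <- rep(seq_len(n), each = n)

# Each car is used exactly once as "initial" and exactly once as "match".
each_row_once <- t(sapply(seq_len(n), function(i) as.numeric(row_id == i)))
each_col_once <- t(sapply(seq_len(n), function(j) as.numeric(col_id == j)))

# Assumption: no car is matched with itself (sum of diagonal variables is 0).
no_self_match <- as.numeric(row_id == col_id)

# One row per constrained attribute: 1 where cars i and j share the same value.
agree <- t(sapply(names(min_share), function(k) {
  as.numeric(outer(df[[k]], df[[k]], "=="))
}))

set.seed(1)
res <- lp(
  direction    = "min",
  objective.in = runif(n * n),  # random costs: any feasible assignment is acceptable
  const.mat    = rbind(each_row_once, each_col_once, no_self_match, agree),
  const.dir    = c(rep("=", 2 * n + 1), rep(">=", length(min_share))),
  const.rhs    = c(rep(1, 2 * n), 0, min_share * n),
  all.bin      = TRUE
)
if (res$status != 0) stop("no assignment satisfies the constraints")

sol <- matrix(round(res$solution), n, n)
match_idx <- apply(sol, 1, which.max)

matches_constrained <- data.frame(
  cars_initial = df$cars,
  cars_match   = df$cars[match_idx]
)

# Resulting share of agreeing pairs per attribute.
stats_constrained <- sapply(c("horsepower", "year", "safety"), function(k) {
  mean(df[[k]] == df[[k]][match_idx])
})
```

Because the statistics are linear in the assignment variables, each target becomes a single `>=` row in the constraint matrix. If the targets cannot all be met at once, `res$status` is non-zero rather than a compromise solution being returned.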
The code:
library(lpSolve)
library(dplyr)
library(tidyr)
set.seed(1)
df <- data.frame(
  cars = c("Toyota", "Chrysler", "Ford", "BMW", "Mercedes-Benz", "Hyundai", "Jaguar", "Tesla"),
  horsepower = c(140, 120, 140, 150, 150, 120, 150, 120),
  year = c(2008, 2009, 2010, 2008, 2008, 2009, 2007, 2010),
  safety = c(4, 4, 5, 3, 3, 4, 3, 5)
)
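# Cost matrix with all costs set to 0: one row per car (cars1), one column per car.
# Named arguments avoid a duplicated `cars` column, which recent tidyr versions reject.
mat <- crossing(cars1 = df$cars, cars = df$cars) %>%
  mutate(val = 0) %>%
  spread(cars, val)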
# Solve the assignment problem on the zero-cost matrix.
solved <- lp.assign(mat %>% select(-cars1) %>% as.matrix())$solution

# Turn the 0/1 solution matrix into a table of (cars_initial, cars_match) pairs.
matches <- as.data.frame(solved) %>%
  setNames(., names(mat %>% select(-cars1))) %>%
  bind_cols(mat %>% select(cars1)) %>%
  gather(key, val, -cars1) %>%
  filter(val == 1) %>%
  select(-val, cars_initial = cars1, cars_match = key)

# Attach the attributes of both cars; the matched car's columns get the suffix "Match".
nms <- c("cars", paste0(names(df %>% select(-cars)), "Match"))
matches <- matches %>%
  left_join(df, by = c("cars_initial" = "cars")) %>%
  left_join(df %>% setNames(., nms), by = c("cars_match" = "cars"))

# Share of pairs that agree on each attribute.
stats <- matches %>%
  summarise(
    horsepower = round(sum(horsepower == horsepowerMatch) / n(), 2),
    year = round(sum(year == yearMatch) / n(), 2),
    safety = round(sum(safety == safetyMatch) / n(), 2)
  )

# Hand-built example of an assignment that meets the desired statistics.
desiredOutput <- data.frame(
  cars_initial = matches$cars_initial,
  cars_match = c("Chrysler", "Mercedes-Benz", "BMW", "Toyota", "Tesla", "Hyundai", "Jaguar", "Ford")
)
statsDesired <- desiredOutput %>%
  left_join(df, by = c("cars_initial" = "cars")) %>%
  left_join(df %>% setNames(., nms), by = c("cars_match" = "cars")) %>%
  summarise(
    horsepower = round(sum(horsepower == horsepowerMatch) / n(), 2),
    year = round(sum(year == yearMatch) / n(), 2),
    safety = round(sum(safety == safetyMatch) / n(), 2)
  )
I hope the examples above are sufficient. This is my first question, so please let me know if I need to provide anything more.
The code is in R, but I have also added the Python tag, as I don't really mind which language a possible solution uses.
In other words, can the assignment be chosen so that at least a% of the pairs have the same HP as each other, at least b% of the pairs have the same year as each other, and at least c% of the pairs have the same safety as each other?
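Written out, that restatement could look like the feasibility problem below. This is only a sketch of one possible formalization: the symbols $x_{ij}$, $n$ and the bracket notation are introduced here for illustration, and the condition $x_{ii} = 0$ (no car matched with itself) is an assumption.

```latex
\begin{align*}
\text{find }\; & x_{ij} \in \{0, 1\}, \qquad i, j = 1, \dots, n \\
\text{such that }\;
  & \sum_{j=1}^{n} x_{ij} = 1 \;\; \forall i, \qquad
    \sum_{i=1}^{n} x_{ij} = 1 \;\; \forall j, \qquad
    x_{ii} = 0 \;\; \forall i, \\
  & \sum_{i=1}^{n} \sum_{j=1}^{n} [\mathrm{HP}_i = \mathrm{HP}_j] \, x_{ij} \;\ge\; \frac{a}{100} \, n, \\
  & \sum_{i=1}^{n} \sum_{j=1}^{n} [\mathrm{year}_i = \mathrm{year}_j] \, x_{ij} \;\ge\; \frac{b}{100} \, n, \\
  & \sum_{i=1}^{n} \sum_{j=1}^{n} [\mathrm{safety}_i = \mathrm{safety}_j] \, x_{ij} \;\ge\; \frac{c}{100} \, n.
\end{align*}
```

Here $x_{ij} = 1$ means car $i$ is matched with car $j$, $n$ is the number of cars, and $[\cdot]$ equals 1 when the condition inside holds and 0 otherwise. All three statistics are linear in $x$, so they can sit next to the usual assignment constraints in an integer linear program.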