The problem I'm trying to solve is basically the same as the one in this post:
https://stats.stackexchange.com/questions/339935/python-library-for-combinatorial-optimization
My current implementation indeed uses a genetic-algorithm-based optimizer.
However, I would like to (at least try to) solve it as a binary linear programming problem, even though it is apparently NP-hard.
My question is how best to formulate the LP, because I am not sure I am doing it right.
The following is a simplified version of what I'm dealing with, but it shows exactly where the difficulty lies.
We make m*n objects (here 3*2 = 6) by a combinatorial process that combines each of m = 3 objects of type 'R1' (say {A,B,C}) with each of n = 2 objects of type 'R2' (say {X,Y}).
The 6 resulting objects {AX,AY,BX,BY,CX,CY} are evaluated and each gets a score D, in this case {0.8,0.7,0.5,0.9,0.4,0.0}, in that order.
CL <- cbind(expand.grid(R2=LETTERS[24:25],R1=LETTERS[1:3],stringsAsFactors = FALSE),D=c(0.8,0.7,0.5,0.9,0.4,0.0))
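So CL looks like this:
CL
#  R2 R1   D
#1  X  A 0.8
#2  Y  A 0.7
#3  X  B 0.5
#4  Y  B 0.9
#5  X  C 0.4
#6  Y  C 0.0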
Now we want to select 2 distinct R1's and 1 R2 such that the sum of D is maximal.
In this example, the answer is R1 = {A,B}, R2 = {Y}.
However, one would not reach that conclusion by, for instance, simply taking the 2 R1's and the R2 with the highest average D.
That would work for R1, but not for R2:
aggregate(D~R1,CL,mean)
#  R1    D
#1  A 0.75
#2  B 0.70
#3  C 0.20
aggregate(D~R2,CL,mean)
#  R2         D
#1  X 0.5666667
#2  Y 0.5333333
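As a quick sanity check only (brute force obviously doesn't scale), enumerating all choose(3,2)*choose(2,1) = 6 possible selections confirms that R1 = {A,B} with R2 = {Y} (total D = 1.6) beats the 'highest average' pick R1 = {A,B} with R2 = {X} (total D = 1.3):
r1.pairs <- combn(unique(CL$R1), 2, simplify = FALSE)
r2.opts  <- unique(CL$R2)
scores <- sapply(r1.pairs, function(p) sapply(r2.opts, function(r2) sum(CL$D[CL$R1 %in% p & CL$R2 == r2])))
dimnames(scores) <- list(r2.opts, sapply(r1.pairs, paste, collapse = ","))
scores
#  A,B A,C B,C
#X 1.3 1.2 0.9
#Y 1.6 0.7 0.9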
I know how to formulate this as a binary linear programming problem; I am just not sure my formulation is efficient, because it results in a problem with mn+m+n variables and 2(m+n)+2 constraints.
The main difficulty is that I somehow need to count the number of distinct R1's and R2's chosen, and the only way I know of doing that is the one shown below (also described in my other post here): adding a binary indicator variable for each R1 and each R2, linked to the entry variables by inequality constraints.
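To make the construction easier to follow, this is the model it builds, in my (ad hoc) notation: x_e is the binary selection variable for entry e, y_r and z_s are the binary indicators for the distinct R1's and R2's, and n_r, n_s are the numbers of entries containing R1 value r and R2 value s:
maximize    sum_e D_e * x_e
subject to  sum_{e containing r} x_e - n_r * y_r <= 0         for every R1 value r
            sum_{e containing s} x_e - n_s * z_s <= 0         for every R2 value s
            sum_{e containing r} x_e - n_r * y_r >= 1 - n_r   for every R1 value r
            sum_{e containing s} x_e - n_s * z_s >= 1 - n_s   for every R2 value s
            sum_r y_r = 2
            sum_s z_s = 1
            all x_e, y_r, z_s binary
The first two groups of constraints force y_r (z_s) to 1 as soon as any entry containing r (s) is selected; the next two force each y_r (z_s) that is set to 1 to be 'used' by at least one selected entry.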
This is what I would do:
# index each entry (product) and build 0/1 incidence matrices:
# R1.mat[r,e] = 1 if entry e contains R1 value r, and similarly for R2.mat
CL["Entry"] <- seq_len(dim(CL)[[1]])
R1.mat <- table(CL$R1,CL$Entry)
R2.mat <- table(CL$R2,CL$Entry)
N_R1 <- dim(R1.mat)[[1]]
N_R2 <- dim(R2.mat)[[1]]
N_Entry <- dim(CL)[[1]]
constr.mat <- NULL
dir <- NULL
rhs <- NULL
# linking constraints ('<='): an entry can only be selected if its R1/R2 indicator is 1
constr.mat <- rbind(constr.mat,cbind(R1.mat,-diag(table(CL$R1)),matrix(0,N_R1,N_R2)))
dir <- c(dir,rep("<=",N_R1))
rhs <- c(rhs,rep(0,N_R1))
constr.mat <- rbind(constr.mat,cbind(R2.mat,matrix(0,N_R2,N_R1),-diag(table(CL$R2))))
dir <- c(dir,rep("<=",N_R2))
rhs <- c(rhs,rep(0,N_R2))
# reverse linking constraints ('>='): an indicator can only be 1 if at least one of its entries is selected
constr.mat <- rbind(constr.mat,constr.mat)
dir <- c(dir,rep(">=",N_R1+N_R2))
rhs <- c(rhs,1-table(CL$R1),1-table(CL$R2))
# cardinality constraints: exactly 2 distinct R1's and exactly 1 distinct R2
constr.mat <- rbind(constr.mat,c(rep(0,N_Entry),rep(1,N_R1),rep(0,N_R2)))
dir <- c(dir,"==")
rhs <- c(rhs,2)
constr.mat <- rbind(constr.mat,c(rep(0,N_Entry),rep(0,N_R1),rep(1,N_R2)))
dir <- c(dir,"==")
rhs <- c(rhs,1)
# objective: D for the entry variables, 0 for the indicator variables
obj <- c(aggregate(D~Entry,CL,c)[["D"]],rep(0,N_R1+N_R2))
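At this point the problem has the advertised size, i.e. 2(m+n)+2 = 12 constraints on mn+m+n = 11 variables:
dim(constr.mat)
#[1] 12 11
length(obj)
#[1] 11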
Which can be solved for instance by lpSolve:
library(lpSolve)
sol <- lp("max", obj, constr.mat, dir, rhs, all.bin = TRUE, num.bin.solns = 1, use.rw = FALSE, transpose.constraints = TRUE)
sol$solution
#[1] 0 1 0 1 0 0 1 1 0 0 1
showing that products {AY,BY} were selected, corresponding to R1 = {A,B} and R2 = {Y}:
CL[as.logical(sol$solution[1:N_Entry]),]
#  R2 R1   D Entry
#2  Y  A 0.7     2
#4  Y  B 0.9     4
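The chosen 'distinct' R1's and R2's can also be read directly off the indicator part of the solution vector:
rownames(R1.mat)[sol$solution[N_Entry + seq_len(N_R1)] > 0.5]
#[1] "A" "B"
rownames(R2.mat)[sol$solution[N_Entry + N_R1 + seq_len(N_R2)] > 0.5]
#[1] "Y"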
I found that on large problems lpSolve gets stuck for ages; Rsymphony seemed to perform better.
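Something along these lines (an untested sketch, assuming the Rsymphony package and its Rsymphony_solve_LP interface, reusing the same obj, constr.mat, dir and rhs as above):
library(Rsymphony)
sol.sym <- Rsymphony_solve_LP(obj, constr.mat, dir, rhs,
                              types = rep("B", length(obj)), max = TRUE)
sol.sym$solution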
But again, my main question is: is this way of formulating the LP efficient? Should I do it differently?
Thanks!
EDIT
In the meantime, working on a somewhat related problem, I found that a single set of linking constraints may be sufficient, if one adds 'costs' (negative in this example, since we are maximizing) to the objective coefficients of the 'distinct R1' and 'distinct R2' indicator variables.
So here, instead of:
obj <- c(aggregate(D~Entry,CL,c)[["D"]],rep(0,N_R1+N_R2))
I would do:
obj <- c(aggregate(D~Entry,CL,c)[["D"]],rep(-1,N_R1+N_R2))
This would make m+n constraints unnecessary.
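Concretely, for the toy example above, the reduced model would be built like this (just a sketch of what I mean: the same pieces as before, minus the duplicated '>=' block):
# linking '<=' constraints plus the two cardinality constraints only
constr.mat2 <- rbind(cbind(R1.mat, -diag(table(CL$R1)), matrix(0, N_R1, N_R2)),
                     cbind(R2.mat, matrix(0, N_R2, N_R1), -diag(table(CL$R2))),
                     c(rep(0, N_Entry), rep(1, N_R1), rep(0, N_R2)),
                     c(rep(0, N_Entry), rep(0, N_R1), rep(1, N_R2)))
dir2 <- c(rep("<=", N_R1 + N_R2), "==", "==")
rhs2 <- c(rep(0, N_R1 + N_R2), 2, 1)
# penalise the indicator variables instead of constraining them from below
obj2 <- c(aggregate(D~Entry,CL,c)[["D"]], rep(-1, N_R1 + N_R2))
sol.red <- lp("max", obj2, constr.mat2, dir2, rhs2, all.bin = TRUE)
sol.red$solution[1:N_Entry]  # should again select entries 2 and 4 (AY and BY) here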
Even so, it remains a huge problem to solve, even for relatively small m and n, so if anyone can advise on how to do this better...
I had a look at lp.transport, but that would be limited to 2 dimensions (i.e. only R1 and R2, not R1, R2, R3 for instance), and I don't think you can constrain the number of distinct objects per category in that kind of solver.