0

I am currently working on the following problem :

I have an array of j + 1 columns and i row with a first column serving to identify them (important for using this problem later on) In each subsequent j columns are values corresponding to possible awnsers, coded as letters ex :

col 0 col 1 col 2 col 3
--1-- --A-- --C-- --A--
--2-- --A-- --C-- --A--
--3-- --B-- --B-- --A--
--4-- --B-- --A-- --B--

...

I have to select a number N of rows with some conditions regarding the number of certain awnsers in each column ex :

For column one : 47 of A / 60 of B For column two : 45 of A / 20 of B / 17 of C For column three : 10 of A / 15 of B

I have to minimise the deltas between the expected numbers (as showed above) for each column so i get as close as possible to aving the exact expected numbers in my final solution.

I have not yet tried any specific code, this looks like a problem of linear optimisation but as being quite uninformed about the subjects i am having a hard time find relevant information.

Should you have any leads at all as to what terms, algorithms, anytihng i could use, it would be great !

4
  • 3
    I am afraid, after reading your question I cannot understand what you really try accomplishing... Besides counting each letter (answers, probably...) what else do you try doing? What does "the exact expected numbers in my final solution" mean? Based on what to make the mentioned "Optimization"? Commented Jun 27, 2024 at 13:18
  • Hello, basically i have, for my final result, a sample of N rows from my initial array where i need to be as close as possible toa previoulsy agreed upon count of each possible letters for each rows. I could test all possible combinations and for each count the number of letters in each column and choose the best combination (closest to my expect countà), but as my array grows larger the number of possible cominations grow exponentially so i fear that brute forcing it this way is impossible. I mainly need something to get me started as i don't really know what to look for Commented Jun 28, 2024 at 9:28
  • I am afraid that you did not clarify too much... If you do not have/provide the logic of "possible answers" distribution on each column (existing now) you do not have other choice then clearly defining (now, it looks to be only in your head) the reference to compare with, place the range (what you name "Array") in a real array (allowing to work only in memory) and start thinking to an eficient way to iterate and check. I would also advice to search for 2D arrays in VBA... Commented Jun 28, 2024 at 9:49
  • Looks a bit a like a goal programming problem. I am not sure from the description if the model has just to select N rows from a given array, or if it has to fill a new matrix with N rows. Commented Jun 28, 2024 at 11:10

1 Answer 1

0

I never reach for Excel in this kind of situation. If you really need to, you can import/export from Excel using an interop library like Pandas.

Let's interpret your question literally: you need to select a fixed number of rows that minimise the selection error based on some target sums:

import numpy as np
import pulp

n = 100                                        # number of rows in input data set
n_select = 60                                  # number of rows to select in solution
index = np.arange(1, 1+n)                      # row names
letters = np.array(tuple('ABC'))               # all possible letters
rand = np.random.default_rng(seed=0)           # random number generator for fake input
values = rand.choice(a=letters, size=(n, 3))   # fake input

# 3x100x3 (letters, n, columns): one-hot letter matches
onehot = values[np.newaxis, :, :] == letters[:, np.newaxis, np.newaxis]

# assignment variables over rows
assign = pulp.LpVariable.matrix(
    name='asn', indices=index, cat=pulp.LpBinary,
)

# target sums by column and letter, or NaN for don't-care
targets = np.array((
    (          47, 45,           10),
    (          60, 20,           15),
    (float('nan'), 17, float('nan')),
))

prob = pulp.LpProblem(name='assignment_problem', sense=pulp.LpMinimize)
prob.addConstraint(name='n_select', constraint=pulp.lpSum(assign) == n_select)

sums = []
total_error = 0

for letter, letter_onehot, letter_targets in zip(letters, onehot, targets):
    # sums for this letter over all columns, via matrix product
    letter_sums = assign @ letter_onehot
    sums.append(letter_sums)

    for i, (target, total) in enumerate(zip(letter_targets, letter_sums)):
        if np.isfinite(target):
            error = total - target
            abs_error = pulp.LpVariable(name=f'err_{letter}{i}', cat=pulp.LpContinuous)
            total_error = abs_error + total_error

            # constraints to get effect of absolute error
            prob.addConstraint(
                name=f'err_{letter}{i}l', constraint=abs_error >= error,
            )
            prob.addConstraint(
                name=f'err_{letter}{i}h', constraint=abs_error >= -error,
            )


prob.setObjective(total_error)
print(prob)
prob.solve()
assert prob.status == pulp.LpStatusOptimal

print('Assignment sums by column and letter:')
for row in sums:
    print([s.value() for s in row])
assignment_problem:
MINIMIZE
1*err_A0 + 1*err_A1 + 1*err_A2 + 1*err_B0 + 1*err_B1 + 1*err_B2 + 1*err_C1 + 0
SUBJECT TO
n_select: asn_1 + asn_10 + asn_100 + asn_11 + asn_12 + asn_13 + asn_14
 + asn_15 + asn_16 + asn_17 + asn_18 + asn_19 + asn_2 + asn_20 + asn_21
 + asn_22 + asn_23 + asn_24 + asn_25 + asn_26 + asn_27 + asn_28 + asn_29
 + asn_3 + asn_30 + asn_31 + asn_32 + asn_33 + asn_34 + asn_35 + asn_36
 + asn_37 + asn_38 + asn_39 + asn_4 + asn_40 + asn_41 + asn_42 + asn_43
 + asn_44 + asn_45 + asn_46 + asn_47 + asn_48 + asn_49 + asn_5 + asn_50
 + asn_51 + asn_52 + asn_53 + asn_54 + asn_55 + asn_56 + asn_57 + asn_58
 + asn_59 + asn_6 + asn_60 + asn_61 + asn_62 + asn_63 + asn_64 + asn_65
 + asn_66 + asn_67 + asn_68 + asn_69 + asn_7 + asn_70 + asn_71 + asn_72
 + asn_73 + asn_74 + asn_75 + asn_76 + asn_77 + asn_78 + asn_79 + asn_8
 + asn_80 + asn_81 + asn_82 + asn_83 + asn_84 + asn_85 + asn_86 + asn_87
 + asn_88 + asn_89 + asn_9 + asn_90 + asn_91 + asn_92 + asn_93 + asn_94
 + asn_95 + asn_96 + asn_97 + asn_98 + asn_99 = 60

err_A0l: - asn_10 - asn_13 - asn_15 - asn_17 - asn_2 - asn_3 - asn_31 - asn_32
 - asn_33 - asn_35 - asn_38 - asn_40 - asn_42 - asn_44 - asn_51 - asn_57
 - asn_60 - asn_63 - asn_77 - asn_79 - asn_83 - asn_84 - asn_85 - asn_86
 - asn_95 + err_A0 >= -47

err_A0h: asn_10 + asn_13 + asn_15 + asn_17 + asn_2 + asn_3 + asn_31 + asn_32
 + asn_33 + asn_35 + asn_38 + asn_40 + asn_42 + asn_44 + asn_51 + asn_57
 + asn_60 + asn_63 + asn_77 + asn_79 + asn_83 + asn_84 + asn_85 + asn_86
 + asn_95 + err_A0 >= 47
...
VARIABLES
0 <= asn_1 <= 1 Integer
0 <= asn_10 <= 1 Integer
0 <= asn_100 <= 1 Integer
...
err_A0 free Continuous
err_A1 free Continuous
err_A2 free Continuous
err_B0 free Continuous
err_B1 free Continuous
err_B2 free Continuous
err_C1 free Continuous
...
Result - Optimal solution found

Objective value:                81.00000000
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.00

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.01   (Wallclock seconds):       0.01

Assignment sums by column and letter:
[19.0, 23.0, 10.0]
[30.0, 20.0, 16.0]
[11.0, 17.0, 34.0]
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.