Filter data into binary classes on GPU

Asked 6 months ago

Modified 6 months ago

Viewed 90 times

I have an ML problem where I want to leverage Support Vector Classifiers (SVC), or any other two-class classifier, and compare them to my NN models. The problem is that binary classifiers are ... well ... binary. So I need to create (n_class * (n_class-1))/2 classifiers and compare them in a 1-vs-1 manner. I am using Scikit-Learn classifiers, but these are single-threaded and therefore painfully slow (a single operation has been running for over 2 days now). I have used multiprocessing.Pool to run the binary SVCs in parallel and then built the voting system by hand. This works wonders, as my data can be fitted in just 6 minutes.
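
To make the pair count concrete, here is a minimal sketch of enumerating the 1-vs-1 class pairs. The helper name `class_pairs` is hypothetical and not part of my code; it only illustrates that the number of pairs is (n_class * (n_class-1))/2.

```python
from itertools import combinations

def class_pairs(n_cls):
    """Return all unordered class-index pairs (i, j) with i < j."""
    return list(combinations(range(n_cls), 2))

pairs = class_pairs(4)
print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(pairs))  # 6, i.e. 4 * 3 / 2
```

With 10 classes this already gives 45 pairs, which is why the per-pair filtering cost matters.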

But, and here is my actual question: how do I filter my training data into so many binary sets? Since the number of binary classifiers (and therefore of binary datasets) scales quadratically with the number of classes, it takes a good half hour to sort through my (reduced) training data. The filtering does not need to be sequential in any way; it just needs to be quick.

Here's my current solution:

import multiprocessing

def make_pairs_ds(self, X, y):
    # All unordered class-index pairs [i, j] with i < j.
    ijs = [[i, j] for i in range(self.n_cls-1) for j in range(i+1, self.n_cls)]
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        # Each task receives the full X and y plus one pair and the class list.
        self.pair_ds = pool.map(filter_task, zip(
            [X for _ in range(len(ijs))],
            [y for _ in range(len(ijs))],
            ijs,
            [self.unique_cls for _ in range(len(ijs))]
        ))

With the single filtering (dataset constructing) task being:

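import numpy as np

def filter_task(X_y_ij_c):
    # Unpack the (X, y, [i, j], unique classes) tuple passed by pool.map.
    X, y, ij, uniq_classes = X_y_ij_c
    # Keep only the samples whose label is one of the two classes of this pair.
    filter = [s==uniq_classes[ij[0]] or s==uniq_classes[ij[1]] for s in y]
    # Relabel: 0 for the first class of the pair, 1 for the second.
    return ([ij[0], ij[1]], ( X[filter], np.where(y[filter]==uniq_classes[ij[0]], 0, 1)))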

Basically, I need to create (n_class*(n_class-1))/2 datasets, where each dataset contains only 2 of the classes (with label '0' for the first class and '1' for the second). I was wondering whether this for-loop-heavy implementation can be made quicker with some clever matrix manipulations, using cupy for example. This would require knowing what exactly CUDA and cupy can do and, worst of all, being clever about it (which I find difficult).
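
One possible direction, sketched here in plain NumPy under stated assumptions: collect the sample indices of each class once, then build every pair dataset by concatenating two index arrays, so the labels are scanned n_class times rather than once per pair. The function name `make_pair_datasets` is hypothetical, and `X` and `y` are assumed to be NumPy arrays with `y` one-dimensional. The calls used (`unique`, `flatnonzero`, `concatenate`) also exist in CuPy, but a GPU port is untested here.

```python
import numpy as np
from itertools import combinations

def make_pair_datasets(X, y):
    """Build one ((i, j), (X_pair, y_pair)) entry per unordered class pair."""
    y = np.asarray(y).ravel()
    # classes: sorted unique labels; inv: position of each sample's label in `classes`.
    classes, inv = np.unique(y, return_inverse=True)
    # Scan the labels once per class (n_cls scans, not one scan per pair).
    idx_by_class = [np.flatnonzero(inv == c) for c in range(len(classes))]
    pair_ds = []
    for i, j in combinations(range(len(classes)), 2):
        # Samples of class i come first, followed by samples of class j.
        idx = np.concatenate([idx_by_class[i], idx_by_class[j]])
        labels = np.concatenate([
            np.zeros(len(idx_by_class[i]), dtype=np.int8),  # first class -> 0
            np.ones(len(idx_by_class[j]), dtype=np.int8),   # second class -> 1
        ])
        pair_ds.append(((i, j), (X[idx], labels)))
    return pair_ds

X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([2, 0, 1, 0, 2, 1])
demo = make_pair_datasets(X_demo, y_demo)
print(len(demo))  # 3 pairs for 3 classes
```

Note that the samples in each pair dataset are grouped by class rather than kept in their original order, which matches the requirement that the filtering need not be sequential.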

I have also briefly tried using numba.jit to compile filter_task on the fly, but it only seems to slow down the CPU-only solution. If I simply replace numpy with cupy everywhere, numba refuses to run and I get errors.

I would love the help. Thank you

edited May 30 at 2:04

asked May 30 at 1:20

user30013477


"filter = [s==uniq_classes[ij[0]] or s==uniq_classes[ij[1]] for s in y]" Why don't you vectorize this comparison in NumPy? E.g. (y == uniq_classes[ij[0]]) | (y == uniq_classes[ij[1]])

– Nick ODell

2025-05-30 05:40:21 +00:00
Commented May 30 at 5:40
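
A minimal sketch of the vectorized comparison suggested in this comment, assuming `y` is a NumPy array of class labels. The helper name `pair_mask` and the toy data are illustrative only.

```python
import numpy as np

def pair_mask(y, class_a, class_b):
    """Boolean mask selecting samples whose label is class_a or class_b."""
    # Each comparison is elementwise; `|` combines the two masks per element.
    return (y == class_a) | (y == class_b)

y = np.array([0, 2, 1, 0, 2, 1])
X = np.arange(12).reshape(6, 2)

mask = pair_mask(y, 0, 2)
X_pair = X[mask]                       # rows belonging to class 0 or class 2
y_pair = np.where(y[mask] == 0, 0, 1)  # 0 for the first class, 1 for the second
print(y_pair.tolist())  # [0, 1, 0, 1]
```
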
I don't think that works. I thought comparisons on vectors with more than one element are ambiguous and will always return True.

– user30013477

2025-06-02 00:22:09 +00:00
Commented Jun 2 at 0:22
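
For reference, a short sketch of where the ambiguity arises in NumPy: `==` and `|` on arrays work elementwise and return a boolean array, whereas Python's `or` asks for a single truth value of a whole array and raises a `ValueError` when the array has more than one element. The variable names are illustrative.

```python
import numpy as np

y = np.array([0, 1, 2, 1])

# Elementwise: each comparison yields a boolean array, combined per element by `|`.
elementwise = (y == 0) | (y == 1)
print(elementwise.tolist())  # [True, True, False, True]

# `or` needs one truth value for the whole array, which NumPy refuses to guess.
try:
    (y == 0) or (y == 1)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```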
