$\begingroup$

I am looking for a clustering algorithm that can capture the following structure in a multivariate binary dataset. In the sample below, class 1 and class 2 both have a 1 in columns A and B, so they should form a cluster. The same holds for class 5 and class 6, which both have a 1 in columns C and D. Class 3 and class 4 should form a cluster that lies closer to the cluster of class 1 and 2, since classes 1 through 4 all have a 1 in column B. Is hierarchical clustering an appropriate technique for displaying this kind of relationship?

The data are as follows:

        A  B  C  D
class1  1  1  0  0
class2  1  1  0  0
class3  0  1  0  0
class4  0  1  0  0
class5  0  0  1  1
class6  0  0  1  1
$\endgroup$

1 Answer

$\begingroup$

Yes, hierarchical clustering is appropriate for this. There are several variants you can use (agglomerative, divisive, and so on), which I won't go into here.

The way to think about this is by looking at the distance between rows.

  • Class 1 and 2 get grouped together because the distance between their rows is zero (the rows are identical).
  • Class 5 and 6 get grouped together for the same reason.
  • Class 3 and 4 get grouped together for the same reason.
  • Cluster 3&4 is closer to cluster 1&2 than to cluster 5&6 because the distance between the corresponding rows is smaller.
  • For example, if the distance metric is the sum of absolute element-wise differences between two rows (the Manhattan distance), then the distance from 1&2 to 3&4 is 1, while the distance from 5&6 to 3&4 is 3.
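
The arithmetic in the last bullet can be checked with a few lines of plain Python. This is only a sketch: the `rows` dictionary and the `manhattan` helper are names made up for illustration, and the Manhattan distance is just the example metric from the bullet, not the only reasonable choice.

```python
# Rows from the question: one binary vector per class over columns A, B, C, D.
rows = {
    "class1": [1, 1, 0, 0],
    "class2": [1, 1, 0, 0],
    "class3": [0, 1, 0, 0],
    "class4": [0, 1, 0, 0],
    "class5": [0, 0, 1, 1],
    "class6": [0, 0, 1, 1],
}


def manhattan(u, v):
    """Sum of absolute element-wise differences between two rows (illustrative helper)."""
    return sum(abs(a - b) for a, b in zip(u, v))


d_12 = manhattan(rows["class1"], rows["class2"])  # identical rows
d_13 = manhattan(rows["class1"], rows["class3"])  # differ only in column A
d_53 = manhattan(rows["class5"], rows["class3"])  # differ in columns B, C and D

print(d_12, d_13, d_53)  # prints: 0 1 3
```

Because the rows within each pair are identical, using class 1 to stand in for cluster 1&2 (and so on) gives the same numbers as comparing the clusters themselves.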

So the two choices you need to make are:

  1. Which clustering algorithm to use.
  2. What distance metric to use.
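
As a rough illustration of how those two choices show up in practice, here is a minimal sketch using SciPy's `scipy.cluster.hierarchy` module. The specific choices (average linkage, `"cityblock"` distance) and the variable names are assumptions made for the example, not a recommendation; other linkage methods and metrics may suit your real data better.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

labels = ["class1", "class2", "class3", "class4", "class5", "class6"]
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Choice 1: the algorithm (agglomerative clustering with average linkage).
# Choice 2: the distance metric ("cityblock" is SciPy's name for Manhattan distance).
Z = linkage(X, method="average", metric="cityblock")

# Cut the tree into at most 3 flat clusters, then at most 2.
three = fcluster(Z, t=3, criterion="maxclust")
two = fcluster(Z, t=2, criterion="maxclust")

for name, c3, c2 in zip(labels, three, two):
    print(name, c3, c2)
```

With three clusters the pairs 1&2, 3&4 and 5&6 each get their own label; with two clusters, 3&4 merges with 1&2 before either joins 5&6, which is the relationship described in the question. Passing `Z` to `scipy.cluster.hierarchy.dendrogram` (with matplotlib installed) draws the same hierarchy as a tree.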
$\endgroup$
  • $\begingroup$ Thank you, that helps. Do you know whether it is possible to express the probability that a class will be selected with another class? FYI, I use SciPy with Python to do this. $\endgroup$ Commented Oct 14, 2021 at 3:08
