I want to build a classifier for text, which is then used to suggest the most similar text to a given one.
The flow of the app is the following:
- extract the 10 main topics from the text using an LLM (it can choose from a pool of 150 words)
- turn the topic list into a binary vector, so I'm basically working in a 150-dimensional space where each text gets a coordinate like [1, 0, 1, ..., 0]
- find the closest neighbour using cosine distance (I want to extend this to 3-5 neighbours, but for simplicity let's assume it's only one)
- receive the closest text
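For reference, here is a minimal sketch of the matching step, assuming the LLM has already returned the topic lists. The pool contents (`topic_0` ... `topic_149`) and the corpus are stand-ins, not my real data:

```python
import numpy as np

# Stand-in for the real 150-word pool the LLM chooses from
TOPIC_POOL = [f"topic_{i}" for i in range(150)]
TOPIC_INDEX = {t: i for i, t in enumerate(TOPIC_POOL)}

def to_binary_vector(topics):
    """Map a list of extracted topics to a 150-dim binary vector."""
    v = np.zeros(len(TOPIC_POOL))
    for t in topics:
        v[TOPIC_INDEX[t]] = 1.0
    return v

def cosine_similarity(a, b):
    """Cosine similarity; nearest neighbour = highest similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_text(query_topics, corpus):
    """corpus: list of (text_id, topic_list). Returns the most similar id."""
    q = to_binary_vector(query_topics)
    return max(
        corpus,
        key=lambda item: cosine_similarity(q, to_binary_vector(item[1])),
    )[0]
```

So two texts count as similar only in proportion to how many pool words their topic lists share.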
The problem is that the texts are pretty different, and the LLM extracts the topics quite well, but the suggested texts are not what I was expecting. I tried ordering the topics by importance and making the vector non-binary ([10, 0, 0, 9, ..., 1]), but that didn't seem to help much.
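The importance-weighted variant I tried looks roughly like this (a sketch with an assumed weighting scheme: rank 1 gets weight 10, rank 10 gets weight 1; the pool words are placeholders):

```python
import numpy as np

POOL = [f"word_{i}" for i in range(150)]  # stand-in for the real 150-word pool
INDEX = {w: i for i, w in enumerate(POOL)}

def to_weighted_vector(ranked_topics):
    """Most important topic gets weight 10, the 10th gets weight 1."""
    v = np.zeros(len(POOL))
    for rank, w in enumerate(ranked_topics):
        v[INDEX[w]] = 10 - rank
    return v
```

Note that cosine distance is scale-invariant, so weighting only changes the *relative* contribution of topics within a vector, which may be why the effect was small.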
I was wondering whether this approach is simply not a good fit for my problem, or whether I should use other parameters or something else entirely for grouping my texts.