
I want to build a text classifier that will then be used to suggest the most similar text to a given one.

The flow of the app is the following:

  • extract the 10 main topics from the text using an LLM (it can choose from a pool of 150 words)
  • turn the topic list into a binary vector, so each text gets a coordinate in a 150-dimensional space, e.g. [1, 0, 1, ..., 0]
  • find the closest neighbour using cosine distance (I want to extend this to 3-5 neighbours, but for simplicity let's assume it is only one)
  • receive the closest text
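The flow above can be sketched as follows; a minimal pure-Python sketch, assuming the LLM step already returns the indices of the chosen pool words (the pool size of 150 and the function names are illustrative):

```python
import math

POOL_SIZE = 150  # size of the topic word pool the LLM chooses from

def to_binary_vector(topic_indices, dim=POOL_SIZE):
    """Turn the indices of the ~10 extracted topics into a 0/1 vector."""
    vec = [0.0] * dim
    for i in topic_indices:
        vec[i] = 1.0
    return vec

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_text(query_vec, corpus_vecs):
    """Index of the nearest neighbour by cosine similarity."""
    return max(range(len(corpus_vecs)),
               key=lambda i: cosine_similarity(query_vec, corpus_vecs[i]))
```

Extending to the 3-5 nearest neighbours is then just a matter of sorting all similarities instead of taking the single maximum.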

The problem is that the texts are quite different, and while the LLM extracts the topics fairly well, the suggested texts are not what I was expecting. I tried ordering the topics by importance and making the vector non-binary ([10, 0, 0, 9, ..., 1]), but that didn't seem to help much.
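For reference, the importance-weighted variant could look like this; a sketch assuming rank-based weights of 10 down to 1, as in the example vector above:

```python
def to_weighted_vector(ranked_topic_indices, dim=150):
    """Weight topics by importance: the first (most important) topic
    gets weight 10, the second 9, ..., the tenth 1."""
    vec = [0.0] * dim
    for rank, idx in enumerate(ranked_topic_indices):
        vec[idx] = float(10 - rank)
    return vec
```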

I was wondering whether this approach is unsuitable for my problem, or whether I should use other parameters or something else to group my texts.

1 Answer

If you are already using LLMs, you need a lot of compute power anyway, so it does not seem like a good idea to then circle back to a simple binary vector and use that for the actual matching: that step may lose much of the semantic information the LLM actually encoded.

It would probably be much more effective to either use something like SentenceTransformers for embeddings plus k-Means clustering if you just want clusters/groups, or use something like FAISS to build a vector database (a database of all embedded documents) and perform efficient similarity search in it. If the latter is too much of a hassle, you can also use any library that calculates similarity metrics between vectors and apply it to the (normalized) embedded documents.
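The last suggestion can be sketched with plain NumPy; assuming the documents have already been embedded (e.g. with SentenceTransformers, shown only as a comment here since it needs a downloaded model), the retrieval step is just normalized dot products:

```python
import numpy as np

# Hypothetical embedding step (not run here):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   doc_embs = model.encode(documents)   # shape (n_docs, dim)

def normalize(embeddings):
    """L2-normalize rows so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def top_k_similar(query_emb, doc_embs, k=3):
    """Indices of the k most similar documents by cosine similarity."""
    sims = normalize(doc_embs) @ normalize(query_emb.reshape(1, -1)).T
    return np.argsort(-sims.ravel())[:k]
```

FAISS essentially does the same search, just with index structures that stay fast for millions of documents.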

Comments

I mean, I chose to use LLMs over the other options. I'm pretty new to NLP, and if I knew a better way to find related documents in a database I would use it. This method was the recommendation I received, and at first it seemed like the best solution. If you know any other solutions that fit my case, I would try them.
The problem with the provided links is that, for my use case, I need to rely on libraries as little as possible and implement the methods myself as much as possible. I know that LLMs don't count, which is why I used them. The suggested libraries abstract away too much, and even if they helped me produce a solution, it wouldn't be taken into account.
I see. At the end of the day, all the LLM does in your case is convert a text/document/sentence into a high-dimensional semantic embedding. The part afterwards, e.g. k-Means clustering of the vectors, is actually very easy to code yourself fully; check out e.g. stackoverflow.com/questions/53508331/… (if NumPy is okay; pure Python is also possible).
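For illustration, a minimal k-Means in NumPy is only a few lines; a sketch, not production code (no convergence check, random initialization):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-Means: returns (centroids, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```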
So what you could do is first use a very basic approach to encode each document as a vector, e.g. TF-IDF, which can also be done in pure Python (see github.com/geekan/pytfidf/blob/master/tfidf.py), then apply k-Means to it. You can do the entire thing with just NumPy or even in pure Python. Just be aware that this is very basic and won't come close to the clustering quality of a more modern LLM-based approach. If you're allowed to use an LLM in the way you described, that would surely improve the quality, but it would still be worse than the full LLM approach.
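The TF-IDF step mentioned above also fits in pure Python; a sketch using the common tf * log(N/df) weighting (the linked repo may differ in details):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per doc."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors
```

A term that appears in every document (like "a" below) gets weight 0, which is exactly the down-weighting of uninformative words that TF-IDF is for.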
Can you further explain what you mean by the "full LLM" approach?
