I've been looking at how to prepare a dataset for deep learning models.

If we have data like this,

data = [['this', 'is'], ['not', 'with']]

the tutorials I've seen first compute the frequency of each word in the corpus. Each word is then assigned an integer label based on its frequency.

The most frequent word is assigned 1, the next most frequent 2, and so on.

My question is: why do we need to do that? Can't we just assign integer values to words at random? Does following this rule increase accuracy?
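To make the scheme concrete, here is a minimal sketch of what I understand it to be. The corpus below is a hypothetical, slightly larger version of `data` (so that the frequencies actually differ), and `build_frequency_index` is a name I made up for illustration, not a function from any library.

```python
from collections import Counter

# Hypothetical corpus: an extended version of the example data so that
# word frequencies differ (this: 3, is: 2, not: 2, with: 1).
data = [['this', 'is'], ['not', 'with'], ['this', 'is', 'not'], ['this']]

def build_frequency_index(sentences):
    """Map each word to an integer rank, starting at 1 for the most frequent."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # most_common() orders words by descending count.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _count) in enumerate(ranked, start=1)}

word_index = build_frequency_index(data)

# Replace every word with its integer label.
encoded = [[word_index[word] for word in sentence] for sentence in data]
```

Here `'this'` gets label 1 because it occurs most often, and `'with'` gets the largest label because it occurs least often.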

1 Answer

I doubt it has any effect on accuracy, unless maybe you're doing something unusual later on.

I could see it having effects on:

  • performance: common words will be clustered together (near index zero) and hence are likely to end up in cache together
  • human interpretation/readability: strings/display output will tend to be "tidier", with common words needing fewer digits
  • easy handling of rare words: all index values over some threshold indicate the word is rare and can be mapped to some placeholder or ignored (depending on how the model handles this)
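The last point can be sketched roughly as follows, assuming the sequences have already been encoded with frequency-ranked indices (1 = most frequent). The helper name `clip_rare_words` and the placeholder value `0` are my own choices for illustration; some pipelines reserve 0 for padding, in which case a different placeholder would be needed.

```python
# Hypothetical placeholder index for rare / out-of-vocabulary words.
OOV_INDEX = 0

def clip_rare_words(encoded, max_index, oov_index=OOV_INDEX):
    """Keep indices up to max_index; map everything rarer to oov_index."""
    return [
        [index if index <= max_index else oov_index for index in sequence]
        for sequence in encoded
    ]

# Example input: sequences of frequency-ranked indices.
encoded = [[1, 2], [3, 4], [1, 2, 3], [1]]

# Keep only the two most frequent words.
clipped = clip_rare_words(encoded, max_index=2)
print(clipped)  # [[1, 2], [0, 0], [1, 2, 0], [1]]
```

With randomly assigned integers the same filtering would need a separate lookup of each word's frequency; with frequency-ranked indices a single comparison against the threshold is enough.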