How to assign integer codes to words in text data?

Question

I've been looking at how to prepare a dataset for deep learning models.

If we have data like this,

data = [['this', 'is'], ['not', 'with']]

First they compute the frequency of each word in the corpus. Each word is then assigned an integer label based on its frequency.

The most frequent word is assigned 1, the next most frequent 2, and so on.
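For concreteness, the scheme described above might look roughly like the following. This is only a minimal sketch using Python's standard library, not the code of any particular deep learning toolkit. The corpus is a slightly extended, made-up version of the one above (so that the word counts actually differ), and the names `counts`, `word_index` and `encoded` are chosen for illustration.

```python
from collections import Counter

# Hypothetical corpus: the example above, extended so the frequencies differ.
data = [['this', 'is'], ['not', 'with'], ['this', 'is', 'not'], ['this']]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for sentence in data for word in sentence)

# most_common() lists words from most to least frequent, so enumerating
# from 1 gives the most frequent word the label 1, the next one 2, etc.
word_index = {
    word: rank
    for rank, (word, _count) in enumerate(counts.most_common(), start=1)
}

# Replace every word with its integer label.
encoded = [[word_index[word] for word in sentence] for sentence in data]

print(word_index)
print(encoded)
```

Here `'this'` occurs most often and therefore gets label 1, while `'with'` occurs only once and gets the largest label.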

My question is: why do we need to do that? Can't we just assign integer values to words at random? Does following that rule increase accuracy?

Sam Mason · Accepted Answer · 2020-01-25 13:53:50Z


I doubt it has any effect on accuracy, unless maybe you're doing something unusual later on.

I could see it having effects on:

performance: common words will be clustered together (near the zeroth index) and hence are likely to end up in cache together
human interpretation/readability: strings/display output will tend to be "tidier", with common words needing fewer digits
easy handling of rare words: all index values over some threshold indicate the word is rare and can be mapped to some placeholder or ignored (depending on how the model handles this)
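The last point can be illustrated with a short sketch. It assumes a frequency-ranked index is already available; the `word_index` values, the cutoff `MAX_INDEX`, the placeholder `OOV_INDEX` and the helper `encode` are all made up for this example and do not come from any specific library.

```python
# Hypothetical frequency-ranked index (1 = most frequent word).
word_index = {'this': 1, 'is': 2, 'not': 3, 'with': 4}

MAX_INDEX = 3  # assumed cutoff: any rank above this is treated as "rare"
OOV_INDEX = 0  # assumed placeholder value for rare or unknown words


def encode(sentence, word_index, max_index=MAX_INDEX, oov_index=OOV_INDEX):
    """Map words to integers, collapsing rare and unseen words to a placeholder."""
    encoded = []
    for word in sentence:
        idx = word_index.get(word)
        # Because labels are ordered by frequency, one comparison is
        # enough to decide whether a word counts as rare.
        if idx is None or idx > max_index:
            idx = oov_index
        encoded.append(idx)
    return encoded


result = encode(['this', 'is', 'with', 'unseen'], word_index)
print(result)  # [1, 2, 0, 0]
```

With randomly assigned integers the same filtering would need a separate lookup of each word's count, since the index value itself would carry no information about frequency.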
