PyTorch: convert a pd.DataFrame of variable-length sequences to a tensor

Question

I have a pandas DataFrame like the one below and want to convert it to a torch.Tensor so I can pass it to an embedding layer.

# output first 5 rows examples
print(df['col'].head(5))

                      col
0             [a, bc, cd]
1      [d, ed, fsd, g, h]
2  [i, hh, ihj, gfw, hah]
3                 [a, cb]
4                   [sad]



train_tensor = torch.from_numpy(train)

But it gets an error:

TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.

It seems that from_numpy() doesn't support variable-length sequences.
What is the proper way to initialize a tensor from this data?
Once I have the tensor, I plan to pad the variable-length sequences and pass them through an embedding layer.
Could anyone help me?
Thanks in advance.

What exactly is train? And what are those 5 literal arrays? Can we get a more precise code snippet?
– Jean Bouvattier, Commented Jul 17, 2020 at 7:30

mujjiga · Accepted Answer · 2020-07-17 08:07:37Z

There are multiple steps involved here:

Words to IDs

Pretrained: If you are using pretrained embeddings such as GloVe or word2vec, you will have to map each word to its ID in that vocabulary so that the embedding layer can load the pretrained vectors.
Own embeddings: If you want to train your own embeddings, you will have to map each word to an ID and save the map for later use (during prediction). This map is normally called the vocabulary.

# Build our own vocabulary: map each word to an integer ID
def to_vocabulary_id(df):
  word2id = {}
  sentences = []
  for v in df['col'].values:
    row = []
    for w in v:
      if w not in word2id:
        # IDs start at 1 so that 0 stays free for padding
        word2id[w] = len(word2id)+1
      row.append(word2id[w])
    sentences.append(row)
  return sentences, word2id


import pandas as pd

df = pd.DataFrame({'col': [
    ['a', 'bc', 'cd'],
    ['d', 'ed', 'fsd', 'g', 'h'],
    ['i', 'hh', 'ihj', 'gfw', 'hah'],
    ['a', 'cb'],
    ['sad']]})
sentences, word2id = to_vocabulary_id(df)
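The answer mentions saving the map for use at prediction time. A minimal sketch of what that could look like, assuming a word2id dictionary like the one returned by to_vocabulary_id. The JSON round trip, the encode helper, and the policy of mapping unseen words to a reserved unk_id are all assumptions of this sketch, not part of the original answer:

```python
import json

# Vocabulary as to_vocabulary_id would build it for the first two example rows
# (IDs start at 1, 0 is left free for padding).
word2id = {'a': 1, 'bc': 2, 'cd': 3, 'd': 4, 'ed': 5, 'fsd': 6, 'g': 7, 'h': 8}

# Assumption: reserve one extra ID for words never seen while building the vocabulary.
unk_id = len(word2id) + 1

# Round trip through JSON; with a file, json.dump / json.load work the same way.
saved = json.dumps(word2id)
loaded = json.loads(saved)

def encode(words, vocab, unk):
    """Map a list of words to IDs, falling back to `unk` for unknown words."""
    return [vocab.get(w, unk) for w in words]

encoded = encode(['a', 'zzz', 'h'], loaded, unk_id)
print(encoded)  # [1, 9, 8]
```

If an unknown-word ID like this is used, the embedding layer needs one more row to hold it.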

Embedding layer

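If the vocabulary size is, say, 100 and the embedding size is 8, the embedding layer is created as shown below. Note that the first argument must be larger than the highest ID, so it has to be at least len(word2id) + 1 when 0 is used for padding.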

embedding = nn.Embedding(100, 8)
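The 100 above is only a placeholder. A sketch of sizing the layer from the vocabulary instead, assuming word2id comes from to_vocabulary_id (IDs start at 1, so 0 is free for padding). The padding_idx argument is an addition of this sketch and is not used in the original answer:

```python
import torch
import torch.nn as nn

# Small stand-in for the word2id map built earlier (IDs start at 1).
word2id = {'a': 1, 'bc': 2, 'cd': 3}

# +1 because ID 0 is reserved for padding, so valid indices are 0..len(word2id).
vocab_size = len(word2id) + 1

# padding_idx=0 initializes the vector for ID 0 to zeros and excludes it from gradient updates.
embedding = nn.Embedding(vocab_size, 8, padding_idx=0)

ids = torch.tensor([[1, 2, 3], [1, 0, 0]])  # second row is padded with 0
out = embedding(ids)
print(out.shape)                      # torch.Size([2, 3, 8])
print(out[1, 1].abs().sum().item())   # 0.0, the padding vector
```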

Pad variable-length sentences with 0 and create a tensor

data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)
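To make the effect of padding concrete, here is a small sketch using the IDs that to_vocabulary_id assigns to the first, fourth and fifth example rows. Keeping the original lengths and a boolean mask is an assumption about what later steps might need, not something the original answer does:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# IDs for ['a', 'bc', 'cd'], ['a', 'cb'] and ['sad'] under the vocabulary above.
sentences = [[1, 2, 3], [1, 14], [15]]

# Original lengths, recorded before padding.
lengths = torch.tensor([len(s) for s in sentences])

# Shorter rows are filled with 0 up to the length of the longest row.
data = pad_sequence([torch.tensor(s, dtype=torch.long) for s in sentences],
                    batch_first=True, padding_value=0)
print(data.tolist())  # [[1, 2, 3], [1, 14, 0], [15, 0, 0]]

# True where a real token sits, False on padding (valid because no word has ID 0).
mask = data != 0
print(mask.sum(dim=1).tolist())  # [3, 2, 1], the same as lengths
```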

Run through the embedding layer

Putting it all together:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Pad every sentence with 0 up to the length of the longest one
data = pad_sequence([torch.LongTensor(s) for s in sentences], batch_first=True, padding_value=0)

embedding = nn.Embedding(100, 8)
embedding(data).shape

Output:

torch.Size([5, 5, 8])

As you can see, we passed 5 sentences and the maximum length is 5, so we get embeddings of size 5 x 5 x 8, i.e. 5 sentences of 5 words, each word having an embedding of size 8.

Victor Zuanazzi · Accepted Answer · 2020-07-17 07:51:50Z

There are a number of issues with what you want to do:

Torch tensors (as described in the error) do not store strings, only numbers.
Torch tensors are mathematical tensors (multi-dimensional matrices), which means they have a well-defined shape (you cannot store rows of different lengths).
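Both points can be reproduced in a few lines. This is only an illustrative sketch, and the exact error messages may differ between PyTorch versions:

```python
import numpy as np
import torch

# 1) Strings: a NumPy array of str cannot be converted to a tensor.
string_error = None
try:
    torch.from_numpy(np.array(['a', 'bc', 'cd']))
except TypeError as err:
    string_error = type(err).__name__
print(string_error)  # TypeError

# 2) Ragged rows: nested lists of different lengths have no single shape.
ragged_error = None
try:
    torch.tensor([[1, 2, 3], [4, 5]])
except ValueError as err:
    ragged_error = type(err).__name__
print(ragged_error)  # ValueError
```

Mapping words to integer IDs addresses the first point, and padding the rows to a common length addresses the second.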

I would recommend taking a look at how to train NLP (Natural Language Processing) models in one of these tutorials: https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html They cover the theory and practice of word2vec techniques and how to use them for different machine learning tasks.

I hope that helps =)
