DataLoader issues

#1

Hi Team,
I have a multitask learning model. The input dataset is a Pandas DataFrame with three columns: Name, Label, and Rating. Each row in Name is a list of indices, each representing a particular word in the vocabulary. E.g., Name at row 0 has value = [32, ]

I have created a Dataset class so that a DataLoader can pick batches of data from Name, Label and Rating.

from torch.utils import data
from torch.utils.data import DataLoader  # needed for the DataLoader call further down

class Dataset(data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, name, label, rating):
        'Initialization'
        self.label = label
        self.rating = rating
        self.name = name
        self.len = name.shape[0]

    def __len__(self):
        'Denotes the total number of samples'
        return self.len

    def __getitem__(self, index):
        'Generates one sample of data'
        return self.name[index], self.label[index], self.rating[index]

The code that runs the DataLoader is given below:

batch_training_set = Dataset(training_set['Name'].values, training_set['Label'].values, training_set['Rating'].values)
data_train_loader = DataLoader(dataset=batch_training_set, batch_size=5, shuffle=False, num_workers=8)
for epoch in range(10):
    # initialize hidden state
    h = model.init_hidden()
    for name, label, rating in data_train_loader:
        label_pred, rating_pred, hidden = model(name, h)

The type of each row of the series df['Name'] is list. The values look like this:
[41371744, 41392624, 41395700, 40900963, 41375236, 40900965, 41392636, 41395722, 41395679, 41394992, 41392624, 41347884, 41390331, 41212312]
Should I convert them to an ndarray?

Also, in the loop for name, label, rating in data_train_loader:, isn’t the value returned in each iteration a list of tensors, where the number of rows represented in each tensor equals the batch_size I have chosen (5 in this case)? So isn’t it correct that name is a list of tensors like this: [tensor([25381763, 41395623, 40952159, 41342348, 41395721]), tensor([41335608, 41366986, 41386228, 41392861, 41386037]), tensor([41395497, 41394967, 41394252, 41395512, 39724715])]? But with this I am getting the following error:

Traceback (most recent call last):
  File "/home/centos/.pycharm_helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/home/centos/.pycharm_helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/centos/.pycharm_helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/centos/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/centos/project/model_dir/lstm_model.py", line 213, in <module>
    label_pred, rating_pred, hidden = model(name, h)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/centos/project/model_dir/lstm_model.py", line 135, in forward
    embed = self.embedding(sentence)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/lib64/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not list

My understanding is that I shouldn’t be passing a list of tensors, but if I don’t do that, then what is the purpose of the DataLoader?
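For context: when each sample is a plain Python list, the default collation groups the values position by position, so the batch comes back as one tensor per token position rather than one tensor per batch. One possible way around that is a custom collate_fn that pads the lists and returns a single tensor. The following is only a minimal sketch; ListDataset, pad_collate, the toy data, and the padding index 0 are assumptions made for illustration, not part of the original code.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ListDataset(Dataset):
    """Toy stand-in for the Dataset in the post; returns raw Python lists."""

    def __init__(self, names, labels, ratings):
        self.names, self.labels, self.ratings = names, labels, ratings

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        return self.names[index], self.labels[index], self.ratings[index]


def pad_collate(batch):
    """Turn a list of (name, label, rating) samples into three tensors."""
    names, labels, ratings = zip(*batch)
    # Pad every index list to the longest one in the batch.
    # Using 0 as the padding value is an assumption of this sketch.
    padded = pad_sequence(
        [torch.tensor(n, dtype=torch.long) for n in names],
        batch_first=True,
        padding_value=0,
    )
    return padded, torch.tensor(labels), torch.tensor(ratings)


# Made-up data: variable-length index lists, integer labels, float ratings.
names = [[3, 1, 4], [1, 5], [9, 2, 6, 5], [3, 5]]
labels = [0, 1, 0, 1]
ratings = [4.0, 2.5, 5.0, 3.0]

loader = DataLoader(ListDataset(names, labels, ratings), batch_size=2,
                    shuffle=False, collate_fn=pad_collate)
batches = list(loader)
first_names, first_labels, first_ratings = batches[0]
print(first_names.shape)  # (batch size, longest sequence in this batch)
```

With this, name arrives in the training loop as one LongTensor per batch, which is what an embedding layer expects. If 0 is a real word index in the vocabulary, a different padding value would be needed.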

(Arunava Chakraborty) #2

Skimmed through your errors.
Use torch.stack to combine the list of tensors into a single tensor.

Hope this helps!
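A small sketch of what that suggestion does, using the three tensors quoted in the first post. The variable names stacked and batch_first are only illustrative.

```python
import torch

# The list of per-position tensors quoted in the first post:
# 3 token positions, each holding the values for a batch of 5 rows.
name = [
    torch.tensor([25381763, 41395623, 40952159, 41342348, 41395721]),
    torch.tensor([41335608, 41366986, 41386228, 41392861, 41386037]),
    torch.tensor([41395497, 41394967, 41394252, 41395512, 39724715]),
]

# Stacking along a new first dimension gives (seq_len, batch).
stacked = torch.stack(name)

# Stacking along dim=1 gives (batch, seq_len) instead.
batch_first = torch.stack(name, dim=1)

print(stacked.shape, batch_first.shape)
```

Which layout is right depends on whether the recurrent layer in the model was created with batch_first=True; the post does not say, so both are shown.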

#3

I tried torch.stack, and the TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not list was resolved, but now there is another error in the embedding layer:

  File "/home/centos/project/model_dir/lstm_model.py", line 136, in forward
    embed = self.embedding(sent)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/lib64/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191

I am using a pretrained word2vec embedding built using Gensim.

def __init__(self, pretrained_emb, no_of_classes, hidden, no_layers):
    
    self.embedding = nn.Embedding.from_pretrained(pretrained_emb)

The error occurs at

def forward(self, sentence, hidden):
    embed = self.embedding(sentence)  # Error occurs here

Each word is embedded into a 128-dimensional vector. The shape of sentence is torch.Size([11, 5]) when the DataLoader batch size is 5.

The same error occurs even with batch size = 1.
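An "index out of range" error from nn.Embedding generally means that some value in the input is not a valid row of the embedding matrix: valid indices run from 0 to num_embeddings - 1. The values quoted earlier (around 41 million) look more like external IDs than row positions, though that is only a guess from the numbers shown. A minimal sketch of how to check this and remap the IDs follows; the random 6 x 128 matrix and the id_to_row mapping are stand-ins invented for illustration, and a real mapping would have to follow the row order of the word2vec matrix.

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained word2vec matrix: 6 words, 128 dimensions.
# The values are random; only the shape matters for this sketch.
pretrained_emb = torch.randn(6, 128)
embedding = nn.Embedding.from_pretrained(pretrained_emb)

# Raw values as they appear in the Name column (taken from the post).
raw_sentence = [41371744, 41392624, 41395700, 41392624]

# Valid indices are 0 .. num_embeddings - 1; anything outside that range
# triggers the "index out of range" error.
in_range = all(0 <= i < embedding.num_embeddings for i in raw_sentence)
print(in_range)

# Hypothetical fix: map each raw ID to a row of the embedding matrix.
# Here the rows are assigned in sorted order purely for demonstration.
id_to_row = {raw_id: row for row, raw_id in enumerate(sorted(set(raw_sentence)))}
sentence = torch.tensor([id_to_row[i] for i in raw_sentence], dtype=torch.long)

embed = embedding(sentence)  # one 128-dimensional vector per token
print(embed.shape)
```

Running the same range check on a real batch (for example, comparing sentence.max() against self.embedding.num_embeddings) should show quickly whether this is the cause.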