Embedding Error Index out of Range in self

I tried the suggestions from Stack Overflow and other forum threads, but my issue still wasn't resolved. I am a beginner, so please help me understand what went wrong.

id_2_token = dict(enumerate(set(n for name in names for n in name),1))
token_2_id = {value:key for key,value in id_2_token.items()}
print(len(id_2_token))
print(len(token_2_id))

Output :

56
56
features_id,target_id = batch_maker(names) #batching function
print(features_id.shape) #Shape - [124,64,17]

#RNN MODEL

class CharMaker(nn.Module):
    def __init__(self, input_size, hidden_size, output_size,n_layers=1):
        super(CharMaker,self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.encoder = nn.Embedding(self.input_size, self.hidden_size)
        
        self.rnn = nn.RNN(self.hidden_size,self.hidden_size, num_layers=1,batch_first=True)
        self.linear = nn.Linear(self.hidden_size, self.output_size)
        
        self.softmax = torch.nn.Softmax(dim=output_size)
        
    def forward(self, inputs, hidden):
        batch_size = inputs.size(0)
        
        if hidden == None:
            hidden = torch.zeros(1,inputs.size(1),self.hidden_size)
        print(inputs.shape)
        encoded = self.encoder(inputs)
        output, hidden = self.rnn(encoded, hidden)
        outout = self.linear(hidden,self.output_size)
        
        output = self.softmax(output)
        
        return output,hidden

Initializing my model

cm = CharMaker(input_size=len(token_2_id),hidden_size=20,output_size=len(token_2_id))

Reshaping and Testing the Data

hidden = None
names_id_tensor = torch.from_numpy(features_id[0])
names_id_tensor = names_id_tensor.reshape(names_id_tensor.shape[0],names_id_tensor.shape[1],1)

Shapes

print(names_id_tensor.shape) #torch.Size([64, 17, 1])
output,hidden = cm(names_id_tensor,hidden)

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-139-d0d9f66f3192> in <module>
----> 1 output,hidden = cm(names_id_tensor,hidden)

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

<ipython-input-129-f8a6cdd31a7a> in forward(self, inputs, hidden)
     19             hidden = torch.zeros(1,inputs.size(1),self.hidden_size)
     20         print(inputs.shape)
---> 21         encoded = self.encoder(inputs)
     22         output, hidden = self.rnn(encoded, hidden)
     23         outout = self.linear(hidden,self.output_size)

~/.local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115 
    116     def extra_repr(self):

~/.local/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1722         # remove once script supports set_grad_enabled
   1723         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1724     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1725 
   1726 

IndexError: index out of range in self

Could you print the min and max values of names_id_tensor?
The num_embeddings is currently set to self.input_size, which is the length of token_2_id.
Note that num_embeddings has to be larger than the max. input index you would like to provide.

Hi @ptrblck, I understand what you are saying, but I still have an issue with it.

This was just a single batch to test whether the model works. The highest value in that batch was 53, while my vocab (token_2_id) size is 56. What happens if another batch contains a value higher than 53? How would I resolve that?

Can you please guide me through this?

print(torch.max(names_id_tensor))
print(torch.min(names_id_tensor))

Output

tensor(53)
tensor(1)

You cannot pass indices higher than num_embeddings-1, since the embedding layer works as a lookup table. The input is used to index the corresponding embedding vector, so you should set num_embeddings to the highest index you expect in your use case plus one.
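
A minimal sketch of the lookup-table behaviour, assuming a toy table with 3 rows of 4 values each (the sizes and the name `table` are illustrative and not from the thread). Every index from 0 to num_embeddings-1 selects one row of the weight matrix, and an index outside that range has no row to return:

```python
import torch
import torch.nn as nn

# Toy lookup table: 3 rows (num_embeddings), each a 4-value vector (embedding_dim).
table = nn.Embedding(num_embeddings=3, embedding_dim=4)

# Valid indices are 0, 1 and 2; the output gains a trailing dimension of size 4.
valid = torch.tensor([0, 1, 2])
vectors = table(valid)
print(tuple(vectors.shape))

# Each output row is simply the matching row of the weight matrix.
same_row = torch.equal(vectors[2], table.weight[2])

# Index 3 has no row in a 3-row table, so the lookup raises IndexError on CPU.
try:
    table(torch.tensor([3]))
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
print(same_row, out_of_range_failed)
```

So a sequence whose distinct values are 0, 1 and 2 needs a table with 3 rows, not 2.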


Got it. I just had to fix the embedding layer. @ptrblck Thanks, it's solved now.

Previously my vocab was created by enumerating from 1. If I enumerate from 0 instead, I can keep the same embedding size. If I had insisted on enumerating from 1, all I had to do was:

self.encoder = nn.Embedding(self.input_size+1, self.hidden_size) #[57,20]; only 56 rows are used, since index 0 stays unused.
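
The effect of the enumeration start can be checked without PyTorch. This is only a sketch: the two-name list is a made-up stand-in for the thread's `names`, and the set is sorted just to make the order reproducible:

```python
names = ["anna", "bob"]  # hypothetical stand-in for the thread's `names`

# Distinct characters, sorted so the index assignment is stable between runs.
chars = sorted(set(ch for name in names for ch in name))

# Enumerating from 1: indices run 1..len(chars), so the table needs len(chars) + 1 rows.
id_2_token_from_1 = dict(enumerate(chars, 1))
rows_needed_from_1 = max(id_2_token_from_1) + 1

# Enumerating from 0: indices run 0..len(chars) - 1, so len(chars) rows are enough.
id_2_token_from_0 = dict(enumerate(chars))
rows_needed_from_0 = max(id_2_token_from_0) + 1

print(len(chars), rows_needed_from_1, rows_needed_from_0)  # 4 5 4
```

In both cases the number of rows is the largest index plus one, which is the value to pass as num_embeddings.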

Solution :

class CharMaker(nn.Module):
    def __init__(self, input_size, hidden_size, output_size,n_layers=1):
        super(CharMaker,self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        print("Input & Output Size", input_size,output_size)
        print("Hidden Size ", hidden_size )
        
        self.encoder = nn.Embedding(self.input_size, self.hidden_size) #[56,20]
        
        self.rnn = nn.RNN(self.hidden_size,self.hidden_size, num_layers=1,batch_first=True) #[20,20]
        self.linear = nn.Linear(self.hidden_size, self.output_size) #[20,56]
        
        
    def forward(self, inputs, hidden):
        batch_size = inputs.size(0)
        
        if hidden is None:
            hidden = torch.zeros(1,batch_size,self.hidden_size)
        #print("Original Input : ",inputs.shape)
        encoded = self.encoder(inputs)
        #print("Encoded Input : ",encoded.shape)
        output, hidden = self.rnn(encoded, hidden)
        output = self.linear(output)
        
        return output,hidden

When using random data, I noticed that nn.Embedding only accepts non-negative indices. The error went away when I used .abs() or passed only natural numbers.

@ptrblck suppose the distinct values in my sequence are 0, 1, 2.
What should the size of the embedding be here: 3 or 2?

The num_embeddings should be 3 so that all indices can be used to index the weight matrix.


Even though my indices are lower than num_embeddings-1 within a batch, I am still getting IndexError: index out of range in self. Here is my data and code. The batch size is 20 and the embedding for the Age category has 70 rows, so I don't know why the indexing error is thrown.

data = pd.read_csv('Churn_Modelling.csv')
print("Shape:", data.shape)
data.head()

X_train = data[['Age','Balance']]
y_train = pd.DataFrame(data['Exited'])
X_train

Shape: (10000, 14)
   Age    Balance
   ---  ---------
0   42       0.00
1   41   83807.86
2   42  159660.80
3   39       0.00
4   43  125510.82

10000 rows × 2 columns

y_train

   Exited
   ------
0       1
1       0
2       1
3       0
4       0

10000 rows × 1 columns

features  = ['Age']
for col in features:
    X_train.loc[:,col] = X_train.loc[:,col].astype('category')
X_train.dtypes

Age        category
Balance     float64
dtype: object

embedded_cols = {n: len(col.cat.categories) for n,col in X_train[features].items()}
embedded_cols

{'Age': 70}


class ShelterOutcomeDataset(Dataset):
    def __init__(self, X, Y, embedded_col_names):
        X = X.copy()
        self.X1 = X.loc[:,embedded_col_names].copy().values.astype(np.int64) #categorical columns
        self.X2 = X.drop(columns=embedded_col_names).copy().values.astype(np.float32) #numerical columns
        self.y = Y.copy().values.astype(np.int64)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]

embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
embedding_sizes

[(70, 35)]   

train_ds = ShelterOutcomeDataset(X_train,y_train , ['Age'])

class testNet(nn.Module):
    def __init__(self, emb_dims, n_cont):
        super().__init__()
        
        self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories,size in emb_dims])
        no_of_embs = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
        self.n_emb, self.n_cont = no_of_embs, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont,6)
        self.lin2 = nn.Linear(6, 4)
        self.lin3 = nn.Linear(4, 2)

        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(6)
        self.bn3 = nn.BatchNorm1d(4)
        self.emb_drop = nn.Dropout(0.6)
        self.drops = nn.Dropout(0.3)

    def forward(self, x_cat, x_cont):

        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        
        x = self.emb_drop(x)
        # batch normalization over continuous features
        x2 = self.bn1(x_cont)
        # concatenate the embedding and the continuous features along dim 1,
        # i.e. we are concatenating columns
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.lin3(x)

        return x
        
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
def train_model(model, optim, train_dl):
    model.train()
    total = 0
    sum_loss = 0
    for cat, cont, y in train_dl:
        batch = y.shape[0]
        
        print(cat.size())  # <--- size of the features which have to be embedded

        y = y.to(torch.float32)
        
        output = model(cat, cont)
        _,pred = torch.max(output,1)
        
        loss = criterion(output, y.squeeze(1).long())
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch*(loss.item())
    return sum_loss/total,pred



def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for epoch in range(epochs): 
        loss,pred = train_model(model, optim, train_dl)
        if (epoch+1) % 5 ==0:
            print(f'epoch : {epoch+1},training loss : {loss}, output : {output}')

batch_size = 20
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)
#valid_dl = DataLoader(valid_ds, batch_size=batch_size,shuffle=True)    

train_dl = DeviceDataLoader(train_dl, device)
# valid_dl = DeviceDataLoader(valid_dl, device)

# model = ShelterOutcomeModel(embedding_sizes,0)

model = testNet(embedding_sizes,1)
print(model)

from collections import defaultdict
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
# to_device(model, device)

train_loop(model, epochs=100, lr=0.01, wd=0.00001)

testNet(
  (embeddings): ModuleList(
    (0): Embedding(70, 35)
  )
  (lin1): Linear(in_features=36, out_features=6, bias=True)
  (lin2): Linear(in_features=6, out_features=4, bias=True)
  (lin3): Linear(in_features=4, out_features=2, bias=True)
  (bn1): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (emb_drop): Dropout(p=0.6, inplace=False)
  (drops): Dropout(p=0.3, inplace=False)
)
torch.Size([20, 1])

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3281-888e52d4559c> in <module>
     74 # to_device(model, device)
     75 
---> 76 train_loop(model, epochs=100, lr=0.01, wd=0.00001)

<ipython-input-3281-888e52d4559c> in train_loop(model, epochs, lr, wd)
     46     optim = get_optimizer(model, lr = lr, wd = wd)
     47     for epoch in range(epochs):
---> 48         loss,pred = train_model(model, optim, train_dl)
     49         if (epoch+1) % 5 ==0:
     50             print(f'epoch : {epoch+1},training loss : {loss}, output : {output}')

<ipython-input-3281-888e52d4559c> in train_model(model, optim, train_dl)
     15 
     16 
---> 17         output = model(cat, cont)
     18         _,pred = torch.max(output,1)
     19 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

<ipython-input-3280-681fc4d5712d> in forward(self, x_cat, x_cont)
     30 
     31 
---> 32         x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
     33         x = torch.cat(x, 1)
     34 

<ipython-input-3280-681fc4d5712d> in <listcomp>(.0)
     30 
     31 
---> 32         x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
     33         x = torch.cat(x, 1)
     34 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    122 
    123     def forward(self, input: Tensor) -> Tensor:
--> 124         return F.embedding(
    125             input, self.weight, self.padding_idx, self.max_norm,
    126             self.norm_type, self.scale_grad_by_freq, self.sparse)

~/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815 
   1816 

IndexError: index out of range in self

An index value of 70 for an embedding layer size of 70 won’t work, since the valid indices would be in the range [0, 69], so you would either need to increase the num_embeddings value or clip the input.
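
Both options can be sketched as follows. The index values are made up for illustration (the thread does not state the real maximum), and `torch.clamp` is one way to do the clipping:

```python
import torch

num_embeddings = 70
raw = torch.tensor([3, 69, 70, 92])  # made-up indices, two of them out of range

# Option 1: clip the input so no index exceeds num_embeddings - 1.
clipped = torch.clamp(raw, min=0, max=num_embeddings - 1)
print(clipped.tolist())  # [3, 69, 69, 69]

# Option 2: size the table from the largest index that can actually occur.
required_rows = int(raw.max()) + 1
print(required_rows)  # 93
```

Clipping makes every out-of-range value share the last row of the table, so it hides the mismatch rather than explaining it. Enlarging the table keeps the values distinct but leaves unused rows for indices that never appear.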

@ptrblck even if I increase num_embeddings by 1, I get the same error:

self.embeddings = nn.ModuleList([nn.Embedding(categories+1, size) for categories,size in emb_dims])

Interestingly, when I increase it by 23 it does not give an error:

class testNet(nn.Module):
    def __init__(self, emb_dims, n_cont):
        super().__init__()

        for categories,size in emb_dims:
            print(f'catagrorize is {categories}, size is {size}')
        
        self.embeddings = nn.ModuleList([nn.Embedding(categories+23, size) for categories,size in emb_dims])
        no_of_embs = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
        self.n_emb, self.n_cont = no_of_embs, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont,6)
        self.lin2 = nn.Linear(6, 4)
        self.lin3 = nn.Linear(4, 2)

        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(6)
        self.bn3 = nn.BatchNorm1d(4)
        self.emb_drop = nn.Dropout(0.6)
        self.drops = nn.Dropout(0.3)
        
    def forward(self, x_cat, x_cont):
        # take the embedding list and grab an embedding and pass in our single row of data.
        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        
        x = self.emb_drop(x)
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.lin3(x)

        return x
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
def train_model(model, optim, train_dl):
    model.train()
    total = 0
    sum_loss = 0
    for cat, cont, y in train_dl:
        batch = y.shape[0]

        y = y.to(torch.float32)
        
        output = model(cat, cont)
        _,pred = torch.max(output,1)

        loss = criterion(output, y.squeeze(1).long())
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch*(loss.item())
    return sum_loss/total,pred


def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for epoch in range(epochs): 
        loss,pred = train_model(model, optim, train_dl)
        if (epoch+1) % 5 ==0:
            print(f'epoch : {epoch+1},training loss : {loss}, output : {output}')

batch_size = 100
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)

train_dl = DeviceDataLoader(train_dl, device)



print(f'embedding_sizes is {embedding_sizes}')
model = testNet(embedding_sizes,1)


from collections import defaultdict
opt = torch.optim.Adam(model.parameters(), lr=1e-2)


train_loop(model, epochs=100, lr=0.01, wd=0.00001)
embedding_sizes is [(70, 35)]
catagrorize is 70, size is 35
epoch : 5,training loss : 0.4648001512885094, output : 0.6111002564430237
epoch : 10,training loss : 0.4541498306393623, output : 0.6111002564430237
epoch : 15,training loss : 0.45384191155433656, output : 0.6111002564430237
epoch : 20,training loss : 0.45079687386751177, output : 0.6111002564430237
epoch : 25,training loss : 0.4511949673295021, output : 0.6111002564430237
epoch : 30,training loss : 0.45295464009046554, output : 0.6111002564430237
epoch : 35,training loss : 0.45299509912729263, output : 0.6111002564430237
epoch : 40,training loss : 0.45105998665094377, output : 0.6111002564430237
epoch : 45,training loss : 0.4528631994128227, output : 0.6111002564430237
epoch : 50,training loss : 0.4509485891461372, output : 0.6111002564430237
epoch : 55,training loss : 0.4534462735056877, output : 0.6111002564430237
epoch : 60,training loss : 0.4507604452967644, output : 0.6111002564430237
epoch : 65,training loss : 0.4527029529213905, output : 0.6111002564430237
epoch : 70,training loss : 0.4511090362071991, output : 0.6111002564430237
epoch : 75,training loss : 0.4510712164640427, output : 0.6111002564430237
epoch : 80,training loss : 0.4523083609342575, output : 0.6111002564430237
epoch : 85,training loss : 0.4539755055308342, output : 0.6111002564430237
epoch : 90,training loss : 0.4536020648479462, output : 0.6111002564430237
epoch : 95,training loss : 0.4528249257802963, output : 0.6111002564430237
epoch : 100,training loss : 0.45215955764055255, output : 0.6111002564430237

This could mean that the max. index is larger than you expect, so you could add assert statements to your code to narrow down the largest values.


@ptrblck thanks for pointing that out. Could you please show me that statement? I am very new to PyTorch, so how and where do I need to add the assert statements?

In your training loop you could use something like this:

# works
target = torch.tensor([69, 68])
assert (target<70).all(), "target: {} invalid".format(target)

# fails
target = torch.tensor([69, 70])
assert (target<70).all(), "target: {} invalid".format(target)

which would raise an error for target values at or above the threshold.
Once you run into these errors, check why you expect the values to be <70 and why they apparently are not in this range.


@ptrblck thanks mate, I'll work on it and get back to you here if I need more assistance. You are great.

@ptrblck, can you please help me with where to put the code you suggested above? Since you mentioned the training loop, I guess it could look like this:

def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for epoch in range(epochs): 
        
        target = torch.tensor([69, 68])  # <---
        assert (target<70).all(), "target: {} invalid".format(target)  # <---
        
        loss,pred = train_model(model, optim, train_dl)
        if (epoch+1) % 5 ==0:
            print(f'epoch : {epoch+1},training loss : {loss}, output : {output}')
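
One possible placement, sketched under assumptions: the check only helps if it runs on the real `cat` batch inside the `for cat, cont, y in train_dl:` loop of `train_model`, right before `model(cat, cont)`, rather than on a constant tensor. `assert_valid_indices` is a hypothetical helper, and the two batches below are made up to stand in for the DataLoader:

```python
import torch

def assert_valid_indices(indices, num_embeddings):
    """Fail with a readable message if any index is outside [0, num_embeddings - 1]."""
    lo, hi = int(indices.min()), int(indices.max())
    assert 0 <= lo and hi < num_embeddings, (
        "indices span [{}, {}] but the valid range is [0, {}]".format(
            lo, hi, num_embeddings - 1
        )
    )

# Stand-in for `for cat, cont, y in train_dl:`; these batches are invented.
batches = [torch.tensor([[18], [42]]), torch.tensor([[39], [92]])]

failures = []
for step, cat in enumerate(batches):
    try:
        # In real training code this call would sit just before model(cat, cont).
        assert_valid_indices(cat, num_embeddings=70)
    except AssertionError as err:
        failures.append((step, str(err)))

print(failures)
```

The message reports the observed range, which points directly at the batch and the value that the embedding layer cannot index.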

@ptrblck I fixed the issue. It was caused by not label-encoding the Age column: the data needs to be represented by its index before it is passed to the embedding layer, and label encoding provides that index. Some of the Age values that were not label-encoded initially must have been greater than 69, which caused this problem. Is my understanding correct? Do we really need to label-encode columns before embedding?

You don’t need to use a label encoding, but it can be useful if you want to map your targets to [0, nb_classes-1], which is expected by the embedding layer (and usually also loss functions).
Alternatively, you could also take the max. value of the “unencoded” target, add one, and use that as num_embeddings in case you expect to get the missing values in the future.
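
A small pure-Python sketch of what such an encoding does. The first five ages come from the table earlier in the thread, while 92 is a made-up value added to show the gap between the number of categories and the largest raw value:

```python
raw_ages = [42, 41, 42, 39, 43, 92]  # first five from the thread's table; 92 is made up

# Map each distinct value to a contiguous index in [0, n_categories - 1].
categories = sorted(set(raw_ages))
age_to_index = {age: idx for idx, age in enumerate(categories)}
encoded = [age_to_index[age] for age in raw_ages]

n_categories = len(categories)
print(encoded)                      # [2, 1, 2, 0, 3, 4]
print(n_categories, max(raw_ages))  # 5 92
```

The encoded values always fit a table with `n_categories` rows, whereas the raw values would need `max(raw_ages) + 1` rows. For a pandas categorical column, `Series.cat.codes` exposes the same kind of contiguous integer codes, which may be a convenient alternative to a hand-written mapping.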


@ptrblck can you please shed some light on this: https://discuss.pytorch.org/t/embedding-layer/121969