My model is erroring out on validation data but not test data

Jordan_Howell · December 26, 2019, 1:14pm

Hello,

I have trained and tested a model on categorical and numerical data. Now I’m trying to run my validation data through the model with the same script as my test data but it’s erroring out.

Here is the script I’m running for the validation run:

```python
# Instantiate the model class
model = Data_Only_Model(
    embedding_size=train_categorical_embedding_sizes,
    num_numerical_cols=val_numerical_data.shape[1],
    output_size=2,
    layers=[256, 128, 64, 32],
)
# Load the model from the hard drive
model.load_state_dict(torch.load('D:\\CIS inspection images 0318\\self_build\\data_only_model.pt'))
model = model.eval()
model = model.cuda()

# Make predictions
val_preds = []
with torch.no_grad():
    val_categorical_data = val_categorical_data.cuda()
    val_numerical_data = val_numerical_data.cuda()
    val_target = val_target.cuda()
    y_val = model(val_categorical_data, val_numerical_data)
    y_val = y_val.data
    val_preds.append(np.exp(y_val.cpu().data.numpy()))
    loss = criterion(y_val, val_target)
print(f'Loss: {loss:.8f}')
```

Here is the error traceback:

```
RuntimeError                              Traceback (most recent call last)
<ipython-input-181-72bf422e9737> in <module>
      4     val_numerical_data = val_numerical_data.cuda()
      5     val_target = val_target.cuda()
----> 6     y_val = model(val_categorical_data, val_numerical_data)
      7     y_val = y_val.data
      8     val_preds.append(np.exp(y_val.cpu().data.numpy()))

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

<ipython-input-157-ea50eaf0e182> in forward(self, x_categorical, x_numerical)
     57         #concatenating numerical and categorical columns
     58         x = torch.cat([x, x_numerical], 1)
---> 59         x = self.layers(x)
     60         x = F.log_softmax(x)
     61         return x

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\linear.py in forward(self, input)
     85 
     86     def forward(self, input):
---> 87         return F.linear(input, self.weight, self.bias)
     88 
     89     def extra_repr(self):

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in linear(input, weight, bias)
   1368     if input.dim() == 2 and bias is not None:
   1369         # fused op is marginally faster
-> 1370         ret = torch.addmm(bias, input, weight.t())
   1371     else:
   1372         output = input.matmul(weight.t())

RuntimeError: CUDA error: device-side assert triggered```

I'm not sure what I'm doing wrong.

ptrblck · December 27, 2019, 3:40am

Could you run the code on the CPU, as this might yield a clearer error message?
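
For illustration, here is a minimal sketch of why a CPU run can help. It assumes (as a guess, not something the traceback confirms) that the assert comes from an embedding lookup with an index outside the table. The `embedding` layer, `bad_indices` tensor, and `try_lookup` helper are hypothetical stand-ins, not the model from the post. On the CPU the lookup fails right away with a Python exception, whereas CUDA errors are reported asynchronously and can surface at a later call such as `F.linear`.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one embedding table: 3 categories seen in training.
embedding = nn.Embedding(num_embeddings=3, embedding_dim=2)

# Index 5 is outside the table (valid indices are 0, 1, 2).
bad_indices = torch.tensor([0, 2, 5])


def try_lookup(emb, idx):
    """Return the error raised by the lookup as a string, or None if it succeeds."""
    try:
        emb(idx)
    except (IndexError, RuntimeError) as err:  # exception type varies by PyTorch version
        return f"{type(err).__name__}: {err}"
    return None


# Keep both the layer and the indices on the CPU so the failure is synchronous.
cpu_error = try_lookup(embedding.cpu(), bad_indices.cpu())
print(cpu_error)
```

For the script above, the equivalent change would be to drop the `.cuda()` calls (or replace them with `.cpu()`) and rerun the same cell.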

Jordan_Howell · December 27, 2019, 1:48pm

When I run this on the CPU, I get the following:

I’ve tried this 3 times with the same result.

Jordan_Howell · December 27, 2019, 2:21pm

OK, I think I’ve found something. To load the saved parameters, I have to instantiate the model with the training categorical embedding sizes. When I skip `model.load_state_dict()` and just run the validation data through a freshly instantiated model built with the validation embedding sizes, it works (on the CPU; I haven’t tried the GPU). But I can’t load the state_dict into a model built with the validation embedding sizes, and I can’t run the validation data through a model built with the training embedding sizes.

Does that make sense?

If so, I guess this raises the question of how to train a model that can handle new data containing categories that were not present in the training data, since those new categories would produce different categorical embedding sizes.
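
One common way to approach this, sketched here under assumptions rather than taken from the model in this thread, is to fix the category-to-index mapping from the training data only and reserve one extra index for anything unseen. The names `UNKNOWN`, `build_vocab`, `encode`, and the example color values are all hypothetical. The embedding table size then depends only on the training data, so the saved state_dict still fits when new data arrives.

```python
UNKNOWN = 0  # reserved index for categories not seen during training


def build_vocab(train_values):
    """Map each distinct training category to an index, starting at 1."""
    return {cat: i for i, cat in enumerate(sorted(set(train_values)), start=1)}


def encode(values, vocab):
    """Encode categories with the training vocabulary; unseen ones map to UNKNOWN."""
    return [vocab.get(v, UNKNOWN) for v in values]


# Hypothetical example column
train_colors = ["red", "green", "blue", "green"]
val_colors = ["green", "purple", "red"]  # "purple" never appeared in training

vocab = build_vocab(train_colors)

# Number of rows the embedding table needs: known categories plus the UNKNOWN slot.
num_embeddings = len(vocab) + 1

train_codes = encode(train_colors, vocab)
val_codes = encode(val_colors, vocab)

# Every code is guaranteed to be a valid row index of the embedding table.
assert max(train_codes + val_codes) < num_embeddings
```

Note that the `UNKNOWN` row only learns something useful if some training rows are mapped to it, for example by treating rare training categories as unknown.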