Simple code runs on GPU but not on CPU

I’m not sure whether this is a bug or not.

I need to deploy AWS Elastic Inference for our service. Elastic Inference requires using the CPU to load and run models.

Our code runs well on GPUs, but not on the CPU.

The simple code below runs on GPUs, but on the CPU it raises an “index out of range in self” error.

On the CPU, this raises the “index out of range in self” error:

import numpy as np
import torch
import torch.nn as nn

sinusoid_table = torch.FloatTensor(torch.Size([50 + 1, 512]))

pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True)
positions = torch.arange(200).expand(1, 200).contiguous() + 1
a = pos_emb(positions)
print(a)

On GPUs this runs well:

import torch
import torch.nn as nn

device = torch.device('cuda:0')

sinusoid_table = torch.FloatTensor(torch.Size([50 + 1, 512])).to(device)
pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True).to(device)
positions = torch.arange(200).expand(1, 200).contiguous()+1
positions=positions.to(device)
a= pos_emb(positions)
print(a)

I’m using PytorchECL 1.3.1 (with Elastic Inference support), which comes with the Amazon AMI.

I also tested with regular pytorch==1.5.0.

Both of them result in the “index out of range” error.

Any help is appreciated. Thank you.

The code raises an error on the CPU as well as the GPU using the latest stable version (1.7.0).
Note that assert statements in CUDA code were mostly disabled in 1.5.0 due to a bug, so your code doesn’t raise the proper error if you are using the GPU.

The reason for the error is that your weight lookup table in the embedding layer contains feature vectors for 51 indices, so your inputs are limited to [0, 50], while you are passing positions with values in the range [1, 200].
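As a rough illustration of the range problem, the sketch below checks the indices against the table size before the lookup and then sizes the table to cover every position. The helper name `check_indices` and the zero-filled tables are assumptions made for this example only; they are not from the original code.

```python
import torch
import torch.nn as nn

dim = 512

# Table with 51 rows, as in the question: valid indices are 0..50.
pos_emb = nn.Embedding.from_pretrained(torch.zeros(50 + 1, dim), freeze=True)

# Positions 1..200, as in the question.
positions = torch.arange(200).expand(1, 200).contiguous() + 1


def check_indices(indices, embedding):
    """Return True if every index is a valid row of the embedding table."""
    lowest, highest = int(indices.min()), int(indices.max())
    return 0 <= lowest and highest < embedding.num_embeddings


# The largest index (200) exceeds the last valid row (50).
in_range = check_indices(positions, pos_emb)
print(in_range)

# One possible fix: make the table large enough for the largest position.
fixed_emb = nn.Embedding.from_pretrained(torch.zeros(200 + 1, dim), freeze=True)
out = fixed_emb(positions)
print(out.shape)
```

Whether to enlarge the table or to clamp or truncate the positions depends on the maximum sequence length the model is supposed to support.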

Also, note that torch.FloatTensor(torch.Size([51, 512])) will create a tensor in the specified shape with uninitialized values, which might contain invalid values such as NaNs and Infs, so you should properly initialize sinusoid_table or use a tensor factory method to create it such as torch.randn.
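Since the original initialization script was not posted, the following is only a sketch of one common way to fill such a table: it assumes the standard sinusoidal positional encoding and an even embedding dimension. The function name `build_sinusoid_table` is hypothetical.

```python
import math

import torch


def build_sinusoid_table(num_positions, dim):
    """Build a (num_positions, dim) sinusoidal table; assumes dim is even."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    # Frequencies 1 / 10000^(2i / dim) for each pair of columns.
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    table = torch.zeros(num_positions, dim)  # zeros, not uninitialized memory
    table[:, 0::2] = torch.sin(position * div_term)  # even columns
    table[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return table


sinusoid_table = build_sinusoid_table(200 + 1, 512)
all_finite = bool(torch.isfinite(sinusoid_table).all())
print(sinusoid_table.shape, all_finite)
```

For a quick test that does not need meaningful values, `torch.randn(201, 512)` or `torch.zeros(201, 512)` would also give a fully initialized tensor.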

Thank you, sir.

Do you mean that with PyTorch 1.5.0 the error is still there, and only the check is disabled?
But our project is running well. It’s so strange.

I did initialize it with random values; just for simplicity, I removed the initialization script.

Anyway, thanks for your help. I will try to fix this.

Yes, device assert statements were unfortunately disabled by accident (this should be fixed in 1.5.1 and all newer versions).
I would strongly recommend updating PyTorch and making sure your code is running properly.
Your current code might run into an illegal memory access without reporting it and might thus create major issues.
