[solved] Assertion `srcIndex < srcSelectDimSize` failed on GPU for `torch.cat()`

Hi, I encountered the following assertion error when running my code on GPU (things are fine on CPU):

/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [179,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [179,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [179,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu line=226 error=59 : device-side assert triggered
Traceback (most recent call last):
....
x = torch.cat([y_tm1_embed.squeeze(0), ctx_tm1], 1)
File "torch/autograd/variable.py", line 836, in cat
return Concat(dim)(*iterable)
File "torch/autograd/_functions/tensor.py", line 310, in forward
return torch.cat(inputs, self.dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226

I ran with the flag CUDA_LAUNCH_BLOCKING=1. The full code is a simple attention-based encoder-decoder, with y_tm1_embed being the embedding of the previous word and ctx_tm1 being the previous context vector, which is initialized by:

ctx_tm1 = Variable(torch.zeros(batch_size, self.args.hidden_size * 2), requires_grad=False)
if self.args.cuda:
    ctx_tm1 = ctx_tm1.cuda()

Any suggestions? Thanks!

8 Likes

You are passing an out-of-bounds index somewhere.

Can you reproduce this error with a small snippet?

Usually these device-side asserts are easier to debug if you run the same code on the CPU (i.e. without .cuda()); then you know right away what the out-of-bounds indices are and where they're coming from.
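For example, here is a minimal sketch (with made-up sizes) of how an out-of-bounds embedding lookup surfaces differently on the two devices:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)      # valid input indices are 0..9
idx = torch.tensor([3, 10])    # 10 is out of bounds

out = emb(idx)                 # CPU: a clear "index out of range" error
out = emb.cuda()(idx.cuda())   # GPU: device-side assert, often reported at a later sync point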

27 Likes

Hi smth, thank you for your reply! The weird thing is that when running the same code on the CPU (i.e., without .cuda()), everything is fine. I am trying to reproduce this with a small snippet. May I ask whether there are any other possibilities that could cause this error?

Update: problem solved. It was caused by an out-of-bounds index into the embedding matrix. Thanks for the help!

15 Likes

Hello, may I ask how you found it? I am facing the same problem, but I don't know how the index could be out of bounds. Thank you!

If you are seeing this error with an nn.Embedding layer, you might add a print statement that shows the min and max values of each input batch; some batches might contain an out-of-bounds index.
Once you find the erroneous batch, have a look at how it was created so that you can fix the error.
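A minimal sketch of such a check, assuming a DataLoader called loader that yields (inputs, targets) and an embedding built with num_embeddings rows (both names are placeholders):

for step, (inputs, targets) in enumerate(loader):
    # token indices must lie in [0, num_embeddings - 1]
    if inputs.min() < 0 or inputs.max() >= num_embeddings:
        print(f"bad batch {step}: min={inputs.min().item()}, max={inputs.max().item()}")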

14 Likes

Exactly, that solved my problem. Thank you very much.

How did you solve this problem? I am facing the same issue: it runs well on CPU but does not work on GPU.

I think the PyTorch error messages should be improved here. Having the wrong number of classes for nn.Embedding throws a bunch of C++ errors and returns CUDNN_STATUS_NOT_INITIALIZED on the latest version. It is quite hard to debug this problem given such uninformative error messages.

3 Likes

CUDA errors can sometimes be cryptic, so I generally recommend debugging the code on the CPU, if possible. If that's not possible, I would try to execute the script via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

to get the right line of code which raised the error in the stack trace.
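If editing the launch command is awkward, the same environment variable can also be set at the very top of the script (a sketch; it only takes effect if it is set before the first CUDA call):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must happen before any CUDA work

import torch   # imported after the variable is set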

5 Likes

For others stumbling on this thread: be careful to choose a non-negative index (e.g. 0 instead of -1) when padding sequences that are fed to an embedding layer. Even if you specify the negative index in the embedding constructor, you will still get a runtime error on both CPU and GPU:

import torch
import torch.nn as nn

emb = nn.Embedding(20, 100, padding_idx=-1)               # valid input indices are 0..19
inp = torch.tensor([5, 2, 7, 12, 3])
bad_padding = torch.cat((inp, torch.tensor([-1] * 3)))    # pads with -1
good_padding = torch.cat((inp, torch.tensor([0] * 3)))    # pads with 0
out = emb(good_padding)
out = emb(bad_padding)  # RuntimeError: -1 is not a valid input index
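(As far as I can tell, recent PyTorch versions normalize a negative padding_idx in the constructor to num_embeddings + padding_idx, so in the sketch above emb.padding_idx resolves to 19; if you want that row to act as padding, pad the inputs with emb.padding_idx rather than with -1.)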
2 Likes

It was nice to see an explanation of why trying the code on the CPU is a good first step.

In my case I now see:

“RuntimeError: index out of range: Tried to access index 20000 out of table with 19999 rows. at /opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418”

When running an LSTM defined like this:

class LSTMClassifier(nn.Module):

    """
    USAGE:
        model = LSTMClassifier( HIDDEN_SIZE, INPUT_SIZE, VOCAB_SIZE, N_LSTM_LAYERS )
        model.to( DEVICE )
    """

    # initial setup of the RNN, ..
    # .. given user parameters, notice we have [at least] 3 layers:
    #     1. embedding,
    #     2. encoder [x n_lstm_layers],
    #     3. predictor

    def __init__(self, hidden_size, embedding_dim, vocab_size, n_lstm_layers):
        super(LSTMClassifier, self).__init__()

        # this is the line that triggers: "RuntimeError: index out of range:
        # Tried to access index 20000 out of table with 19999 rows."
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.encoder   = nn.LSTM(input_size=embedding_dim,
                                 hidden_size=hidden_size,
                                 num_layers=n_lstm_layers)
        self.predictor = nn.Linear(hidden_size, 2)  # in_features = hidden_size, out_features = number of output classes

    # this is how the model makes predictions, given an input sequence
    # (during training the output is used to compute losses and backprop)
    def forward(self, seq):
        output, (hidden, _) = self.encoder(self.embedding(seq))
        preds = self.predictor(hidden.squeeze(0))  # squeeze removes size-1 dims from the hidden state
        return preds

There are two places this 20000 is used:

Instantiating the model:

lstm_classifier = LSTMClassifier(hidden_size=150, embedding_dim=300, vocab_size=20000, n_lstm_layers=4)

And when making the vocabulary, in the dataset generation phase:

VOCAB_SIZE = 20000
vocab_size = VOCAB_SIZE  # to restrict the vocabulary, which saves memory
TWEET.build_vocab(train, max_size = vocab_size)

Any help would be greatly appreciated!

:pray:

I’ve come to learn what was going wrong…

When you build a vocabulary using the torchtext.data.Field class:

from torchtext import data
print("Building vocabulary...")
TWEET = data.Field( tokenize="spacy", lower=True ) # https://spacy.io/usage/
vocab_size = 20000  # to restrict the vocabulary, which saves memory
TWEET.build_vocab(train, max_size = vocab_size)

Since I told it that 20000 is the maximum vocab size, I would have expected the largest valid input index to be 19999.

BUT

when you ask for the length of this vocabulary, it is always 2 larger than the maximum vocabulary size you asked it to restrict the vocab to:

In [12]: len(TWEET.vocab)                                                 
Out[12]: 20002

This is because two additional special tokens are added by default: <unk> for unknown words and <pad> for padding.

So you need to tell your classifier this:

self.embedding = nn.Embedding(vocab_size + 2, embedding_dim)
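A slightly more robust variant (a sketch, assuming the TWEET field from the snippet above is in scope) is to size the layer from the vocabulary object itself, so any special tokens are counted automatically:

self.embedding = nn.Embedding(len(TWEET.vocab), embedding_dim)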

Full code for the LSTM class:

import pdb

import torch.nn as nn


class LSTMClassifier(nn.Module):

    """
    USAGE:
        model = LSTMClassifier( HIDDEN_SIZE, INPUT_SIZE, VOCAB_SIZE, N_LSTM_LAYERS )
        model.to( DEVICE )
    """

    # initial setup of the RNN, ..
    # .. given user parameters, notice we have [at least] 3 layers:
    #     1. embedding,
    #     2. encoder [x n_lstm_layers],
    #     3. predictor

    def __init__(self, hidden_size, embedding_dim, vocab_size, n_lstm_layers):
        super(LSTMClassifier, self).__init__()

        # +2 accounts for the <unk> and <pad> tokens that torchtext adds to the vocabulary
        self.embedding = nn.Embedding(vocab_size + 2, embedding_dim)
        self.encoder   = nn.LSTM(input_size=embedding_dim,
                                 hidden_size=hidden_size,
                                 num_layers=n_lstm_layers)
        self.predictor = nn.Linear(hidden_size, 2)  # in_features = hidden_size, out_features = number of output classes

    # this is how the model makes predictions, given an input sequence
    # (during training the output is used to compute losses and backprop)
    def forward(self, seq):
        try:
            output, (hidden, _) = self.encoder(self.embedding(seq))
            preds = self.predictor(hidden.squeeze(0))  # squeeze removes size-1 dims from the hidden state
        except RuntimeError:
            # debug aid: show the largest index in each row, then drop into the
            # debugger and re-raise instead of returning an undefined `preds`
            print([seq[i].max().item() for i in range(seq.shape[0])])
            pdb.set_trace()
            raise
        return preds
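For completeness, instantiating it with keyword arguments that match the constructor (the values are the ones used earlier in the thread, and DEVICE is assumed to be defined):

model = LSTMClassifier(hidden_size=150, embedding_dim=300, vocab_size=20000, n_lstm_layers=4)
model.to(DEVICE)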
2 Likes

I ran into the same error when using a model from Hugging Face Transformers (BertModel); the code runs fine on CPU:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [145,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "run.py", line 58, in
cmd(args)
File "/project/piqasso/tools/biaffine-parser/parser/cmds/train.py", line 82, in __call__
self.train(train.loader)
File "/project/piqasso/tools/biaffine-parser/parser/cmds/cmd.py", line 83, in train
arc_scores, rel_scores = self.model(words, feats)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/project/piqasso/tools/biaffine-parser/parser/model.py", line 90, in forward
feat_embed = self.feat_embed(*feats)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/project/piqasso/tools/biaffine-parser/parser/modules/bert.py", line 43, in forward
bert = bert[bert_mask].split(bert_lens[mask].tolist())
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

How can I figure out what the culprit is?

1 Like

I got the same error as @attardi when running the 'bert-large-cased' model from Hugging Face on the GPU. Is anybody aware of a solution to this problem?

By any chance, did you find out what was causing the problem?

Possibly related to https://github.com/pytorch/pytorch/issues/46020

It seems that some input instances were exceeding the maximum number of wordpiece embeddings that BERT can handle. What I did was simply check the dimensions of the input batches and pass to the model only those that did not exceed that limit.
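A rough sketch of that kind of length guard, assuming a Hugging Face BertModel and a loader that yields dicts of tensors (both names are placeholders); the limit itself comes from the model config:

max_len = model.config.max_position_embeddings   # 512 for standard BERT
for batch in loader:
    if batch["input_ids"].size(1) > max_len:
        continue                                 # skip (or truncate) over-long inputs
    outputs = model(**batch)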

2 Likes

I've been having this issue with RoBERTa. The problem was that the max_position_embeddings parameter must be larger than max_seq_length; otherwise the position embedding can generate indices greater than max_position_embeddings.
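As a sanity check when configuring the model from scratch (a sketch using the Hugging Face RobertaConfig; the +2 reflects that RoBERTa's position ids start after the padding index, so the usable sequence length is a bit shorter than max_position_embeddings):

from transformers import RobertaConfig, RobertaModel

max_seq_length = 512   # assumed training setting
config = RobertaConfig(max_position_embeddings=max_seq_length + 2)
assert config.max_position_embeddings > max_seq_length
model = RobertaModel(config)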

2 Likes

It could also be due to some issue with the vocab file, as it was in my case.