I appear to be getting a device-side assert from an embedding lookup, but even with CUDA_LAUNCH_BLOCKING enabled the exception is only raised at the next instruction that touches CUDA, several Python statements later. In my case that is a call to torch.zeros(..., device='cuda').
entailed_embeds_calculated = [self.relation_embedding(index_tensor) for index_tensor in entailed_pred_indices]
# here self.relation_embedding is an nn.Embedding, and entailed_pred_indices is a list of index tensors.
# I checked, and an element of an index_tensor is out of bounds.
entailed_embeds_calculated = self.aggregator(entailed_embeds_calculated, entailed_scores)
# entailed_scores is also a list of tensors
...
def aggregator(embedding_lists, weights=None):
    # a few asserts and other non-CUDA python code
    embed_dim = embedding_lists.shape[-1]
    zero_tensor = torch.zeros((embed_dim,), dtype=torch.float, device=embedding_lists.device)
    # At this point the "Device side assert" is raised.
A similar behaviour appears to be happening here.
The CUDA error looks like this:
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/THCTensorIndex.cu:307: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  File ".../RelationScorers.py", line 1072, in aggregator
    zero_tensor = torch.zeros((embed_dim,), dtype=torch.float, device=embedding_lists.device)
RuntimeError: CUDA error: device-side assert triggered
...
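A host-side bounds check run before the lookup can confirm which index is out of range, so the problem is reported at the offending line and not at a later CUDA call. The following is only a minimal sketch, not the code from the model above: `find_out_of_range` and `_flatten` are hypothetical helper names, and the sketch assumes each index tensor can be converted with `.tolist()` (plain nested lists are accepted too, which is what the example uses so that it runs without a GPU).

```python
def _flatten(values):
    """Yield ints from an arbitrarily nested list/tuple of ints."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from _flatten(v)
        else:
            yield int(v)


def find_out_of_range(index_tensors, num_embeddings):
    """Return (list_position, bad_index) pairs for every index
    outside the valid range [0, num_embeddings)."""
    bad = []
    for pos, idx in enumerate(index_tensors):
        # Assumption: tensors expose .tolist(); plain lists are used as-is.
        values = idx.tolist() if hasattr(idx, "tolist") else idx
        if not isinstance(values, (list, tuple)):
            values = [values]  # a 0-dim tensor converts to a bare scalar
        for v in _flatten(values):
            if not 0 <= v < num_embeddings:
                bad.append((pos, v))
    return bad


# Plain lists stand in for the index tensors in this example.
example_indices = [[0, 3, 2], [1, 7], [4]]
problems = find_out_of_range(example_indices, num_embeddings=5)
print(problems)  # [(1, 7)]
```

In the snippet above this would presumably be called as `find_out_of_range(entailed_pred_indices, self.relation_embedding.num_embeddings)` just before the list comprehension. Note that `.tolist()` copies the data to the host, so this is a debugging aid and not something to leave in a training loop.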