Device-side assert from embedding lookup raised on a subsequent CUDA instruction

I appear to be getting a device-side assert from an embedding lookup, but (even with CUDA_LAUNCH_BLOCKING enabled) the exception is raised by the next instruction that touches CUDA (in my case, a call to torch.zeros(..., device='cuda')), several Python instructions later.

entailed_embeds_calculated = [self.relation_embedding(index_tensor) for index_tensor in entailed_pred_indices]
# Here self.relation_embedding is an nn.Embedding, and entailed_pred_indices is a list of index tensors.
# I checked, and an element of one index_tensor is out of bounds.

entailed_embeds_calculated = self.aggregator(entailed_embeds_calculated, entailed_scores)
# entailed_scores is also a list of tensors


def aggregator(self, embedding_lists, weights=None):
    # a few asserts and other non-CUDA python code
    embed_dim = embedding_lists[0].shape[-1]
    zero_tensor = torch.zeros((embed_dim,), dtype=torch.float, device=embedding_lists[0].device)
    # At this point the "Device side assert" is raised.

A similar behaviour appears to be happening in the linked thread.

The CUDA error looks like this:

/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/THC/ void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

  File ".../", line 1072, in aggregator
    zero_tensor = torch.zeros((embed_dim,), dtype=torch.float, device=embedding_lists[0].device)
RuntimeError: CUDA error: device-side assert triggered
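To make the error point at the actual lookup rather than a later op, I now validate the indices on the host before calling the embedding. A minimal sketch (checked_embedding is my own helper, not part of PyTorch):

```python
import torch
import torch.nn as nn

def checked_embedding(emb: nn.Embedding, idx: torch.Tensor) -> torch.Tensor:
    # Validate indices before launching the CUDA kernel, so an out-of-range
    # index raises synchronously here instead of as a deferred device assert.
    bad = (idx < 0) | (idx >= emb.num_embeddings)
    if bad.any():
        raise IndexError(f"out-of-range embedding indices: {idx[bad].tolist()}")
    return emb(idx)

emb = nn.Embedding(10, 4)
out = checked_embedding(emb, torch.tensor([0, 3, 9]))
print(out.shape)  # torch.Size([3, 4])
```

The `bad.any()` call forces a host sync when `idx` lives on the GPU, so this is a debugging aid rather than something to leave in a hot loop.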

Are you setting the CUDA_LAUNCH_BLOCKING=1 env variable in your terminal, or are you trying to set it in your script/notebook?
In the latter case, please set it as an external env variable, as setting it via os.environ inside a script will fail if any imported library has already initialized CUDA.
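For reference, a sketch of the in-script variant that can still work, assuming nothing has initialized CUDA yet (the ordering is the whole point):

```python
import os

# CUDA_LAUNCH_BLOCKING is read when CUDA initializes, so it must already be
# in the environment before the first CUDA-touching import. Exporting it in
# the shell (CUDA_LAUNCH_BLOCKING=1 python train.py) sidesteps the ordering
# problem entirely and is the safer option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Any `import torch` must come only after the line above.
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # prints "1"
```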

I set it before launching Python (i.e., in the shell environment) and printed the setting from within Python just to be sure.

I do see a difference when it’s not set: the exception is then raised a few instructions later still, so the variable does seem to be doing something. With CUDA_LAUNCH_BLOCKING set, the exception was consistently raised at the same point across several runs and under differing GPU loads.

I’m running PyTorch 1.4.0 (old, for compatibility with HPC infrastructure) and CUDA Version 10.2.89. These aren’t the most recent, but I’m posting this partly to help others who may encounter this odd behaviour decipher it :slight_smile:

Could you update to the latest stable release and recheck this behavior?
In your old version, the device assert might have been missing, which would explain this behavior.
In particular, this was the case in 1.5.0, where device assertions were globally disabled (and re-enabled in 1.5.1).
I don’t recall if there were similar issues in 1.4.0.