Assertion `srcIndex < srcSelectDimSize` failed on Transformer Decoder

Hi, I built a fairseq transformer model and encountered the following assertion error when running my code on the GPU. According to the traceback, it seems to happen in the TransformerDecoder's extract_features_scriptable, in the lines below.

if self.cross_self_attention or prev_output_tokens.eq(self.padding_idx).any():
    self_attn_padding_mask = prev_output_tokens.eq(self.padding_idx)  # this line

The error did not occur at the very beginning of training, but only at around batch 800 / 10000.

Could anyone give some advice on how to fix it or how to catch the assertion error so that I can analyze the error in detail? Thanks!

To get a better stacktrace, which should point to the actually failing operation, you could rerun your code with CUDA_LAUNCH_BLOCKING=1 python script.py args or run it on the CPU (which might be too slow depending on the actual workload).
If you are sure the “Transformer Decoder” module raised this error, then check all indexing operations there and make sure the indexing tensors contain valid indices.
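
This device-side assert is usually raised by an embedding lookup, gather, or index_select that receives an out-of-range index, so a quick sanity check right before the suspicious call could look like this (a minimal sketch with made-up names and sizes, not your actual model):

import torch

# hypothetical setup: token indices feeding an embedding with vocab_size entries
vocab_size = 1000
tokens = torch.randint(0, vocab_size, (3, 333))

# verify every index is inside the valid range before it reaches the CUDA kernel
assert tokens.min() >= 0 and tokens.max() < vocab_size, (
    f"invalid indices: min={tokens.min().item()}, max={tokens.max().item()}, "
    f"valid range is [0, {vocab_size - 1}]"
)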

Thanks for replying!
Using the CPU is too slow, and because the error did not occur at the very beginning, there has been no result for a long time :frowning:
I also wonder how the eq operation can produce an index assertion error like the one shown above?

Did you rerun the code with the blocking launches and was the eq operation still failing?
If not, then note that the stacktrace might be wrong due to the asynchronous execution of the CUDA kernels and you would need to rerun the script as mentioned before.
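
If prefixing the command is inconvenient, the environment variable can also be set at the very top of the script, before torch initializes the CUDA context (just a sketch; everything after the import would be your own code):

import os

# must be set before the first CUDA call, i.e. before the CUDA context is created
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch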

Thanks. I reproduced the error on the CPU, and it occurred inside fairseq.
Here I use the TransformerEncoder. src_tokens.shape is [3, 333] and src_select_index is tensor([0, 1, 2]).

encoder_out = self.encoder(
    src_tokens,
    src_lengths=src_lengths,
    token_embeddings=token_embeddings,
    return_all_hiddens=return_all_hiddens,
)

But a strange shape mismatch occurred in encoder_out: encoder_out["encoder_out"][0].shape is [333, 3, 1024], which is correct, but encoder_out["encoder_states"][0].shape is [333, 2, 1024], i.e. the batch size in dim 1 is different.
So when I call the reorder_encoder_out function, a shape error occurs.

facet_encoder_out = self.encoder.reorder_encoder_out(encoder_out, src_select_index)
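
This is roughly the check I used right after the encoder call to confirm the mismatch (just a sketch using the keys of the returned dict; each tensor is laid out as [seq_len, batch, hidden]):

expected_bsz = src_tokens.size(0)  # 3 in my case

print(encoder_out["encoder_out"][0].shape)  # torch.Size([333, 3, 1024]) -> correct

# every per-layer state in "encoder_states" should have the same batch size,
# but the entries come back with batch size 2 instead of 3
for i, state in enumerate(encoder_out["encoder_states"]):
    assert state.size(1) == expected_bsz, (
        f"encoder_states[{i}] has batch size {state.size(1)}, expected {expected_bsz}"
    )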

Do you have any suggestions about it? I really cannot figure out why it occurs :frowning:

I’m unsure which module you are using exactly, as nn.TransformerEncoder doesn’t seem to return a dict of outputs:

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
src = torch.rand(10, 32, 512)  # (seq_len, batch, d_model)
out = transformer_encoder(src)
print(type(out))
# <class 'torch.Tensor'>