CodeT5 fails with a CUDA error

I’m trying to reproduce the CodeT5 fine-tuning results (GitHub: salesforce/CodeT5, a code-aware pre-trained encoder-decoder model).
The script being used is:

python3 /home/ubuntu/CodeT5/run_gen.py \
--task summarize \
--sub_task python \
--summary_dir /home/ubuntu/CodeT5/summary \
--cache_path /home/ubuntu/CodeT5/cache \
--data_dir /home/ubuntu/CodeT5/data \
--res_dir /home/ubuntu/CodeT5/res \
--output_dir /home/ubuntu/CodeT5/output \
--save_last_checkpoints \
--always_save_model \
--do_eval_bleu \
--model_name_or_path='Salesforce/codet5-base-multi-sum' \
--tokenizer_name='Salesforce/codet5-base-multi-sum' \
--train_filename /home/ubuntu/CodeT5/data/summarize/python/train.jsonl \
--dev_filename /home/ubuntu/CodeT5/data/summarize/python/valid.jsonl \
--test_filename /home/ubuntu/CodeT5/data/summarize/python/test.jsonl \
--do_train \
--do_eval \
--do_test \
--save_steps=500 \
--log_steps=100 \
--local_rank=-1

Running it leads to the following error:

Traceback (most recent call last):
  File "/home/ubuntu/CodeT5/run_gen.py", line 387, in <module>
    main()
  File "/home/ubuntu/CodeT5/run_gen.py", line 234, in main
    outputs = model(input_ids=source_ids, attention_mask=source_mask,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1561, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 998, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 639, in forward
    self_attention_outputs = self.layer[0](
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 546, in forward
    attention_output = self.SelfAttention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 483, in forward
    scores = torch.matmul(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

My guess is that something is wrong with the source_ids, but I haven’t been able to pin it down.

Update: a quick check shows that source_ids has shape torch.Size([8, 64]), and target_ids has the same shape, torch.Size([8, 64]).
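For what it’s worth, the shapes alone don’t rule anything out: the embedding lookup only cares that every id lies in the valid range. A minimal sketch of the sanity check (not run_gen.py itself; 32100 is assumed here as the CodeT5 vocabulary size, the real value is model.config.vocab_size):

```python
import torch

# Assumed vocabulary size for illustration; check model.config.vocab_size.
vocab_size = 32100

# Dummy batch mirroring the reported shapes: 8 sequences of length 64.
source_ids = torch.randint(0, vocab_size, (8, 64))

# The shape is irrelevant to this error; what matters is the id range.
print(source_ids.shape)                              # torch.Size([8, 64])
print(int(source_ids.min()), int(source_ids.max()))  # must stay in [0, vocab_size)
assert source_ids.min() >= 0 and source_ids.max() < vocab_size
```

Running the same two-line check on the real source_ids right before the model call would confirm or refute the guess immediately.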

Is the script working fine on the CPU with the same setup (i.e. the same input shapes etc.)?
If so, rerun the code with CUDA_LAUNCH_BLOCKING=1 python script.py args and check the stacktrace again, as it should then point to the operation that is actually failing.

It doesn’t. When run on the CPU with an extra --no_cuda flag, it produces this error:

Traceback (most recent call last):
  File "/home/ubuntu/CodeT5/run_gen.py", line 394, in <module>
    main()
  File "/home/ubuntu/CodeT5/run_gen.py", line 241, in main
    outputs = model(input_ids=source_ids, attention_mask=source_mask,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1561, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 898, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2044, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

To me this further indicates that something is wrong with the inputs to the T5 model, most likely the source_ids.

Yes, your guess seems to be correct. I would check the inputs to torch.embedding and make sure every index lies inside the valid range [0, vocab_size).
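The CPU failure can be reproduced in isolation. A self-contained sketch with a plain nn.Embedding (not the T5 model itself, and with made-up sizes) showing how a single out-of-range id produces exactly this IndexError, plus the check that catches it:

```python
import torch
import torch.nn as nn

# Toy embedding: valid ids are 0..99 inclusive.
emb = nn.Embedding(num_embeddings=100, embedding_dim=8)

good = torch.tensor([[1, 2, 3]])
bad = torch.tensor([[1, 2, 100]])  # 100 is out of range for 100 embeddings

out = emb(good)  # works, shape (1, 3, 8)

try:
    emb(bad)
except IndexError as e:
    print(e)  # "index out of range in self", as in the CPU traceback above

# The check that would have caught it before the forward pass:
assert bad.min() < 0 or bad.max() >= emb.num_embeddings
```

On the GPU the same bad lookup surfaces as the asynchronous device-side assert from the first traceback, which is why the CPU run gives the more readable error. A common root cause in this kind of setup is a mismatch between the tokenizer and the model checkpoint, so ids produced by the tokenizer exceed the model’s embedding table.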