IndexError: index out of range in self in training LayoutLM

I was trying to train LayoutLM v1 by running the run_seq_labeling.py script provided with LayoutLM, using the following arguments:

! python run_seq_labeling.py \
                            --data_dir "{dataset_dir}" \
                            --labels "{label_file}" \
                            --model_name_or_path "{pretrained_model_folder_input}" \
                            --model_type layoutlm \
                            --max_seq_length 512 \
                            --do_lower_case \
                            --do_train \
                            --num_train_epochs 1 \
                            --logging_steps 50 \
                            --save_steps -1 \
                            --output_dir output \
                            --overwrite_output_dir \
                            --per_gpu_train_batch_size 1 \
                            --per_gpu_eval_batch_size 1

However, I encountered this error:

Epoch:   0%|                                              | 0/1 [00:00<?, ?it/s]
Iteration:   0%|                                        | 0/560 [00:00<?, ?it/s]/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:155: UserWarning: This overload of add_ is deprecated:
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729141890/work/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)

Iteration:   0%|                                | 1/560 [00:03<34:49,  3.74s/it]
Iteration:   0%|                                | 2/560 [00:07<33:22,  3.59s/it]
Iteration:   1%|▏                               | 3/560 [00:10<32:41,  3.52s/it]
Iteration:   1%|▏                               | 4/560 [00:13<31:29,  3.40s/it]
Iteration:   1%|▎                               | 5/560 [00:17<30:40,  3.32s/it]
Iteration:   1%|▎                               | 6/560 [00:20<29:58,  3.25s/it]
Iteration:   1%|▍                               | 7/560 [00:23<29:33,  3.21s/it]
Iteration:   1%|▍                               | 8/560 [00:26<30:55,  3.36s/it]
Iteration:   2%|▌                               | 9/560 [00:30<30:21,  3.30s/it]
Iteration:   2%|▌                              | 10/560 [00:33<29:46,  3.25s/it]
Iteration:   2%|▌                              | 11/560 [00:36<29:11,  3.19s/it]
Iteration:   2%|▋                              | 12/560 [00:39<28:53,  3.16s/it]
Iteration:   2%|▋                              | 13/560 [00:42<28:54,  3.17s/it]
Iteration:   2%|▊                              | 14/560 [00:45<29:45,  3.27s/it]
Epoch:   0%|                                              | 0/1 [00:45<?, ?it/s]
Traceback (most recent call last):
  File "run_seq_labeling.py", line 811, in <module>
    main()
  File "run_seq_labeling.py", line 704, in main
    args, train_dataset, model, tokenizer, labels, pad_token_label_id
  File "run_seq_labeling.py", line 219, in train
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/layoutlm/modeling/layoutlm.py", line 217, in forward
    head_mask=head_mask,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/layoutlm/modeling/layoutlm.py", line 171, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/layoutlm/modeling/layoutlm.py", line 82, in forward
    bbox[:, :, 2] - bbox[:, :, 0]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

How can I fix this error, or how can I find out what is causing it?

Based on the error message, an nn.Embedding layer raises the error, so make sure the inputs contain indices in the range [0, num_embeddings - 1].
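For reference, an nn.Embedding with num_embeddings rows only accepts indices in [0, num_embeddings - 1]. Below is a minimal per-batch check you could run just before the forward pass; it is only a sketch, assuming the standard LayoutLM v1 config fields (vocab_size, type_vocab_size, max_position_embeddings, max_2d_position_embeddings) and the inputs dict built in run_seq_labeling.py. The width and height terms mirror the bbox[:, :, 2] - bbox[:, :, 0] lookup shown in your traceback.

import torch

# Hypothetical helper: compare every index tensor against the size of the
# embedding table it feeds; any line it prints names a table that would
# raise "index out of range in self"
def check_embedding_inputs(inputs, config):
    seq_len = inputs["input_ids"].size(1)
    checks = {
        "input_ids": (inputs["input_ids"], config.vocab_size),
        "token_type_ids": (inputs["token_type_ids"], config.type_vocab_size),
        "position_ids": (torch.arange(seq_len), config.max_position_embeddings),
        "bbox coordinates": (inputs["bbox"], config.max_2d_position_embeddings),
        # LayoutLM also embeds box width and height, so these must be
        # non-negative and below max_2d_position_embeddings as well
        "bbox width": (inputs["bbox"][:, :, 2] - inputs["bbox"][:, :, 0],
                       config.max_2d_position_embeddings),
        "bbox height": (inputs["bbox"][:, :, 3] - inputs["bbox"][:, :, 1],
                        config.max_2d_position_embeddings),
    }
    for name, (idx, num_embeddings) in checks.items():
        if idx.min().item() < 0 or idx.max().item() >= num_embeddings:
            print(f"{name}: min={idx.min().item()} max={idx.max().item()}, "
                  f"valid range is [0, {num_embeddings - 1}]")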

I tried running it on a T4 GPU and also added some print statements:

 Iteration:  25% 2/8 [00:01<00:03,  1.84it/s]
Input IDs - Min: tensor(0, device='cuda:0') Max: tensor(28522, device='cuda:0')
Attention Mask - Min: tensor(0, device='cuda:0') Max: tensor(1, device='cuda:0')
Token Type IDs - Min: tensor(0, device='cuda:0') Max: tensor(0, device='cuda:0')
BBox - Min: tensor(0, device='cuda:0') Max: tensor(1000, device='cuda:0')

Iteration:  38% 3/8 [00:01<00:01,  2.60it/s]
Input IDs - Min: tensor(0, device='cuda:0') Max: tensor(29656, device='cuda:0')
Attention Mask - Min: tensor(0, device='cuda:0') Max: tensor(1, device='cuda:0')
Token Type IDs - Min: tensor(0, device='cuda:0') Max: tensor(0, device='cuda:0')
BBox - Min: tensor(0, device='cuda:0') Max: tensor(1000, device='cuda:0')
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [279,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [279,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [279,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

...

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [274,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [274,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Iteration:  38% 3/8 [00:01<00:02,  1.96it/s]
Epoch:   0% 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "run_seq_labeling.py", line 821, in <module>
    main()
  File "run_seq_labeling.py", line 714, in main
    args, train_dataset, model, tokenizer, labels, pad_token_label_id
  File "run_seq_labeling.py", line 229, in train
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/layoutlm/modeling/layoutlm.py", line 224, in forward
    head_mask=head_mask,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/layoutlm/modeling/layoutlm.py", line 178, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/layoutlm/modeling/layoutlm.py", line 105, in forward
    embeddings = self.dropout(embeddings)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/dropout.py", line 59, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that's current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.

Now I'm not sure what is causing the error.

An indexing kernel is still failing, and it still seems to be raised by the embedding layer. Device-side asserts are reported asynchronously, so the Python traceback can point at an op that merely runs afterwards (here the dropout); the `Indexing.cu ... srcIndex < srcSelectDimSize` assertions are the actual failure.
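If you want the Python traceback to point at the failing lookup instead of an unrelated later op, one option (a sketch; this is not something run_seq_labeling.py sets for you) is to force synchronous kernel launches, or simply run the same batch on CPU, where the failure surfaces as a plain IndexError at the offending call:

import os

# Must be set before CUDA is initialized, e.g. at the very top of
# run_seq_labeling.py or in the environment of the launching shell; with
# blocking launches the device-side assert is raised at the op that caused it
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"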

Does this mean the problem lies in my inputs?

I have added this check with a print statement:

# Only flags tensors that contain negative values; it does not catch
# indices that are too large for their embedding table
for key, value in inputs.items():
    if len(value) > 0 and value.min() < 0:
        print(f"Warning: {key} index out of range: {value.min()}")
        continue

This is the output:

Iteration:   0% 1/280 [00:01<07:19,  1.58s/it]Attention Mask - Min: tensor(0, device='cuda:0') Max: tensor(1, device='cuda:0')

Token Type IDs - Min: tensor(0, device='cuda:0') Max: tensor(0, device='cuda:0')

BBox - Min: tensor(0, device='cuda:0') Max: tensor(1000, device='cuda:0')

Warning: labels index out of range: -100

Position embeddings size: torch.Size([1024, 768])

Maximum index value: tensor(28522, device='cuda:0')
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [171,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [171,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [171,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [171,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Does this mean the problem is in the label indices? If so, how can I fix it? Sorry, I'm a newbie.

Yes, the indexing tensor contains values that are out of bounds, which causes the indexing error.
In particular, your labels tensor contains an invalid -100 value, so clip the input tensor to [0, num_embeddings - 1].
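A minimal sketch of that clamping; num_labels is a placeholder for the size of your label list, not a variable taken from run_seq_labeling.py:

import torch

num_labels = 7                                   # illustrative value; use the length of your label list
labels = torch.tensor([[-100, 0, 5, -100, 12]])  # example label ids containing the -100 padding value
clamped = torch.clamp(labels, min=0, max=num_labels - 1)
print(clamped)                                   # tensor([[0, 0, 5, 0, 6]])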

I have added code that checks for out-of-range indices and, if the tensor is the "labels" tensor, clamps it so its values fall within a valid range.

# snippet of the training loop in run_seq_labeling.py
        for step, batch in enumerate(epoch_iterator):
            model.train()

            # Skip empty batches (note: this checks the batch size, not index ranges)
            if batch[0].size(0) <= 0:
                continue

            try:
                inputs = {
                    "input_ids": batch[0].to(args.device),
                    "attention_mask": batch[1].to(args.device),
                    "labels": batch[3].to(args.device),
                }
                if args.model_type in ["layoutlm"]:
                    inputs["bbox"] = batch[4].to(args.device)
                inputs["token_type_ids"] = (
                    batch[2].to(args.device) if args.model_type in ["bert", "layoutlm"] else None
                )  # RoBERTa doesn't use segment_ids

                # Print the minimum and maximum values of each input
                print("\nInput IDs - \nMin:", torch.min(inputs["input_ids"]), "\nMax:", torch.max(inputs["input_ids"]))
                print("Attention Mask - Min:", torch.min(inputs["attention_mask"]), "Max:", torch.max(inputs["attention_mask"]))
                if inputs.get("token_type_ids") is not None:
                    print("Token Type IDs - Min:", torch.min(inputs["token_type_ids"]), "Max:", torch.max(inputs["token_type_ids"]))
                if "bbox" in inputs:
                    print("BBox - Min:", torch.min(inputs["bbox"]), "Max:", torch.max(inputs["bbox"]))

                # Flag tensors containing negative indices and clamp the labels into [0, len(labels) - 1]
                for key, value in inputs.items():
                    if value is not None and len(value) > 0 and value.min() < 0:
                        print(f"Warning: {key} index out of range: {value.min()}")
                        if key == "labels":
                            inputs[key] = torch.clamp(value, min=0, max=len(labels) - 1)

            except RuntimeError as e:
                if any(error_msg in str(e) for error_msg in ["indexSelectLargeIndex", "Assertion `srcIndex < srcSelectDimSize` failed."]):
                    print("Error: Assertion failed in CUDA indexing. Skipping batch.")
                else:
                    # Handle other runtime errors
                    print("Error:", e)

However, the same error still occurs:

Epoch:   0% 0/1 [00:00<?, ?it/s]
Iteration:   0% 0/70 [00:00<?, ?it/s]
Input IDs - 
Min: tensor(0, device='cuda:0') 
Max: tensor(29656, device='cuda:0')
Attention Mask - Min: tensor(0, device='cuda:0') Max: tensor(1, device='cuda:0')
Token Type IDs - Min: tensor(0, device='cuda:0') Max: tensor(0, device='cuda:0')
BBox - Min: tensor(0, device='cuda:0') Max: tensor(1000, device='cuda:0')
Warning: labels index out of range: -100
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [66,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [66,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

...

../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [232,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [232,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Iteration:   0% 0/70 [00:00<?, ?it/s]
Epoch:   0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_seq_labeling.py", line 846, in <module>
    main()
  File "run_seq_labeling.py", line 739, in main
    args, train_dataset, model, tokenizer, labels, pad_token_label_id
  File "run_seq_labeling.py", line 254, in train
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/layoutlm/modeling/layoutlm.py", line 224, in forward
    head_mask=head_mask,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/layoutlm/modeling/layoutlm.py", line 178, in forward
    input_ids, bbox, position_ids=position_ids, token_type_ids=token_type_ids
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/layoutlm/modeling/layoutlm.py", line 76, in forward
    upper_position_embeddings = self.y_position_embeddings(bbox[:, :, 1])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

It seems your labels tensor still contains invalid indices, so you would need to debug why that is the case and whether this value is set after the clamp operation was performed.
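One way to verify that is to assert on the tensor that is actually passed to the model, immediately before outputs = model(**inputs) in train(); this is a sketch that reuses the labels list from your snippet above:

# Hypothetical check: fails loudly if any label id is still outside
# [0, len(labels) - 1] at the point where the forward pass runs
lbl = inputs["labels"]
assert lbl.min().item() >= 0 and lbl.max().item() < len(labels), (
    f"labels out of range: min={lbl.min().item()}, max={lbl.max().item()}"
)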