RuntimeError: CUDA error: device-side assert triggered on finetuning LayoutLMv3

Hello. I was training my dataset on the LayoutLMv3 model when this error occurred.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[32], line 2
      1 # Initialize our Trainer
----> 2 trainer = Trainer(
      3     model=model,
      4     args=training_args,
      5     train_dataset=train_dataset,
      6     eval_dataset=eval_dataset,
      7     tokenizer=processor,
      8     data_collator=default_data_collator,
      9     compute_metrics=compute_metrics,
     10 )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:337, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    335 self.args = args
    336 # Seed must be set before instantiating the model when using model
--> 337 enable_full_determinism(self.args.seed) if self.args.full_determinism else set_seed(self.args.seed)
    338 self.hp_name = None
    339 self.deepspeed = None

File /opt/conda/lib/python3.10/site-packages/transformers/trainer_utils.py:95, in set_seed(seed)
     93 np.random.seed(seed)
     94 if is_torch_available():
---> 95     torch.manual_seed(seed)
     96     torch.cuda.manual_seed_all(seed)
     97     # ^^ safe to call this function even if cuda is not available

File /opt/conda/lib/python3.10/site-packages/torch/random.py:40, in manual_seed(seed)
     37 import torch.cuda
     39 if not torch.cuda._is_in_bad_fork():
---> 40     torch.cuda.manual_seed_all(seed)
     42 import torch.mps
     43 if not torch.mps._is_in_bad_fork():

File /opt/conda/lib/python3.10/site-packages/torch/cuda/random.py:113, in manual_seed_all(seed)
    110         default_generator = torch.cuda.default_generators[i]
    111         default_generator.manual_seed(seed)
--> 113 _lazy_call(cb, seed_all=True)

File /opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:183, in _lazy_call(callable, **kwargs)
    181 def _lazy_call(callable, **kwargs):
    182     if is_initialized():
--> 183         callable()
    184     else:
    185         # TODO(torch_deploy): this accesses linecache, which attempts to read the
    186         # file system to get traceback info. Patch linecache or do something
    187         # else here if this ends up being important.
    188         global _lazy_seed_tracker

File /opt/conda/lib/python3.10/site-packages/torch/cuda/random.py:111, in manual_seed_all.<locals>.cb()
    109 for i in range(device_count()):
    110     default_generator = torch.cuda.default_generators[i]
--> 111     default_generator.manual_seed(seed)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

My guess is that there is something wrong with the dataset: I first tried using only 10 images and training ran without errors, but when I tried 700+ images, it returned this error.

This is my trainer.

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)

Rerun your code with the environment variable set, e.g. CUDA_LAUNCH_BLOCKING=1 python setup.py args, and check the stack trace to isolate which operation fails.
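If you are running in a notebook (the traceback shows Cell In[32]), a possible alternative is to set the variable from Python in the very first cell, before anything touches CUDA. A small sketch, assuming CUDA has not been initialized yet:

import os

# Must be set before torch initializes CUDA, otherwise it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"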

I have not been able to run it with CUDA_LAUNCH_BLOCKING=1 yet, but I tried running it on CPU. This is the result.

File /opt/conda/lib/python3.10/site-packages/transformers/models/layoutlmv3/modeling_layoutlmv3.py:269, in LayoutLMv3TextEmbeddings.calculate_spatial_position_embeddings(self, bbox)
    267     lower_position_embeddings = self.y_position_embeddings(bbox[:, :, 3])
    268 except IndexError as e:
--> 269     raise IndexError("The `bbox` coordinate values should be within 0-1000 range.") from e
    271 h_position_embeddings = self.h_position_embeddings(torch.clip(bbox[:, :, 3] - bbox[:, :, 1], 0, 1023))
    272 w_position_embeddings = self.w_position_embeddings(torch.clip(bbox[:, :, 2] - bbox[:, :, 0], 0, 1023))

IndexError: The `bbox` coordinate values should be within 0-1000 range.

If my understanding is correct, the bbox for the token STORE#2798, [14, 1010, 84, 1058], is invalid and is causing the error, since 1010 and 1058 exceed 1000?
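One way to confirm this is to scan the dataset for coordinates outside the 0-1000 range before encoding. A rough sketch; the field names "words" and "bboxes" and the variable raw_train_dataset are assumptions about your preprocessing, so adjust them to your own schema:

def find_bad_boxes(dataset):
    # Report every word whose box has a coordinate outside LayoutLMv3's 0-1000 range.
    for idx, example in enumerate(dataset):
        for word, box in zip(example["words"], example["bboxes"]):
            if any(c < 0 or c > 1000 for c in box):
                print(f"example {idx}: word {word!r} has out-of-range box {box}")

find_bad_boxes(raw_train_dataset)  # would flag e.g. STORE#2798 with [14, 1010, 84, 1058]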

Yes, your explanation sounds reasonable. LayoutLMv3 expects bbox coordinates normalized to the 0-1000 range, so a value like 1058 falls outside the spatial position-embedding table and can trigger the device-side assert on GPU, which is the same problem that surfaces as the IndexError on CPU.
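If your boxes are in pixel coordinates, one possible fix is to rescale and clamp them into the 0-1000 range before passing them to the processor. A minimal sketch; normalize_box is a hypothetical helper and it assumes you know each page's width and height:

def normalize_box(box, width, height):
    # Rescale a pixel-space box [x0, y0, x1, y1] into LayoutLMv3's 0-1000 range,
    # clamping so rounding or slightly oversized boxes cannot overflow.
    x0, y0, x1, y1 = box
    return [
        min(max(int(1000 * x0 / width), 0), 1000),
        min(max(int(1000 * y0 / height), 0), 1000),
        min(max(int(1000 * x1 / width), 0), 1000),
        min(max(int(1000 * y1 / height), 0), 1000),
    ]

If the boxes are already meant to be normalized and only overflow slightly (as 1010 and 1058 suggest), simply clamping each coordinate with min(max(c, 0), 1000) before encoding should also avoid the assert.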