Training suddenly stops with IndexError: scatter_(): Expected dtype int64 for index

It works pretty well until the 23rd epoch; not sure what is happening here.

--training batch:  39%|███▉      | 999/2567 [03:57<08:04,  3.23it/s]epoch: 21, iteration: 1000, current mean loss: 0.17
--training batch:  78%|███████▊  | 1999/2567 [07:58<02:09,  4.38it/s]epoch: 21, iteration: 2000, current mean loss: 0.17
--training batch: 100%|██████████| 2567/2567 [10:12<00:00,  4.19it/s]Finish epoch: 21, loss: 426.59, mean loss: 0.17

--validation: 100%|██████████| 321/321 [00:43<00:00,  7.32it/s][Info] Acc.:71.52 
[Info] val Acc.:78.27 
--training batch:  39%|███▉      | 999/2567 [03:59<05:56,  4.39it/s]epoch: 22, iteration: 1000, current mean loss: 0.15
--training batch:  78%|███████▊  | 1999/2567 [07:52<02:03,  4.58it/s]epoch: 22, iteration: 2000, current mean loss: 0.14
--training batch: 100%|██████████| 2567/2567 [10:05<00:00,  4.24it/s]Finish epoch: 22, loss: 367.22, mean loss: 0.14

--validation: 100%|██████████| 321/321 [00:43<00:00,  7.35it/s][Info] Acc.:69.11 
[Info] val Acc.:74.63 

--training batch:  39%|███▉      | 999/2567 [04:01<06:07,  4.26it/s]epoch: 23, iteration: 1000, current mean loss: 0.13
--training batch:  78%|███████▊  | 1999/2567 [07:59<02:38,  3.58it/s]epoch: 23, iteration: 2000, current mean loss: 0.13
--training batch: 100%|█████████▉| 2566/2567 [10:13<00:00,  4.18it/s]
Traceback (most recent call last):
  File "universal_main.py", line 422, in <module>
    main()
  File "universal_main.py", line 398, in main
    constant_values=constant_values)
  File "universal_main.py", line 138, in train
    scaler.scale(loss).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
IndexError: scatter_(): Expected dtype int64 for index.

Seems quite weird. To see where it is coming from, can you try running it again with the TORCH_SHOW_CPP_STACKTRACES=1 environment variable enabled?

Enabling anomaly mode (see the Automatic differentiation package - torch.autograd — PyTorch 1.10.0 documentation) may also help.
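
A minimal sketch of how both could be enabled (the script name is taken from the traceback above; anomaly mode slows training noticeably, so keep it only for the debugging run):

    import torch

    # Export the env var before launching so the C++ frames are printed with the error:
    #   TORCH_SHOW_CPP_STACKTRACES=1 python universal_main.py
    # Anomaly mode records the forward-pass traceback for each op and, when backward
    # fails, reports which forward op produced the failing node; it also checks for NaNs.
    torch.autograd.set_detect_anomaly(True)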

Sure, I can run the experiment again. It will take some time to run, probably a few hours. I will paste the results here.

I was also wondering if there are any vanishing/exploding gradient issues.

Notes on the training process:

    # Mixed-precision training loop (torch.cuda.amp): the GradScaler path is used
    # when config.fp16 is set, otherwise plain fp32 backward/step.
    for epoch in range(num_epochs):
        total_loss = 0
        model.train()
        for iter, feature in tqdm(enumerate(train_dataloader, 1), desc="--training batch", total=len(train_dataloader)):
            optimizer.zero_grad()
            with torch.cuda.amp.autocast(enabled=bool(config.fp16)):
                loss = model(.....).loss
            if config.fp16:
                # Scale the loss, backprop, then unscale so gradients can be clipped
                # at their true magnitude (line 138 in the traceback is this backward).
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)
            else:
                loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            total_loss += loss.item()
            if config.fp16:
                # scaler.step skips the optimizer update if inf/nan gradients were found.
                scaler.step(optimizer)
                scaler.update()
            else:
                optimizer.step()
            scheduler.step()
            model.zero_grad()
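
On the vanishing/exploding-gradient question: since the fp16 branch already unscales before clipping, the total norm returned by clip_grad_norm_ can be logged as a cheap check. A minimal sketch against the loop above (the logging interval of 500 is arbitrary):

    # Drop-in replacement for the clip_grad_norm_ line above: the call returns the
    # total gradient norm *before* clipping, so it doubles as an exploding/vanishing
    # gradient probe.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    if iter % 500 == 0:
        print(f"iteration: {iter}, grad norm before clipping: {grad_norm.item():.4f}")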

Here comes the error again, this time with TORCH_SHOW_CPP_STACKTRACES=1 enabled:


--training batch:  39%|███▉      | 999/2567 [03:57<06:04,  4.30it/s]epoch: 23, iteration: 1000, current mean loss: 0.13
--training batch:  78%|███████▊  | 1999/2567 [07:54<02:37,  3.61it/s]epoch: 23, iteration: 2000, current mean loss: 0.13
--training batch: 100%|█████████▉| 2566/2567 [10:07<00:00,  4.22it/s]
Traceback (most recent call last):
  File "universal_main.py", line 422, in <module>
    if __name__ == "__main__":
  File "universal_main.py", line 398, in main
    dev=conf.device, tokenizer=tokenizer, num_labels=num_labels,
  File "universal_main.py", line 138, in train
    scaler.scale(loss).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
IndexError: scatter_(): Expected dtype int64 for index.
Exception raised from scatter_add_ at /pytorch/aten/src/ATen/native/TensorAdvancedIndexing.cpp:1082 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa0603bda22 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: at::native::scatter_add_(at::Tensor&, long, at::Tensor const&, at::Tensor const&) + 0x11b (0x7fa061ac337b in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0xbccd62 (0x7fa015338d62 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cu.so)
frame #3: at::redispatch::scatter_add_(c10::DispatchKeySet, at::Tensor&, long, at::Tensor const&, at::Tensor const&) + 0xb5 (0x7fa0620ddb15 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x350ddad (0x7fa063d49dad in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: at::redispatch::scatter_add_(c10::DispatchKeySet, at::Tensor&, long, at::Tensor const&, at::Tensor const&) + 0xb5 (0x7fa0620ddb15 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2fae5bb (0x7fa0637ea5bb in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::scatter_add_(long, at::Tensor const&, at::Tensor const&) const + 0x151 (0x7fa062568bc1 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::gather_backward(at::Tensor const&, at::Tensor const&, long, at::Tensor const&, bool) + 0x87 (0x7fa061ad7b67 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x1b8556d (0x7fa0623c156d in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::gather_backward(at::Tensor const&, at::Tensor const&, long, at::Tensor const&, bool) + 0x194 (0x7fa061ef4a04 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::generated::GatherBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1b7 (0x7fa063722837 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x355a3fa (0x7fa063d963fa in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1477 (0x7fa063d91c77 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x47b (0x7fa063d929fb in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7fa063d8ac49 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x53 (0x7fa1051b94b3 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0xbbb2f (0x7fa10639eb2f in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #18: <unknown function> + 0x7fa3 (0x7fa107ea1fa3 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7fa1079e84cf in /lib/x86_64-linux-gnu/libc.so.6)
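
For what it's worth, the C++ frames narrow it down: GatherBackward (frame #11) implements the backward of torch.gather via scatter_add_, and scatter_add_ only accepts an int64 index, which is exactly the message raised. A hypothetical standalone reproduction (not code from universal_main.py) that should produce the same error:

    import torch

    src = torch.randn(4, 3)
    idx = torch.zeros(4, 3, dtype=torch.int32)   # deliberately not int64
    out = torch.zeros(4, 3)
    # Expected to fail with: IndexError: scatter_(): Expected dtype int64 for index
    out.scatter_add_(1, idx, src)

So if the model's forward does a torch.gather anywhere, I would first check that its index tensor is always torch.long (e.g. index = index.long()), especially if that index is built from external data whose dtype can vary between batches.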