Trying to understand a C++ error: torch.autograd.detect_anomaly() magically removes the error

Hi,

I have a weird error that I don’t quite know what to do with. After training a model for several hours without problems, I receive the following error:

terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 1903610121) >= this->size() (which is 5)
Aborted

This error is reproducible for different users on PyTorch 1.1 and PyTorch 1.3 (my config: CUDA 10.1, cuDNN 7.5, driver 418.39, NVIDIA Tesla V100).

Not being able to make any sense of the error, I tried to find the issue by using

with torch.autograd.detect_anomaly():

Interestingly, the error disappeared. My models have been running overnight without issue. So, my first question is: what settings does torch.autograd.detect_anomaly() change that could affect this issue?
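For completeness, this is roughly how I wrapped the step in the context manager. The tiny linear model and random data below are just placeholders to show the usage, not my actual setup:

import torch

# Placeholder model and data, only to illustrate how the context manager is used
model = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)
y = torch.randn(4, 1)

with torch.autograd.detect_anomaly():
    # Forward pass runs with anomaly detection enabled
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Backward pass additionally checks gradients for NaNs and keeps the
    # traceback of the forward op that produced a failing gradient
    loss.backward()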

Of course, I would like to find a permanent solution by getting to the root of the problem. I have implemented my own autograd Function here:

I have narrowed the problem down far enough to say that it must have to do with either InvertToLearnFunction.backward or InvertibleLayer.gradfun, but that’s about it. This has proven really difficult to debug, as the error only appears after several hours of successful training. Any insights or suggestions would be highly appreciated.
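For readers who haven’t written one before, a custom autograd Function has roughly the following shape. This is only a generic skeleton, not the actual InvertToLearnFunction, but it shows where the backward code that the problem seems to come from would live:

import torch

class DoubleFn(torch.autograd.Function):
    """Generic example Function, unrelated to the real InvertToLearnFunction."""

    @staticmethod
    def forward(ctx, x):
        # Save anything the backward pass will need
        ctx.save_for_backward(x)
        return x * 2

    @staticmethod
    def backward(ctx, grad_output):
        # Errors raised here only surface during loss.backward(), which is
        # part of what makes this kind of bug hard to localize
        (x,) = ctx.saved_tensors
        return grad_output * 2

y = DoubleFn.apply(torch.randn(3, requires_grad=True))
y.sum().backward()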

These errors are quite hard to debug, as you’ve probably already experienced. :confused:
I assume detect_anomaly won’t help much regarding your error, since the actual error seems to point to an exception from std::vector.

Since detect_anomaly will most likely slow down your code, the slower execution might have “stabilized” the run by avoiding, e.g., a race condition, but that’s just a wild guess.

Could you try to rerun the code without detect_anomaly, store each batch, and wait until the error is triggered again?
This could give us a reproducible code snippet, so that we could trigger this error in a single step.
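Just to clarify what I mean by storing each batch, something along these lines would be enough. The dataset, file names, and rotating-window size here are of course just placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data loader, only to illustrate the idea
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 2, (64, 1)).float())
loader = DataLoader(dataset, batch_size=8)

for i, (inputs, targets) in enumerate(loader):
    # Persist the current batch (plus the iteration index) before the forward pass,
    # so the batch that triggers the crash can be reloaded and replayed later
    torch.save({'iteration': i, 'inputs': inputs, 'targets': targets},
               'batch_{:03d}.pt'.format(i % 10))  # keep a small rotating window on disk
    # ... forward / backward / optimizer.step() would go here ...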

Thanks for the response!

You are correct, detect_anomaly more than doubles the computation time. My first question was aimed at figuring out how this function could stabilize training, i.e. which settings it changes. Is there documentation anywhere that describes the settings this function adjusts? This could help me find an intermediate solution without all of the overhead of detect_anomaly.
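For reference, these are the only public switches related to this mode that I have found so far; my example further below also toggles torch.set_anomaly_enabled directly. This is just a sketch of the API surface, not of what it changes internally:

import torch

torch.set_anomaly_enabled(True)         # global on/off switch
print(torch.is_anomaly_enabled())       # query the current state -> True
torch.set_anomaly_enabled(False)

with torch.autograd.detect_anomaly():   # scoped variant from my first post
    pass  # forward and backward would go here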

Could you explain to me what you mean exactly by storing each batch?

Hi,

I am still trying to resolve this issue, without success. To make things easier, I have attached example code at the bottom that demonstrates the problem (please note the dependency on my library). Unfortunately, even with the seed fixed and cuDNN set to deterministic, there is no way to reproduce exactly after how many iterations the error will appear. Sometimes it takes 2,000 updates, sometimes it only breaks after 30,000 or more updates (as below). Please note that in my example I am always using the same batch. Again, I have confirmed that torch.set_anomaly_enabled(True) makes the issue disappear.

gdb gives me the following backtrace:

Iteration 49800 : loss = 0.6797990798950195 accuracy = 0.5477447509765625
Iteration 49900 : loss = 0.6751381754875183 accuracy = 0.5528717041015625
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073531805265) >= this->size() (which is 3)

Thread 8 "python" received signal SIGABRT, Aborted.
[Switching to Thread 0x1554f567f700 (LWP 8711)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x0000155554f7b801 in __GI_abort () at abort.c:79
#2 0x0000155546032957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x0000155546038ab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x0000155546038af1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00001555472e917e in execute_native_thread_routine () from $VENVPATH/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#6 0x0000155554d236db in start_thread (arg=0x1554f567f700) at pthread_create.c:463
#7 0x000015555505c88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I don’t quite know what to make of this backtrace myself, but hopefully you can make some sense of it. Do you have any suggestions for how I could proceed from here?

import torch

from irim import InvertibleUnet
from irim import MemoryFreeInvertibleModule

torch.manual_seed(0)
torch.backends.cudnn.enabled = True
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Use CUDA if devices are available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ---- Parameters ---
# Working with images, for time series or volumes set to 1 or 3, respectively
conv_nd = 2
# Number of Householder projections for constructing 1x1 convolutions
n_householder = 3
# Number of channels for each layer of the Invertible Unet
n_channels = [3]
# Number of hidden channel in the residual functions of the Invertible Unet
n_hidden = [16]
# Downsampling factors
dilations = [1]
# Number of IRIM steps
n_steps = 1
# Number of image channels
im_channels = 3
# Number of total samples
n_samples = 64
im_size = 32
learning_rate = 1e-3

# Construct Invertible Unet
model = InvertibleUnet(n_channels=n_channels,n_hidden=n_hidden,dilations=dilations,
                       conv_nd=conv_nd, n_householder=n_householder)

# Wrap the model for Invert to Learn
model = MemoryFreeInvertibleModule(model)

# Move model to CUDA device if possible
model.to(device)

# Use data parallel if possible
#if torch.cuda.device_count() > 1:
#    model = torch.nn.DataParallel(model)

optimizer = torch.optim.Adam(model.parameters(), learning_rate)

# Input data drawn from a standard normal
x_in = torch.randn(n_samples,im_channels,*[im_size]*conv_nd, device=device)
# Binary labels for each sample
y_in = torch.empty(n_samples,1,*[im_size]*conv_nd, device=device).random_(2)

torch.set_anomaly_enabled(False)
for i in range(300000):
  optimizer.zero_grad()
  model.zero_grad()
  # Forward computation
  y_est = model.forward(x_in)
  # We use the first channel for prediction
  y_est = y_est[:,:1]
  loss = torch.nn.functional.binary_cross_entropy_with_logits(y_est, y_in)
  loss.backward()

  optimizer.step()

  if i % 100 == 0:
    y_est = (y_est >= 0.).float()
    accuracy = torch.mean((y_est == y_in).float())
    print('Iteration', i, ': loss =', loss.item(), 'accuracy = ', accuracy.item())

I have run into the same error, terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check: __n (which is 2103833553) >= this->size() (which is 5), after 2 epochs.
Can you offer a solution that fixes the root cause? Thanks in advance!