Data becomes NaN after being fed to GPUs

When I try to feed data to the GPUs, I sometimes find that the image data becomes NaN and the label data sometimes becomes -1. The code is below; I also paste the output of this piece of code:

        minibatch = dataloader.next()
        imgs = minibatch['data']
        gts = minibatch['label']

        # check for NaNs while the tensors are still on the CPU
        if torch.isnan(imgs).any().item():
            print('There is NAN in imgs before sent to cuda')
        else:
            print('There is no NAN in imgs before sent to cuda')

        print('imgs:', np.amax(imgs.numpy()), np.amin(imgs.numpy()))
        print('gts:', np.unique(gts.numpy()))
        print('gts:', type(gts))

        if engine.distributed:
            # multi-GPU: move the batch to the current device
            imgs = imgs.cuda(non_blocking=True)
            gts = gts.cuda(non_blocking=True)
            print('gts tensor on cuda:', torch.unique(gts))

            # check again right after the transfer
            if torch.isnan(imgs).any().item():
                print('There is NAN in imgs on cuda')
Output:
There is no NAN in imgs before sent to cuda
There is no NAN in imgs before sent to cuda
There is no NAN in imgs before sent to cuda
There is no NAN in imgs before sent to cuda
imgs: 0.594 -0.46539214
imgs: 0.594 -0.46539214
imgs: 0.594 -0.44970587
imgs: 0.594 -0.46147057
gts: [ 0 1 2 4 5 8 9 10 11 13 18 255]
gts: <class 'torch.Tensor'>
gts: [ 0 1 2 5 7 8 10 11 13 15 18 255]
gts: <class 'torch.Tensor'>
gts tensor on cuda: tensor([ 0, 1, 2, 4, 5, 8, 9, 10, 11, 13, 18, 255],
device='cuda:3')
gts tensor: tensor([ 0, 1, 2, 4, 5, 8, 9, 10, 11, 13, 18, 255],
device='cuda:3')
gts: [ 0 1 2 3 5 6 7 8 9 11 13 18 255]
gts: <class 'torch.Tensor'>
gts tensor on cuda: tensor([ 0, 1, 2, 5, 7, 8, 10, 11, 13, 15, 18, 255],
device='cuda:1')
There is NAN in imgs on cuda
gts tensor: tensor([ 0, 1, 2, 5, 7, 8, 10, 11, 13, 15, 18, 255],
device='cuda:1')
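
To rule out the data pipeline, a stripped-down copy test like the one below could be used to check whether the CPU-to-GPU copy alone introduces NaNs. This is a minimal, hypothetical sketch that bypasses the dataloader and the engine entirely; the `check_copy` name, tensor shape, and iteration count are just placeholders:

    import torch

    # Minimal, self-contained copy test (sketch): build a known-good tensor on
    # the CPU, copy it to each visible GPU with the same non_blocking pattern
    # as above, and test for NaNs after the transfer.
    def check_copy(num_iters=100):
        for device_id in range(torch.cuda.device_count()):
            torch.cuda.set_device(device_id)
            for _ in range(num_iters):
                cpu_tensor = torch.randn(4, 3, 512, 512).pin_memory()  # finite values only
                assert not torch.isnan(cpu_tensor).any().item()
                gpu_tensor = cpu_tensor.cuda(non_blocking=True)
                torch.cuda.synchronize()  # make sure the async copy has finished
                if torch.isnan(gpu_tensor).any().item():
                    print('NaN appeared after copying to cuda:%d' % device_id)
                    return
        print('No NaNs introduced by the copy on any visible device')

    if __name__ == '__main__':
        check_copy()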

Environment:
Python 3.6
torch 1.1.0
torchvision 0.3.0
CUDA 9.0

Could you update to the latest stable release (1.5) and rerun the code?
Also, could you post some information about your system, e.g. which GPU(s) are you using?
We had a similar issue some time ago and if I’m not mistaken, it turned out to be a hardware failure.
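
Something like the snippet below (just a sketch; add whatever else seems relevant, e.g. the driver version reported by `nvidia-smi`) would print the PyTorch/CUDA versions and the names of the visible GPUs:

    import torch

    # Print library versions and the visible GPUs.
    print('torch:', torch.__version__)
    print('CUDA (build):', torch.version.cuda)
    print('cuDNN:', torch.backends.cudnn.version())
    for i in range(torch.cuda.device_count()):
        print('cuda:%d ->' % i, torch.cuda.get_device_name(i))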

@ptrblck
Thank you for your reply.

I’m using a Tesla V100 GPU on an Ubuntu system.

I’m trying a different version of PyTorch.

@ptrblck I solved it by rebooting the machine. :sweat_smile: :sweat_smile: :sweat_smile: