THCTensorMathPointwise.cu line=464 error=59 : device-side assert triggered

Hi everyone, I get this error when training my network, and I don't see where the error is in my code.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=464 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    train()
  File "train.py", line 154, in train
    loss_id.backward()
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:464

The train() function:

model = Model(pretrained=True)
model = model.cuda()
for i, (inputs, targets) in enumerate(train_loader):
    inputs = inputs.cuda()
    inputs = Variable(inputs)
    outputs = model(inputs)
    targets = targets.cuda()
    targets = Variable(targets)
    loss_id = criterion(outputs, targets)  # criterion is nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss_id.backward()
    optimizer.step()

Thank you for helping me debug this

Running with CUDA_LAUNCH_BLOCKING=1 gives this output:

/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=111 error=59 : device-side assert triggered
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x2b78253c1630>>
Traceback (most recent call last):
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 494, in Client
    deliver_challenge(c, authkey)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    train()
  File "train.py", line 140, in train
    loss_id = criterion(outputs, targets)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 862, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1407, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111

Could you check the target values? They are supposed to be in the range [0, nb_classes-1].
Maybe some of them are outside this range, which would trigger this assert.
Also, you might want to run the code on CPU to get a (hopefully better) error message.
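For example, a minimal sketch of such a check, assuming your existing train_loader and a variable nb_classes holding the number of classes your model predicts:

# Hypothetical sanity check; nb_classes is however many classes your model should predict.
for i, (inputs, targets) in enumerate(train_loader):
    # nn.CrossEntropyLoss expects labels in [0, nb_classes - 1]
    if targets.min().item() < 0 or targets.max().item() >= nb_classes:
        print("batch {}: bad target range, min={}, max={}".format(
            i, targets.min().item(), targets.max().item()))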

Thank you for your help
I printed the target values for the first epoch:

tensor([111, 695, 202, 720, 217, 434, 471, 625, 379, 201,   3, 273, 469, 147,
        180, 141, 378, 326, 418, 725,  32, 225, 176, 297, 452, 328, 264, 208,
        441, 689, 580, 155], device='cuda:0')
tensor([260, 129, 505, 586, 529, 662, 570, 289, 391, 702, 433, 228, 139,  66,
         52, 732, 678, 437, 619, 526, 369, 100, 709, 181, 402, 460, 231, 492,
        516, 374, 440, 299], device='cuda:0')
tensor([157, 412, 149, 651, 669,   5,  31, 632, 680, 145, 103, 478, 213, 115,
        576, 738, 253,  17, 182, 327, 464, 252, 590, 706,  51, 238,   6, 553,
        107, 302, 240, 721], device='cuda:0')
tensor([532, 491, 693, 664, 567, 530, 135, 150, 515, 436, 119, 300, 655, 628,
        604,  23, 683, 332, 204, 481, 387, 315, 153, 750, 518, 456, 222, 511,
        200, 218,  36, 271], device='cuda:0')
tensor([739, 631, 599, 110, 601, 746, 183, 400, 520,  91, 169,   2, 239, 422,
        286, 455, 639, 483, 280, 479, 741, 340,  59, 524, 510, 396, 211, 144,
        419, 508, 577, 288], device='cuda:0')
tensor([125,  39,  63,  96, 519, 366, 463, 417,  55,  81, 574, 350, 197, 684,
         18, 334, 428, 597, 185, 338, 295, 493, 215,  76, 159, 488, 188, 536,
        555, 710,  38, 408], device='cuda:0')
tensor([283, 151,   9, 548, 486,  82, 186, 236, 652, 623, 170,  45, 626, 457,
        504, 609, 514, 281, 163, 318, 591, 177, 560, 672, 588, 267, 700, 542,
        195, 444, 256, 635], device='cuda:0')
tensor([353, 311, 376, 371, 677,  14, 321, 320, 282,  72, 291, 665, 543, 259,
        551, 584, 733,  97, 301,  86, 679, 410, 600,  48,  19, 336, 189, 317,
        346, 716, 610, 735], device='cuda:0')
tensor([ 24, 223, 521, 257, 382, 569, 415, 587, 290,  69, 137, 312, 337, 734,
        232, 160, 717, 503, 713, 612,  21, 199, 158, 643, 575, 620, 594, 388,
        210,  64, 568, 556], device='cuda:0')
tensor([262,  79, 657,  92, 461,  54, 166, 187, 429, 308,   1, 485, 666,  42,
        496, 352,  37, 497, 675, 356, 203, 431, 482, 697, 243, 547, 305, 196,
        468, 314, 432,   8], device='cuda:0')
tensor([644,  11, 737, 322, 355, 383, 581,  95, 642, 335,  56, 743, 636, 686,
        659, 673, 397, 138, 292,  50,  98, 613, 279,  78, 358, 132, 549, 707,
        633, 681, 194, 390], device='cuda:0')
tensor([389, 454, 539, 540, 130, 480, 624, 426, 233, 167, 362, 459, 660, 123,
         88,  71, 699, 685, 435, 172, 219, 255, 127, 365, 656, 668, 269, 490,
        618, 310, 101,  44], device='cuda:0')
tensor([650, 152, 105, 506, 663, 168, 298, 270, 641, 142, 611, 446, 489, 647,
        407, 126, 607, 670,  70, 373, 557, 726, 598, 209,  87, 494, 227, 339,
        319, 714, 304, 272], device='cuda:0')
tensor([416, 525, 124, 722, 424, 671, 140,  35, 276, 667, 254, 622, 544, 742,
        582, 731,  58, 381,  74, 592, 438, 285, 749, 361, 206, 370, 487,  65,
        447, 134, 747, 414], device='cuda:0')
tensor([661, 349, 637, 698, 629, 676, 405,  10, 104, 143, 708, 694, 701, 146,
         89, 306, 303, 224, 534, 404, 451, 559, 207, 263,  43, 242, 538, 646,
        727, 386, 658, 385], device='cuda:0')
tensor([248, 357,  90, 564, 120, 164, 498, 608, 522, 472, 220, 736, 345, 509,
        293, 212, 234, 325, 616, 606, 589, 393, 528, 348, 102, 347,  61, 423,
        705, 718, 696, 596], device='cuda:0')
tensor([719, 343,  99, 192, 499, 674, 740, 723, 284, 122, 430, 501, 605, 109,
         77, 113, 579, 537, 112, 171, 333, 545, 561,  85,  34, 341,  84, 406,
        654,  20, 323, 316], device='cuda:0')
Time taken: 14.64 sec.

-----------------------------------
Epoch: [2/150]
tensor([ 57, 159, 486, 182,  69, 625, 607, 370, 742, 559, 406, 271, 174, 173,
        347, 338, 427, 743,  62,  74, 396, 185, 693, 603, 168, 731, 462, 266,
         32, 143,  70, 122], device='cuda:0')
tensor([ 68, 386, 648,  54, 456, 301, 366, 536, 204, 732, 523, 748, 481, 653,
        435, 468, 155, 272, 141,  50, 690, 516, 440, 264, 429, 639, 258,  41,
        539, 545, 205, 333], device='cuda:0')

No target value is less than 0 or greater than 750, and I have 751 classes.
I don't understand how the criterion knows about n_classes. It isn't defined during initialization, criterion = nn.CrossEntropyLoss(), and the error happens during the first call of loss = criterion(outputs, targets).
On CPU, it gives this error:

  File "train.py", line 142, in train
    loss_id = criterion(outputs, targets)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 862, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1407, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THNN/generic/ClassNLLCriterion.c:93

The number of classes is defined by the shape of your model's output.
E.g. an output of shape [batch_size, 10] corresponds to 10 different classes.
Try putting an assert statement or some other check inside your training loop to catch invalid values in your target, as in the sketch below.
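A minimal sketch of such a check, placed right before the loss call in your loop (the only assumption is that outputs are the raw logits of shape [batch_size, n_classes]):

outputs = model(inputs)
# The second dimension of the logits is what CrossEntropyLoss treats as n_classes.
n_classes = outputs.size(1)
assert targets.min().item() >= 0 and targets.max().item() < n_classes, \
    "invalid targets: min={}, max={}, n_classes={}".format(
        targets.min().item(), targets.max().item(), n_classes)
loss_id = criterion(outputs, targets)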

OK, thank you. My mistake: the output of the model was (batch_size, 512), and I had to change it to 751.
Thank you
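For reference, a minimal sketch of that kind of fix, assuming the backbone ends in 512-dimensional features and the final layer is called fc (the attribute name here is hypothetical):

import torch.nn as nn

num_classes = 751
# Hypothetical: replace the final layer so the logits have one column per class.
model.fc = nn.Linear(512, num_classes)
model = model.cuda()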