THCTensorMathPointwise.cu line=464 error=59 : device-side assert triggered

Hi everyone, I get this error when training my network, and I don't see where the error is in my code.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=464 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    train()
  File "train.py", line 154, in train
    loss_id.backward()
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:464

The train() function:

model = Model(pretrained=True)
model = model.cuda()
for i, (inputs, targets) in enumerate(train_loader):
    inputs = inputs.cuda()
    inputs = Variable(inputs)
    outputs = model(inputs)
    targets = targets.cuda()
    targets = Variable(targets)
    loss_id = criterion(outputs, targets)  # criterion is nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss_id.backward()
    optimizer.step()

Thank you for helping me debug this

Running with CUDA_LAUNCH_BLOCKING=1 gives this output:

/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=111 error=59 : device-side assert triggered
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x2b78253c1630>>
Traceback (most recent call last):
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 494, in Client
    deliver_challenge(c, authkey)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/fstu1/miniconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    train()
  File "train.py", line 140, in train
    loss_id = criterion(outputs, targets)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 862, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1407, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:111

Could you check the target values? They are supposed to be in the range [0, nb_classes-1].
Maybe some of them are outside this range, which would trigger this assert.
Also, you might want to run the code on CPU to get a (hopefully better) error message.
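For example, a minimal sketch of such a check, assuming your existing train_loader and a variable nb_classes holding the number of classes your model predicts:

# Hypothetical sanity check; nb_classes is however many classes your model should predict.
for i, (inputs, targets) in enumerate(train_loader):
    # nn.CrossEntropyLoss expects labels in [0, nb_classes - 1]
    if targets.min().item() < 0 or targets.max().item() >= nb_classes:
        print("batch {}: bad target range, min={}, max={}".format(
            i, targets.min().item(), targets.max().item()))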

Thank you for your help
I printed the target values for the first epoch:

tensor([111, 695, 202, 720, 217, 434, 471, 625, 379, 201,   3, 273, 469, 147,
        180, 141, 378, 326, 418, 725,  32, 225, 176, 297, 452, 328, 264, 208,
        441, 689, 580, 155], device='cuda:0')
tensor([260, 129, 505, 586, 529, 662, 570, 289, 391, 702, 433, 228, 139,  66,
         52, 732, 678, 437, 619, 526, 369, 100, 709, 181, 402, 460, 231, 492,
        516, 374, 440, 299], device='cuda:0')
tensor([157, 412, 149, 651, 669,   5,  31, 632, 680, 145, 103, 478, 213, 115,
        576, 738, 253,  17, 182, 327, 464, 252, 590, 706,  51, 238,   6, 553,
        107, 302, 240, 721], device='cuda:0')
tensor([532, 491, 693, 664, 567, 530, 135, 150, 515, 436, 119, 300, 655, 628,
        604,  23, 683, 332, 204, 481, 387, 315, 153, 750, 518, 456, 222, 511,
        200, 218,  36, 271], device='cuda:0')
tensor([739, 631, 599, 110, 601, 746, 183, 400, 520,  91, 169,   2, 239, 422,
        286, 455, 639, 483, 280, 479, 741, 340,  59, 524, 510, 396, 211, 144,
        419, 508, 577, 288], device='cuda:0')
tensor([125,  39,  63,  96, 519, 366, 463, 417,  55,  81, 574, 350, 197, 684,
         18, 334, 428, 597, 185, 338, 295, 493, 215,  76, 159, 488, 188, 536,
        555, 710,  38, 408], device='cuda:0')
tensor([283, 151,   9, 548, 486,  82, 186, 236, 652, 623, 170,  45, 626, 457,
        504, 609, 514, 281, 163, 318, 591, 177, 560, 672, 588, 267, 700, 542,
        195, 444, 256, 635], device='cuda:0')
tensor([353, 311, 376, 371, 677,  14, 321, 320, 282,  72, 291, 665, 543, 259,
        551, 584, 733,  97, 301,  86, 679, 410, 600,  48,  19, 336, 189, 317,
        346, 716, 610, 735], device='cuda:0')
tensor([ 24, 223, 521, 257, 382, 569, 415, 587, 290,  69, 137, 312, 337, 734,
        232, 160, 717, 503, 713, 612,  21, 199, 158, 643, 575, 620, 594, 388,
        210,  64, 568, 556], device='cuda:0')
tensor([262,  79, 657,  92, 461,  54, 166, 187, 429, 308,   1, 485, 666,  42,
        496, 352,  37, 497, 675, 356, 203, 431, 482, 697, 243, 547, 305, 196,
        468, 314, 432,   8], device='cuda:0')
tensor([644,  11, 737, 322, 355, 383, 581,  95, 642, 335,  56, 743, 636, 686,
        659, 673, 397, 138, 292,  50,  98, 613, 279,  78, 358, 132, 549, 707,
        633, 681, 194, 390], device='cuda:0')
tensor([389, 454, 539, 540, 130, 480, 624, 426, 233, 167, 362, 459, 660, 123,
         88,  71, 699, 685, 435, 172, 219, 255, 127, 365, 656, 668, 269, 490,
        618, 310, 101,  44], device='cuda:0')
tensor([650, 152, 105, 506, 663, 168, 298, 270, 641, 142, 611, 446, 489, 647,
        407, 126, 607, 670,  70, 373, 557, 726, 598, 209,  87, 494, 227, 339,
        319, 714, 304, 272], device='cuda:0')
tensor([416, 525, 124, 722, 424, 671, 140,  35, 276, 667, 254, 622, 544, 742,
        582, 731,  58, 381,  74, 592, 438, 285, 749, 361, 206, 370, 487,  65,
        447, 134, 747, 414], device='cuda:0')
tensor([661, 349, 637, 698, 629, 676, 405,  10, 104, 143, 708, 694, 701, 146,
         89, 306, 303, 224, 534, 404, 451, 559, 207, 263,  43, 242, 538, 646,
        727, 386, 658, 385], device='cuda:0')
tensor([248, 357,  90, 564, 120, 164, 498, 608, 522, 472, 220, 736, 345, 509,
        293, 212, 234, 325, 616, 606, 589, 393, 528, 348, 102, 347,  61, 423,
        705, 718, 696, 596], device='cuda:0')
tensor([719, 343,  99, 192, 499, 674, 740, 723, 284, 122, 430, 501, 605, 109,
         77, 113, 579, 537, 112, 171, 333, 545, 561,  85,  34, 341,  84, 406,
        654,  20, 323, 316], device='cuda:0')
Time taken: 14.64 sec.

-----------------------------------
Epoch: [2/150]
tensor([ 57, 159, 486, 182,  69, 625, 607, 370, 742, 559, 406, 271, 174, 173,
        347, 338, 427, 743,  62,  74, 396, 185, 693, 603, 168, 731, 462, 266,
         32, 143,  70, 122], device='cuda:0')
tensor([ 68, 386, 648,  54, 456, 301, 366, 536, 204, 732, 523, 748, 481, 653,
        435, 468, 155, 272, 141,  50, 690, 516, 440, 264, 429, 639, 258,  41,
        539, 545, 205, 333], device='cuda:0')

No target value is less than 0 or greater than 750, and I have 751 classes.
I don't understand how the criterion knows about n_classes. It isn't defined during initialization, criterion = nn.CrossEntropyLoss(), and the error happens during the first call of loss = criterion(outputs, targets).
On CPU, it gives this error:

  File "train.py", line 142, in train
    loss_id = criterion(outputs, targets)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 862, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/fstu1/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1407, in nll_loss
    return torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THNN/generic/ClassNLLCriterion.c:93

The number of classes is defined by the shape of your model's output.
E.g. an output of shape [batch_size, 10] corresponds to 10 different classes.
Try putting an assert statement or some other check inside your training loop to catch invalid values in your target, as in the sketch below.
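A minimal sketch of such a check, placed right before the loss call in your loop (the only assumption is that outputs are the raw logits of shape [batch_size, n_classes]):

outputs = model(inputs)
# The second dimension of the logits is what CrossEntropyLoss treats as n_classes.
n_classes = outputs.size(1)
assert targets.min().item() >= 0 and targets.max().item() < n_classes, \
    "invalid targets: min={}, max={}, n_classes={}".format(
        targets.min().item(), targets.max().item(), n_classes)
loss_id = criterion(outputs, targets)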

OK, thank you. My mistake: the output of the model was (batch_size, 512), and I had to change it to 751.
Thank you
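For reference, a minimal sketch of that kind of fix, assuming the backbone ends in 512-dimensional features and the final layer is called fc (the attribute name here is hypothetical):

import torch.nn as nn

num_classes = 751
# Hypothetical: replace the final layer so the logits have one column per class.
model.fc = nn.Linear(512, num_classes)
model = model.cuda()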