CUDA unhandled exception

Hello,
I have a small net running in C++ (LibTorch). Input and target are defined as follows:
auto input = torch::randn({ TRAIN_BATCH_SIZE, 1, IMAGE_SIZE, IMAGE_SIZE }).to(device);
auto target = torch::zeros(TRAIN_BATCH_SIZE, torch::kInt64).to(device);

The data is filled with memcpy (CPU) or cudaMemcpy (CUDA). On the CPU the net converges, but with CUDA I get an exception in loss.backward():

// memcpy or cudaMemcpy
LoadEpoch(input, target, cvImage, cvResult, imgcount);

optimizer.zero_grad();   // zero the gradient buffers
auto output = net->forward(input);
auto loss = criterion(output, target);
loss.backward();
optimizer.step();

The layout of all tensors is the same whether CUDA is used or not. forward() and the NLLLoss computation always succeed; only backward() throws an exception:

c10::Error address 0x0000005A15D6C770.

The net is moved to the device and the optimizer is built from net->parameters().

Many thanks for your help.

With help from the thread ‘Libtorch loss.backward(); C10 Error’ I got the actual error description: ‘CUDA error: device-side assert triggered’.
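For anyone else who only sees the bare c10::Error address: a minimal sketch for surfacing the message is to wrap the call in a try/catch and print e.what(). Running with the environment variable CUDA_LAUNCH_BLOCKING=1 additionally makes the assert report at the kernel launch that actually failed.

#include <iostream>

try {
    loss.backward();
}
catch (const c10::Error& e) {
    // c10::Error derives from std::exception; what() carries the full text,
    // here 'CUDA error: device-side assert triggered'
    std::cerr << e.what() << std::endl;
    throw;
}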
In the article ‘CUDA Error: Device-Side Assert Triggered: Solved | Built In’ this error is described as being ‘caused by an inconsistency between the number of labels and output units or an incorrect input for a loss function’. So I checked my target tensor:

auto target = torch::zeros(TRAIN_BATCH_SIZE, torch::kInt64).to(device);
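The definition itself is fine. One way to check the actual values is to copy the target back to the host and verify that every class index lies in [0, NET_OUTPUT_COUNT); this is a sketch, reusing the NET_OUTPUT_COUNT name from the copy shown below:

// Sanity check: NLLLoss expects class indices in [0, NET_OUTPUT_COUNT)
auto t = target.to(torch::kCPU);
TORCH_CHECK(t.min().item<int64_t>() >= 0 && t.max().item<int64_t>() < NET_OUTPUT_COUNT,
            "target holds an out-of-range class index");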

Without the cudaMemcpy there is no exception. The error was in this line:

cudaMemcpy(&resdata[i * NET_OUTPUT_COUNT + j], &cvResult[imgcount], sizeof(int64_t), cudaMemcpyHostToDevice);
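The target is a 1-D tensor with one class index per sample (TRAIN_BATCH_SIZE entries), so writing at offset i * NET_OUTPUT_COUNT + j runs past the end of the tensor and leaves invalid label values behind, which is exactly the kind of out-of-range index that trips the device-side assert in NLLLoss. Assuming one label per image was intended, the copy presumably needs to address the target by sample index only:

// one int64 class index per sample, written at offset i
cudaMemcpy(&resdata[i], &cvResult[imgcount], sizeof(int64_t), cudaMemcpyHostToDevice);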