RuntimeError: CUDNN_STATUS_EXECUTION_FAILED in cudnn_batch_norm_backward (allow_unreachable flag)

I am trying to train my first CNN after trying an existing one, but it didn't work. This error appeared and I can't understand it.


I work on Ubuntu 18.04 with a GTX 1660 Ti 6 GB. This is a code sample that I think causes the error:

The code:

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (features, targets, levels, x) in enumerate(train_loader):
        features = features.to(DEVICE)
        targets = targets.to(DEVICE)
        levels = levels.to(DEVICE)

        logits, probas = model(features)
        if epoch >= 190:
            print('\n i=', batch_idx, 'logits =', logits)
            print('\n i=', batch_idx, 'probas =', probas)

        # build a per-sample importance factor and scale the logits with it
        impf = torch.ones([logits.shape[0], NUM_CLASSES])
        for i in range(len(x)):
            impf[i] = impFactor(x[i])
        impf = impf.to(DEVICE)
        logits = (logits * impf).to(DEVICE)
        cost = cost_fn(logits, levels)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

Which PyTorch, CUDA and cudnn versions are you using?
Also, could you post the model definition as well as the shapes of all tensors, so that we could reproduce and debug this issue, please?
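Printing the shapes once inside the loop should be enough; a rough sketch using the variable names from your snippet:

# e.g. right after the forward pass, in the first iteration only
if epoch == 0 and batch_idx == 0:
    print(features.shape, targets.shape, levels.shape)
    print(logits.shape, probas.shape)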


I am using PyTorch 1.5.0 and my CUDA version is 10.2 on Ubuntu 18.04.


I used ResNet-34:
def resnet34(num_classes, grayscale):
    """Constructs a ResNet-34 model."""
    model = ResNet(block=BasicBlock,
                   layers=[3, 4, 6, 3],
                   num_classes=num_classes,
                   grayscale=grayscale)
    return model
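For completeness, this is roughly how the model is created and moved to the GPU; a minimal sketch, assuming NUM_CLASSES and DEVICE are the constants already used in the training loop (GRAYSCALE is a placeholder flag here):

model = resnet34(num_classes=NUM_CLASSES, grayscale=GRAYSCALE)
model = model.to(DEVICE)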
Now, when I run my script on a small dataset it works perfectly, but the same script on the large dataset causes a different error at the line impf = impf.to(DEVICE): RuntimeError: CUDA error: unspecified launch failure.

It works fine until epoch 31, then the error appears: RuntimeError: CUDA error: unspecified launch failure.
Sometimes it works fine until epoch 39, then the same error appears.

Another time, the run stopped at epoch 107 of 200 epochs.

Could you check if you are running out of memory, please?

Do you mean memcheck? Yes, I do, and this is the output:
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

And if this may be the cause, what is the solution? Can reducing the batch size fix it?

No, I meant if your GPU memory is filling up and you thus cannot allocate any more data on the device.
You can check the memory usage via nvidia-smi or in your script via e.g. torch.cuda.memory_allocated().

Are you using custom CUDA code or did you execute cuda-memcheck just on the complete PyTorch model?

I executed cuda-memcheck on the complete PyTorch model.

Please explain more. Where should I add this in my script?

Is this correct, given that the tensors logits and impf have the same size?

You could add it e.g. at the beginning and at the end of each iteration to check the allocated memory, which would show if you are close to the device limit. Note that this call does not return the memory usage of the CUDA context or from other applications.
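A minimal sketch of where these calls could go, reusing the loop from your first post (the dots stand for the forward and backward pass that is already there):

for batch_idx, (features, targets, levels, x) in enumerate(train_loader):
    # allocated tensor memory (in bytes) before the iteration
    print('memory allocated before', torch.cuda.memory_allocated(DEVICE))

    # ... forward pass, cost computation, backward pass, optimizer step ...

    # a steadily growing value here would point to a memory leak
    print('memory allocated after', torch.cuda.memory_allocated(DEVICE))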


The output is:
Epoch: 001/200 | Batch 0000/0343 | Cost: 38.0783
memory allocated after 369382912
memory allocated before 346285056
memory allocated after 369382912
memory allocated before 346285056
memory allocated after 369382912
memory allocated before 346285056
memory allocated after 369382912
memory allocated before 346285056
memory allocated after 369382912
memory allocated before 346285056
memory allocated after 369382912

This repeated for all epochs, so I found that the used memory is almost constant during each iteration. On the other hand, the run stopped at epoch 17 with a new error:

In that case could you rerun the code with

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace here?
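If it is easier, the flag can also be set inside the script; a minimal sketch (my suggestion, assuming it runs before the first CUDA call, ideally before importing torch):

import os
# must be set before the CUDA context is initialized
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch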

I am already using this format.

Could you update to PyTorch 1.6 or the nightly/master, since 1.5 had an issue where device assert statements were ignored?
This could mean that you are in fact hitting a valid assert.
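After updating, the installed versions can be checked directly from Python:

import torch
print(torch.__version__)               # PyTorch version, e.g. 1.6.0
print(torch.version.cuda)              # CUDA version the binaries were built with
print(torch.backends.cudnn.version())  # cudnn version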

Do you mean that PyTorch 1.5 is the reason, not an error in my script?

I noticed that the code stopped at the same line many times:

impf = impf.to(DEVICE)

while it works well on the small dataset.

Yes, since (some) assert statements were broken in PyTorch 1.5, you would have to update to 1.6 or the nightly binary.

Is this a new issue or why is the code not crashing anymore?