High validation loss on AMD RX6600 GPU but ok on NVIDIA GPU

Hello everyone. I am running into a problem, so let me explain what I am trying to do.
I have a traffic and road sign dataset that contains 43 classes, and I am trying to classify the images using the pre-trained resnet34 model. I run the model on my AMD RX6600 GPU. To do that I am using PyTorch DirectML, with this code

import torch_directml
dml = torch_directml.device()

to find the device. Then, using this dml instance, I push the model and training data to the GPU. Up to this point everything works fine: training speed is fast enough, GPU utilization is near 100%, and training loss decreases every epoch. But when I evaluate the model on the validation data after each training phase, validation loss increases and validation accuracy is very low, even though training looks fine. When I run the same code on my friend’s PC, which has an NVIDIA GPU, everything is OK: validation loss decreases, the model converges, and I get an accuracy of 98%. I cannot figure out what the problem is. I also tuned the hyperparameters but had no luck. One strange thing is that this problem arises when I use a CNN-based model. I ran the pre-trained NLP model BERT on my AMD GPU and there was no issue: validation loss decreased and the model converged. Can anyone help me with this issue? My code is below. Thanks in advance.

Model Initialization

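import torch
import torch.nn as nn
import torchvision

def create_model():
    # Load ResNet-34 with pre-trained weights
    model = torchvision.models.resnet34(weights='ResNet34_Weights.DEFAULT')
    n_features = model.fc.in_features

    # Replace the final layer with a classifier head for the 43 classes
    model.fc = nn.Sequential(
        nn.Linear(n_features, 256),
        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, 43)
    )

    return model.to(dml)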

base_model = create_model()

Hyper parameters

num_classes = 43
num_epochs = 10
learning_rate = 1e-4
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(base_model.parameters(), lr=learning_rate)

Training and Validation loop

def train_model():
    since = time.time()
    val_acc_history = []
    
    best_model_wts = copy.deepcopy(base_model.state_dict())
    best_acc = 0.0
    
    progress_bar_train = tqdm(range(num_epochs * len(train_loader)))
    progress_bar_eval = tqdm(range(num_epochs * len(validation_loader)))
    
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        
        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                base_model.train()  # Set model to training mode
            else:
                base_model.eval()   # Set model to evaluate mode
                
            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(dml)
                labels = labels.to(dml)
                
                # zero the parameter gradients
                optimizer.zero_grad()
                
                with torch.set_grad_enabled(phase == 'train'): 
                    outputs = base_model(inputs)
                    loss = criterion(outputs, labels)

                    _, preds = torch.max(outputs, 1)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                        progress_bar_train.update(1)
                    elif phase == 'val':
                        progress_bar_eval.update(1)
                        
                running_loss += loss.item() * inputs.size(0)
                preds = preds.cpu()
                labels = labels.cpu()
                running_corrects += (preds == labels).sum()
            
            print("Length: ", len(dataloaders[phase].dataset))
            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = float(running_corrects) / len(dataloaders[phase].dataset)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))        
            
            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(base_model.state_dict())
            if phase == 'val':
                val_acc_history.append(epoch_acc)

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    base_model.load_state_dict(best_model_wts)
    return base_model, val_acc_history

Calling the training function

best_model, validation_acc_hist = train_model()

Please help me

Hi Atiqur!

I don’t know anything about Directml nor your AMD gpu, but this does
seem to be the kind of thing that could cause this sort of discrepancy.
I think that Directml is pretty new and still a work in progress. You might
check that you’re using the latest version in case there are any recent
bug fixes.

Of course things besides the gpu could differ between your friend’s
system and yours. Check, for example, your version of pytorch (and
python).

Then I would try some simple tests to narrow things down.

First, make sure that you can reproduce your calculations when you
rerun them. In particular, use torch.manual_seed() to initialize
pytorch’s random-number generator to a specific initial state.

Perform a single forward pass using your cpu and make sure that you
can reproduce it. Then try the same forward pass on your gpu. Do
you get the same result as on your cpu, up to some reasonable
floating-point round-off error?

If not, try to find the simplest operation that gives different results on your
cpu and gpu. For example, try passing a single input sample through a
single Linear layer or even something as simple as a single matrix
multiplication.
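
The comparison described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the helper name max_abs_diff and the layer sizes are made up for illustration, and the script falls back to the cpu when torch_directml is not installed.

```python
import copy

import torch
import torch.nn as nn

# Assumption: use the DirectML device if the package is installed,
# otherwise fall back to the cpu so the script still runs.
try:
    import torch_directml
    device = torch_directml.device()
except ImportError:
    device = torch.device("cpu")

torch.manual_seed(0)  # make the weights and the input reproducible

layer_cpu = nn.Linear(16, 4)
layer_dev = copy.deepcopy(layer_cpu).to(device)  # identical weights on the device
x = torch.randn(8, 16)


def max_abs_diff(a, b):
    """Largest element-wise difference, computed on the cpu (hypothetical helper)."""
    return (a.detach().cpu() - b.detach().cpu()).abs().max().item()


with torch.no_grad():
    # Same input through the same Linear layer on both devices
    diff_linear = max_abs_diff(layer_cpu(x), layer_dev(x.to(device)))
    # Even simpler: a single matrix multiplication
    diff_matmul = max_abs_diff(x @ x.T, x.to(device) @ x.to(device).T)

print("Linear forward, max abs diff:", diff_linear)
print("matmul,         max abs diff:", diff_matmul)
```

Differences around 1e-6 or smaller would be ordinary float32 round-off; anything much larger points at the operation being tested.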

If your cpu / gpu results agree, try the same experiment on your friend’s
system.

If all four results (cpu / gpu on the two systems) agree for a single forward
pass, then try a single backward pass and see if you get the
same gradients. Then try a single optimizer step, and so on.
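
A gradient comparison along these lines might look like the sketch below. The tiny two-layer model, the random data, and the name grad_report are assumptions for illustration, not the model from this thread, and the script again falls back to the cpu if torch_directml is missing.

```python
import copy

import torch
import torch.nn as nn

# Assumption: DirectML device if available, cpu otherwise.
try:
    import torch_directml
    device = torch_directml.device()
except ImportError:
    device = torch.device("cpu")

torch.manual_seed(0)

# Two copies of the same small model with identical initial weights
model_cpu = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 3))
model_dev = copy.deepcopy(model_cpu).to(device)

x = torch.randn(8, 16)
y = torch.randint(0, 3, (8,))
criterion = nn.CrossEntropyLoss()

# One forward and one backward pass on each device
criterion(model_cpu(x), y).backward()
criterion(model_dev(x.to(device)), y.to(device)).backward()

# Per-parameter maximum gradient difference; None means the device
# copy never received a gradient at all.
grad_report = {}
pairs = zip(model_cpu.named_parameters(), model_dev.named_parameters())
for (name, p_cpu), (_, p_dev) in pairs:
    if p_dev.grad is None:
        grad_report[name] = None
    else:
        grad_report[name] = (p_cpu.grad - p_dev.grad.cpu()).abs().max().item()
    print(name, grad_report[name])
```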

The idea is to narrow things down to where you first get a discrepancy
(and determine whether that discrepancy is specifically due to your AMD
gpu).

If it is your gpu (with Directml), please post the simplest script you can
put together that reproduces the discrepancy. (So, if a single matrix
multiplication reproduces the issue, don’t post a script that runs several
epochs of training on a resnet.)

Best.

K. Frank

Thank you so much, I will try all the steps you mentioned.

After a lot of debugging, I have found something. After one backward pass, I printed the sum of each parameter's gradient using this code.

for param in base_model.parameters():
    print((param.grad.data).cpu().sum())

When the model is on the CPU, it prints the sums, which means gradients have been computed for all the model parameters after one pass.

But when the model is on the GPU, this error happens.

tensor(-4.5475e-13)
tensor(2.2737e-13)
tensor(4.5475e-13)
tensor(-1.3588e-06)
tensor(-1.8833e-06)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[91], line 2
      1 for param in base_model.parameters():
----> 2     print((param.grad.data).cpu().sum())

AttributeError: 'NoneType' object has no attribute 'data'

Some of the model parameters are none. I think this is the root cause of the issue.

My code for one manual iteration is given below

num_classes = 2
num_epochs = 40
learning_rate = 1e-4
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(base_model.parameters(), lr=learning_rate)

#taking only the first batch
for batch in dataloaders['train']:
    batch = {k: v.to(dml) for k, v in batch.items()}
    break

#forward pass
base_model.train()
outputs = base_model(**batch)
labels = batch['Type']

#backward pass
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

#to see any parameters get updated
for param in base_model.parameters():
    print((param.grad.data).cpu().sum()) 

Can you please tell me why this happens when I run the model on the GPU?

Hi Atiqur!

This is not quite correct. Your model parameter, param is not None; rather,
its .grad property is None.

It appears that the backward pass is not populating the grads of your
parameters. Try using just a single Linear as a simple model, and run a
single forward and backward pass on your gpu. Do the Linear’s grads
get populated? (Make sure that they do get populated when you run the
same experiment on your cpu.)
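
As a rough sketch of that single-Linear check: the helper missing_grads below is a hypothetical name, and the layer size and input are arbitrary. It lists the parameters whose .grad is None instead of crashing on the first one, and it falls back to the cpu when torch_directml is not installed.

```python
import torch
import torch.nn as nn

# Assumption: DirectML device if available, cpu otherwise.
try:
    import torch_directml
    device = torch_directml.device()
except ImportError:
    device = torch.device("cpu")


def missing_grads(model):
    """Names of trainable parameters whose .grad is still None (hypothetical helper)."""
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]


torch.manual_seed(0)
layer = nn.Linear(10, 2).to(device)
x = torch.randn(4, 10).to(device)

before = missing_grads(layer)  # nothing has been backpropagated yet
layer(x).sum().backward()      # one forward and one backward pass
after = missing_grads(layer)   # should be empty if backward works

print("before backward:", before)
print("after backward: ", after)
```

The same helper can be called on a larger model after loss.backward() to see exactly which layers were skipped.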

(As an aside, don’t use .data. It is deprecated and can cause errors. Also,
please don’t post screenshots of textual data – doing so breaks accessibility,
searchability, and copy-paste.)

What happens when you run the exact same code on your friend’s nvidia
gpu? Do the grads get populated on your friend’s gpu or do you get the
same error?

Best.

K. Frank

When I run it on my friend’s GPU it is totally fine, and when I run it on my CPU it is also fine: the grads are populated. The problem arises with my GPU, so I suspect DirectML is doing something wrong. And yes, I removed .data. param.grad is None when I run the code on my GPU, but on my CPU all the param grads are populated.

tensor(-4.5475e-13)
tensor(4.5475e-13)
tensor(-4.5475e-13)
tensor(-1.8066e-06)
tensor(-1.9463e-06)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[49], line 2
      1 for param in base_model.parameters():
----> 2     print((param.grad).cpu().sum()) 

AttributeError: 'NoneType' object has no attribute 'cpu'

Hi Atiqur!

This does sound like there is something wrong with Directml and your
amd gpu that is causing backpropagation to fail.

I would imagine that the Directml folks would appreciate a bug report
about this. The best thing would be to continue debugging to find the
smallest, simplest example that triggers the bug. Then post a minimal,
fully-self-contained, runnable script that reproduces the issue, together
with the output you get.

Be sure to note the python, pytorch, and Directml versions you are running
with.
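
One possible way to collect those versions for the report is sketched here. It assumes the pip distribution names are torch and torch-directml, and package_version is a hypothetical helper.

```python
import platform
from importlib import metadata


def package_version(name):
    """Installed version of a pip distribution, or a placeholder if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


# Assumption: these are the distribution names as installed with pip.
versions = {
    "python": platform.python_version(),
    "torch": package_version("torch"),
    "torch-directml": package_version("torch-directml"),
}

for name, version in versions.items():
    print(f"{name}: {version}")
```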

Good luck.

K. Frank

My python version is 3.10.8, PyTorch version is 1.13.1 and torch directml version is 0.1.13.dev221216