PyTorch execution slower on A100 than on V100

Hey there!

I have been using V100s to run my code, which executes pretty fast; however, executing the same code using A100s is up to 3x slower. Any pointers on how to make my code execution faster on A100?

System configuration using V100:
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0
PyTorch version: 1.10.1

System configuration using A100:
Cuda compilation tools, release 10.1, V10.1.243
PyTorch version: 1.9.1+cu111

Could you post a minimal, executable code snippet showing this slowdown, please?

Thanks for the update. Could you make the code snippet executable and report your performance numbers showing the slowdown, so that we could try to reproduce and debug it?

Solved! There was some issue with the Python version. I upgraded from 3.6 to 3.7 and it now works as expected.

Good to hear you’ve solved it.
Also, in case you run into more issues: your A100 needs CUDA 11.x, so this setup also looks incorrect:

System configuration using A100:
Cuda compilation tools, release 10.1, V10.1.243
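
A rough way to check this from inside Python is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below assumes a CUDA-enabled PyTorch install. `arch_supported` and `describe_setup` are hypothetical helper names, not part of PyTorch, and the exact-match test is a simplification that ignores PTX fallbacks. Keep in mind that nvcc reports the locally installed toolkit, while torch.version.cuda reports the CUDA runtime the PyTorch build itself uses, and the two can differ.

```python
def arch_supported(capability, arch_list):
    """Return True if the build lists native kernels for this compute capability.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100 or (7, 0) for a V100.
    arch_list:  strings such as "sm_70" or "sm_80".
    Exact match only; this simplification ignores PTX/JIT fallbacks.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_setup():
    """Print what the installed PyTorch build reports about CUDA and the GPU."""
    import torch  # imported here so arch_supported() stays usable without PyTorch

    print("PyTorch:", torch.__version__)
    print("CUDA runtime used by this build:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No usable CUDA device found")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("Device:", torch.cuda.get_device_name(0), "capability", capability)
    print("Architectures in this build:", arch_list)
    print("Native kernels for this GPU:", arch_supported(capability, arch_list))
```

Running describe_setup() in each environment and comparing the output should show whether one of them lacks sm_80 support.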

Thanks for the suggestion. However, I am still facing some issues. I use TransformerEncoderLayer in my model, which throws an error if I train using Mixed Precision, but it works fine when using float32 precision.

Error:
Traceback (most recent call last):
  File "main.py", line 91, in <module>
    trainer.fit(train_module, train_dataloader, valid_dataloader)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 697, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _run_train
    self._run_sanity_check()
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_sanity_check
    val_loop.run()
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 370, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/AFP2/src/train/trainer.py", line 96, in validation_step
    return self.step(batch, mode="valid")
  File "/nlsasfs/home/nltm-st/vipular/AFP2/src/train/trainer.py", line 50, in step
    cts_anc_emb, cts_pos_emb, dis_anc_emb, dis_pos_emb = self(anc, pos)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/AFP2/src/train/trainer.py", line 41, in forward
    cts_anc_emb = self.encoder(anc)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/AFP2/src/models/encoder.py", line 91, in forward
    context_emb = self.encoder(pos_enc)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 238, in forward
    output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nlsasfs/home/nltm-st/vipular/anaconda3/envs/dum/lib/python3.7/site-packages/torch/nn/modules/transformer.py", line 456, in forward
    src_mask if src_mask is not None else src_key_padding_mask, # TODO: split into two args
RuntimeError: expected scalar type Half but found Float

class Encoder(nn.Module):

    def __init__(self, inp_dims,patch_size, nhead, dim_feedforward, num_layers, concat_position=False):
        super().__init__()
       
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=inp_dims, nhead=nhead, dim_feedforward=dim_feedforward, dropout=0.1, layer_norm_eps=1e-05, batch_first=True)
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.patch_embedding_layer = PatchEmbedding_Layer(patch_size, inp_dims, concat_position)
        
    def forward(self,x):
  
        pos_enc = self.patch_embedding_layer(x)
        context_emb = self.encoder(pos_enc)
        return context_emb 

Compute Environment:

  1. Cuda compilation tools, release 11.3, V11.3.58
    Build cuda_11.3.r11.3/compiler.29745058_0

  2. Torch version: 1.12.1

Again, the same request: could you post a minimal, executable code snippet to reproduce the issues, please?

Sorry for not sharing the code snippet! You can run the code below. I ran it on both an A100 and a V100 and found it to be 2x slower on the A100. This is unrelated to the Mixed Precision problem I described before.

I created two separate virtual envs on the A100 machine. Both have the same PyTorch version (1.12.1) and CUDA version (11.3); however, one of them is far slower (~40s) than the other (2.3s). This is bizarre! On the V100 it takes ~0.9s.

import torch
import torch.nn as nn
import time
import numpy as np
import torch.nn.functional as F

class PatchEmbedding_Layer(nn.Module):
    def __init__(self, patch_size, emb_dims, concat_position=False):
        
        super().__init__()
        self.patch_size = patch_size
        self.concat_position = concat_position
        self.layer = nn.Conv2d(1, emb_dims, kernel_size=patch_size, stride=patch_size)
    
    def get_positional_encoding(self, n_positions, n_dims, n=10000):
        sin_part = np.sin(np.arange(n_positions).reshape(-1,1)/(n**(2*(np.arange(0,n_dims/2).reshape(1,-1))/n_dims)), dtype=np.float32)
        cos_part = np.cos(np.arange(n_positions).reshape(-1,1)/(n**(2*(np.arange(0,np.floor(n_dims/2)).reshape(1,-1))/n_dims)), dtype=np.float32)
        pos_matrix = np.empty((1, n_positions,n_dims), dtype=np.float32)
        pos_matrix[:,:,np.arange(0,n_dims,2)] = sin_part
        pos_matrix[:,:,np.arange(1,n_dims,2)] = cos_part
        return torch.from_numpy(pos_matrix).cuda()

    def forward(self, image):
        if image.dtype != torch.float32:
            raise TypeError(f"Input of type torch.float32 expected. Type {image.dtype} is passed") 
        
        if image.shape[-1] % self.patch_size[1] > 0:
            pad_length = self.patch_size[1]*np.ceil(image.shape[-1]/self.patch_size[1]) - image.shape[-1]
            image = F.pad(image, (0,int(pad_length)))

        x = self.layer(image)
        x = x.view(x.shape[0], -1, x.shape[1])
        pos_enc = self.get_positional_encoding(x.shape[1], x.shape[2])
        
        if self.concat_position:
            out = torch.cat((x, pos_enc.expand(x.shape[0],-1,-1)), dim=-1)
        else:
            out = x + pos_enc
        return out


class Encoder(nn.Module):
   
    def __init__(self, inp_dims,patch_size, nhead, dim_feedforward, num_layers, concat_position=False):
        super().__init__()
        
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=inp_dims, nhead=nhead, dim_feedforward=dim_feedforward, dropout=0.1, layer_norm_eps=1e-05, batch_first=True)
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.patch_embedding_layer = PatchEmbedding_Layer(patch_size, inp_dims, concat_position)
        
    def forward(self,x):
        pos_enc = self.patch_embedding_layer(x)
        context_emb = self.encoder(pos_enc)
        return context_emb 


model = Encoder(128, (64,10), 8, 2048,8).cuda()
s = time.time()
for i in range(100):
    x = torch.rand((128,1,64,100), device=torch.device('cuda'))
    model(x)
print(time.time()-s)

I cannot reproduce the issue: on an A100 node and a V100 node I see runtimes of ~0.63s and ~0.70s, respectively. Note that you cannot measure GPU runtime with host timers unless you synchronize, since CUDA ops are executed asynchronously, so I’ve added torch.cuda.synchronize() calls before starting and stopping the timers.
However, in your use case this shouldn’t make a huge difference, since your code already synchronizes implicitly, e.g. through the NumPy ops in the forward pass.
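
For reference, a minimal sketch of that timing pattern. `time_gpu_loop` is a hypothetical helper name. The `synchronize` argument is passed in as a callable (torch.cuda.synchronize in practice), and the warm-up iterations are an extra assumption not mentioned in this thread.

```python
import time


def time_gpu_loop(step, synchronize, iters=100, warmup=10):
    """Time `iters` calls of `step`, synchronizing before starting and stopping the clock.

    step:        zero-argument callable that launches the GPU work.
    synchronize: zero-argument callable that blocks until queued GPU work is done,
                 e.g. torch.cuda.synchronize.
    """
    for _ in range(warmup):  # warm-up runs keep one-time startup costs out of the timing
        step()
    synchronize()  # make sure nothing is still running when the clock starts
    start = time.perf_counter()
    for _ in range(iters):
        step()
    synchronize()  # wait for the queued kernels before reading the clock
    return time.perf_counter() - start


# Usage with the script above (needs a CUDA device):
#
#   x = torch.rand((128, 1, 64, 100), device="cuda")
#   elapsed = time_gpu_loop(lambda: model(x), torch.cuda.synchronize)
#   print(f"{elapsed:.3f}s for 100 iterations")
```

Unlike the posted loop, this usage creates the input tensor once, outside the timed region, so only the model call is measured.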

Are you actually compiling PyTorch for the right CUDA compute capability?

Did you enable TF32?

Another issue is that you are not calling torch.cuda.synchronize(). Timings are not valid unless you ensure the work on the GPU has completed.

Enabling or disabling TF32 won’t change the runtime from ~2s to 40s (see my last post with proper profile timings), so I still assume it’s a setup issue as already mentioned.
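
To rule TF32 out yourself, you can time the same workload with the two TF32 switches turned on and then off. The sketch below assumes a PyTorch version that exposes torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32. The `tf32` context manager is a hypothetical helper, not a PyTorch API.

```python
import contextlib


@contextlib.contextmanager
def tf32(enabled):
    """Temporarily set both TF32 switches and restore the previous values on exit."""
    import torch  # imported lazily so this module loads even without PyTorch

    previous = (
        torch.backends.cuda.matmul.allow_tf32,
        torch.backends.cudnn.allow_tf32,
    )
    torch.backends.cuda.matmul.allow_tf32 = enabled  # matmuls via cuBLAS
    torch.backends.cudnn.allow_tf32 = enabled        # convolutions via cuDNN
    try:
        yield
    finally:
        (
            torch.backends.cuda.matmul.allow_tf32,
            torch.backends.cudnn.allow_tf32,
        ) = previous


# Usage (needs a CUDA device; TF32 only takes effect on Ampere or newer GPUs such as the A100):
#
#   with tf32(True):
#       ...  # run and time the model with synchronization
#   with tf32(False):
#       ...  # run and time it again
```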

Problem solved! There were indeed some CUDA compatibility issues. A100 is almost 3x faster than V100 in my case.