CUDNN_STATUS_NOT_INITIALIZED with distributed training

Hi. I am running my code on four L4 GPUs on GCP with distributed training, using PyTorch 1.8 and CUDA 11.8.
The following code gives:
# Tubelet/patch projection: a Conv3d whose kernel and stride both equal the
# (temporal, height, width) patch size, so each patch maps to one embedding
self.proj = nn.Conv3d(
    in_channels=in_chans,
    out_channels=embed_dim,
    kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
    stride=(self.tubelet_size, patch_size[0], patch_size[1]))
# (B, C, T, H, W) -> (B, embed_dim, T', H', W') -> (B, num_patches, embed_dim)
x = self.proj(x).flatten(2).transpose(1, 2)

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The same code works fine if I don’t use distributed training. Please let me know what could be the issue.
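
For context, here is a minimal standalone repro of my setup, assuming a standard DistributedDataParallel launch. The PatchEmbed3D module, its default sizes, and the torchrun launch command are illustrative sketches, not my exact training script:

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class PatchEmbed3D(nn.Module):
    # Illustrative video patch embedding built around the Conv3d shown above
    def __init__(self, in_chans=3, embed_dim=768, patch_size=(16, 16), tubelet_size=2):
        super().__init__()
        self.tubelet_size = tubelet_size
        self.proj = nn.Conv3d(
            in_channels=in_chans,
            out_channels=embed_dim,
            kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
            stride=(self.tubelet_size, patch_size[0], patch_size[1]))

    def forward(self, x):
        # (B, C, T, H, W) -> (B, num_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

def main():
    # Launched with: torchrun --nproc_per_node=4 repro.py
    # (torch.distributed.launch on older releases such as 1.8)
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank()  # single node, so rank == local rank
    torch.cuda.set_device(local_rank)
    model = DDP(PatchEmbed3D().cuda(local_rank), device_ids=[local_rank])
    x = torch.randn(2, 3, 16, 224, 224, device=f"cuda:{local_rank}")
    out = model(x)  # the cuDNN error is raised inside this Conv3d call
    print(local_rank, out.shape)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()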

Could you update PyTorch to the latest stable release (2.1.2+cu121) and check if you still run into this issue?
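
If it helps, you can confirm which build you actually have with the standard version attributes:

import torch

print(torch.__version__)                # e.g. "2.1.2+cu121"
print(torch.version.cuda)               # CUDA version the binaries were built with
print(torch.backends.cudnn.version())   # bundled cuDNN version, e.g. 8902
print(torch.cuda.is_available())        # True if the GPUs are visible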

Thanks. It worked with the latest version, but I am not sure why it does not work with the older one.