Segfault when pushing data to the device

Hi,

I’ve recently got ROCm working with PyTorch. My CNN model previously trained fine on the CPU, but when I switch training to CUDA I get a segfault at the line below. Moving the model to the device causes no issues; the crash only happens when I push the dataset tensors to the device.

    for images, labels in train_dataloader:
        images = images.to(device)
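
For reference, a minimal check along these lines (just a plain tensor allocation and a tiny kernel launch on the device, nothing from my model or dataset) should show whether basic GPU operations work at all; `"cuda"` is the device name ROCm builds of PyTorch expose.

    import torch

    device = torch.device("cuda")         # ROCm builds expose the GPU through the CUDA API
    x = torch.ones(8, device=device)      # allocate directly on the GPU
    y = (x * 2).sum()                     # launch a couple of simple kernels
    torch.cuda.synchronize()              # force any asynchronous error to surface here
    print(y.item())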

My dataset class:

    import os

    import torch
    from PIL import Image
    from torch.utils.data import Dataset


    class CatDogDataset(Dataset):
        def __init__(self, root_dir, transform=None):
            self.root_dir = root_dir
            self.transform = transform
            self.file_list = os.listdir(root_dir)

        def __len__(self):
            return len(self.file_list)

        def __getitem__(self, idx):
            img_name = self.file_list[idx]
            img_path = os.path.join(self.root_dir, img_name)
            image = Image.open(img_path).convert('RGB')

            if self.transform is not None:
                image = self.transform(image)

            label = img_name.split('.')[0]  # filename prefix, e.g. "dog.1234.jpg" -> "dog"
            if label == "dog":
                label = torch.tensor([1, 0])  # one-hot encode as [1, 0] for dog
            elif label == "cat":
                label = torch.tensor([0, 1])  # one-hot encode as [0, 1] for cat

            try:
                return image, label.float()
            except Exception as e:
                # if the filename prefix is neither "dog" nor "cat", label is still a
                # string and .float() fails; drop into the debugger to inspect it
                print(e)
                print(img_name, label)
                breakpoint()
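
To rule out the dataset itself, a quick sanity check along these lines (the path and transform are placeholders for whatever the training script actually uses) loads one sample on the CPU and then pushes just that sample to the device, bypassing the DataLoader:

    import torch
    from torchvision import transforms

    transform = transforms.Compose([
        transforms.Resize((224, 224)),    # placeholder transform
        transforms.ToTensor(),
    ])
    dataset = CatDogDataset("path/to/train", transform=transform)  # placeholder path

    image, label = dataset[0]
    print(image.shape, image.dtype, label)   # expect a float image tensor and a 2-element label

    device = torch.device("cuda")
    image = image.unsqueeze(0).to(device)    # push a single sample, no DataLoader involved
    label = label.to(device)
    print(image.device, label.device)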

My environment:

    {
        "torch_version": "2.1.0.dev20230717+rocm5.5",
        "cuda_available": true,
        "cuda_version": null,
        "Number of devices": 2,
        "Name of device": "Radeon RX 7900 XTX"
    }
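
(For reference, those fields can be gathered with standard torch calls roughly like this; a sketch, not necessarily the exact script I used.)

    import torch

    print(torch.__version__)               # torch_version
    print(torch.cuda.is_available())       # cuda_available
    print(torch.version.cuda)              # cuda_version (None on ROCm builds)
    print(torch.cuda.device_count())       # Number of devices
    print(torch.cuda.get_device_name(0))   # Name of device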

The backtrace from gdb is as follows:

    0x00007fffa82b0527 in hip::FatBinaryInfo::AddDevProgram(int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    (gdb) bt
    #0  0x00007fffa82b0527 in hip::FatBinaryInfo::AddDevProgram(int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #1  0x00007fffa82b0780 in hip::FatBinaryInfo::BuildProgram(int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #2  0x00007fffa82b3ade in hip::Function::getStatFunc(ihipModuleSymbol_t**, int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #3  0x00007fffa826e37b in hip::StatCO::getStatFunc(ihipModuleSymbol_t**, void const*, int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #4  0x00007fffa83d6708 in ihipLaunchKernel(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, int) ()
       from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #5  0x00007fffa83af5a2 in hipLaunchKernel_common () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #6  0x00007fffa83bde12 in hipLaunchKernel () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
    #7  0x00007fffaa09b28f in void at::native::gpu_kernel_impl<at::native::CUDAFunctor_add<float> >(at::TensorIteratorBase&, at::native::CUDAFunctor_add<float> const&) ()
       from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
    #8  0x00007fffaa084e96 in at::native::add_kernel(at::TensorIteratorBase&, c10::Scalar const&) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
    #9  0x00007fffab4024c0 in at::(anonymous namespace)::wrapper_CUDA_add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar const&) ()
       from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
    #10 0x00007fffdf5f1ffe in at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) ()
       from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
    #11 0x00007fffab25c02f in at::native::miopen_convolution_add_bias_(char const*, at::TensorArg const&, at::TensorArg const&) ()
       from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
    #12 0x00007fffab25d122 in at::native::miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
    #13 0x00007fffab3af7a2 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
    #14 0x00007fffab3af8b1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&,

Any recommendations on how I can debug this further? I am not very experienced with gdb and would like to learn the industry-standard approach for debugging issues like this.