Hi,
I recently got ROCm working with PyTorch. My CNN previously trained on the CPU, but after switching to the CUDA device I get a segmentation fault at the following line. Moving the model to the device works without issues; the crash only occurs when I move batches from the dataset to the device.
for images, labels in train_dataloader:
    images = images.to(device)
My dataset class:
import os

import torch
from PIL import Image
from torch.utils.data import Dataset


class CatDogDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        self.file_list = os.listdir(root_dir)

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        img_name = self.file_list[idx]
        img_path = os.path.join(self.root_dir, img_name)
        image = Image.open(img_path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        # The class name is the part of the file name before the first '.'
        label = img_name.split('.')[0]
        if label == "dog":
            label = torch.tensor([1, 0])  # One-hot encode as [1, 0] for dog
        elif label == "cat":
            label = torch.tensor([0, 1])  # One-hot encode as [0, 1] for cat
        try:
            return image, label.float()
        except Exception as e:
            # Reached when the file name is neither dog.* nor cat.*,
            # because label is then still a string without .float()
            print(e)
            print(img_name, label)
            breakpoint()
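To make the label logic above easier to check without a GPU, here is a minimal sketch of the same file-name parsing using plain lists instead of tensors. The helper name label_from_filename and the sample file names (dog.0.jpg, cat.12.jpg) are assumptions for illustration only; they are not part of my actual code or data.

```python
# Hypothetical helper: mirrors the label logic of __getitem__ with plain
# Python lists so it can be run without torch or a GPU.
def label_from_filename(img_name):
    """Return a one-hot list for 'dog.*' / 'cat.*' file names, else None."""
    prefix = img_name.split('.')[0]
    if prefix == "dog":
        return [1.0, 0.0]
    if prefix == "cat":
        return [0.0, 1.0]
    # Any other prefix leaves the label unresolved; in the dataset class
    # this is the case where label.float() fails on a string.
    return None


# Assumed example file names, for illustration only.
for name in ("dog.0.jpg", "cat.12.jpg", "notes.txt"):
    print(name, label_from_filename(name))
```

Running this over os.listdir(root_dir) would show quickly whether any file in the folder falls outside the two expected prefixes.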
My environment:
{
"torch_version": "2.1.0.dev20230717+rocm5.5",
"cuda_available": true,
"cuda_version": null,
"Number of devices": 2,
"Name of device": "Radeon RX 7900 XTX"
}
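For anyone who wants to compare against their own setup, a summary like the one above can be produced roughly as follows. This is a sketch, not necessarily the exact script I used: the function name collect_env is hypothetical, and the extra hip_version key is my addition, on the assumption that torch.version.hip is populated on ROCm builds.

```python
import json


def collect_env():
    """Collect basic PyTorch build/device info; returns {} if torch is missing."""
    try:
        import torch
    except ImportError:
        return {}
    info = {
        "torch_version": str(torch.__version__),
        "cuda_available": torch.cuda.is_available(),
        "cuda_version": torch.version.cuda,  # null in the output above
        # Assumption: this attribute is set on ROCm builds, None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "Number of devices": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        info["Name of device"] = torch.cuda.get_device_name(0)
    return info


print(json.dumps(collect_env(), indent=2))
```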
The backtrace from gdb is as follows:
0x00007fffa82b0527 in hip::FatBinaryInfo::AddDevProgram(int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
(gdb) bt
#0 0x00007fffa82b0527 in hip::FatBinaryInfo::AddDevProgram(int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#1 0x00007fffa82b0780 in hip::FatBinaryInfo::BuildProgram(int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#2 0x00007fffa82b3ade in hip::Function::getStatFunc(ihipModuleSymbol_t**, int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#3 0x00007fffa826e37b in hip::StatCO::getStatFunc(ihipModuleSymbol_t**, void const*, int) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#4 0x00007fffa83d6708 in ihipLaunchKernel(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, int) ()
from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#5 0x00007fffa83af5a2 in hipLaunchKernel_common () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#6 0x00007fffa83bde12 in hipLaunchKernel () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libamdhip64.so
#7 0x00007fffaa09b28f in void at::native::gpu_kernel_impl<at::native::CUDAFunctor_add<float> >(at::TensorIteratorBase&, at::native::CUDAFunctor_add<float> const&) ()
from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#8 0x00007fffaa084e96 in at::native::add_kernel(at::TensorIteratorBase&, c10::Scalar const&) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#9 0x00007fffab4024c0 in at::(anonymous namespace)::wrapper_CUDA_add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar const&) ()
from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#10 0x00007fffdf5f1ffe in at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) ()
from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007fffab25c02f in at::native::miopen_convolution_add_bias_(char const*, at::TensorArg const&, at::TensorArg const&) ()
from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#12 0x00007fffab25d122 in at::native::miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#13 0x00007fffab3af7a2 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /home/muro/Desktop/ML-Stuff/venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#14 0x00007fffab3af8b1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&,
Any recommendations on how I can debug this further? I am not very experienced with gdb and would like to learn the standard approach to debugging issues like this.
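Since the backtrace passes through miopen_convolution, it may also help to confirm which Python line actually triggers the crash. One thing I can try is Python's built-in faulthandler module, which prints the Python-level traceback when the process receives a fatal signal such as SIGSEGV. Below is a minimal sketch; the run_step helper and the placeholder step are hypothetical and only show where the real calls (images.to(device), the forward pass, and so on) would go.

```python
import faulthandler
import sys

# Dump the Python traceback to stderr if the process receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL).
faulthandler.enable()


def run_step(name, fn):
    """Hypothetical helper: announce a step before running it, so the last
    line printed before a crash names the step that crashed."""
    print(f"[step] {name}", file=sys.stderr, flush=True)
    return fn()


# Placeholder step for illustration; in the training loop the lambda would
# wrap the real call, e.g. lambda: images.to(device) or lambda: model(images).
result = run_step("sanity check", lambda: 1 + 1)
```

The same handler can be enabled without touching the script by running it with python -X faulthandler or by setting the PYTHONFAULTHANDLER environment variable.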