Using PyTorch 2.0.0a0+gitfd3a726, parallel training with four RTX 4090 cards on an AMD 5975WX does not work; it gets stuck at the very beginning

I am using an AMD 5975WX CPU and four RTX 4090 graphics cards. The CUDA version is CUDA 12 and the PyTorch version is 2.0.
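For completeness, a minimal snippet along these lines (not part of the original report) can confirm the versions and the devices PyTorch actually sees; torch.__version__, torch.version.cuda, torch.cuda.device_count(), and torch.cuda.get_device_name() are standard PyTorch attributes and functions:

import torch

# Illustrative environment check (assumed helper, not from the original post)
print("PyTorch:", torch.__version__)              # e.g. 2.0.0a0+gitfd3a726
print("CUDA (build):", torch.version.cuda)        # CUDA version PyTorch was built against
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print("  device", i, torch.cuda.get_device_name(i))

The backtrace of the hung process, captured by attaching gdb to it, is below.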

sudo gdb python 23557
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
python: No such file or directory.
Attaching to process 23557
(gdb) bt
#0  0x00007ff8cda80680 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ff8cddd71ef in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ff8cddd9aef in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ff8cda8ef9f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ff8cdcabd7c in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007ff8cddd01c9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007ff8cda4f833 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007ff8cda4fd41 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007ff8cda508c8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007ff8cdc1b381 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007ff874a42f49 in ?? ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libcudart-e409450e.so.11.0
#11 0x00007ff874a16e2d in ?? ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libcudart-e409450e.so.11.0
#12 0x00007ff874a67875 in cudaMemcpyAsync ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libcudart-e409450e.so.11.0
#13 0x00007ff87635ef28 in at::native::copy_kernel_cuda(at::TensorIterator&, bool) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so
#14 0x00007ff8a2e3bff9 in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007ff8a2e3d472 in at::native::copy_(at::Tensor&, at::Tensor const&, bool) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007ff8a39d443f in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#17 0x00007ff8a312cf48 in at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#18 0x00007ff8a3d1887b in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper___to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007ff8a3566165 in at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)
    () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007ff8a3b5ddd3 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional
<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007ff8a3566165 in at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007ff8a4f20b1b in torch::autograd::VariableType::(anonymous namespace)::_to_copy(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007ff8a4f20f8e in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>), &torch::autograd::VariableType::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007ff8a35e66c9 in at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007ff8a312546b in at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007ff8a3ed6c81 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_dtype_layout_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007ff8a374d81e in at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) ()
   from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007ff8cca7a318 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool, c10::optional<c10::MemoryFormat>) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so
#29 0x00007ff8cca7b780 in torch::autograd::THPVariable_cuda(_object*, _object*, _object*) () from /home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so
#30 0x0000555b32cc8b9c in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /opt/conda/conda-bld/python-split_1649141344976/work/Objects/descrobject.c:348
#31 0x0000555b32c0872f in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x555b34d983e0, callable=0x7ff8e3fc1db0, tstate=<optimized out>)
    at /opt/conda/conda-bld/python-split_1649141344976/work/Include/cpython/abstract.h:118
#32 PyObject_Vectorcall () at /opt/conda/conda-bld/python-split_1649141344976/work/Include/cpython/abstract.h:127
#33 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x555b34d3dc00) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:5077
#34 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x555b34d98260, throwflag=<optimized out>) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:3506
#35 0x0000555b32c9f663 in _PyEval_EvalFrame () at /opt/conda/conda-bld/python-split_1649141344976/work/Include/internal/pycore_ceval.h:40
#36 _PyEval_EvalCode (tstate=<optimized out>, _co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>, kwstep=2, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0, name=<optimized out>, qualname=0x0) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4329
#37 0x0000555b32d4c45c in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=0, kwargs=<optimized out>, kwnames=<optimized out>, argcount=<optimized out>,
    args=<optimized out>, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4361
#38 PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0)
    at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:4377
#39 0x0000555b32ca045b in PyEval_EvalCode (co=co@entry=0x7ff98e9c6660, globals=globals@entry=0x7ff98ea68d40, locals=locals@entry=0x7ff98ea68d40) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/ceval.c:828
#40 0x0000555b32d4c50b in run_eval_code_obj (tstate=0x555b34d3dc00, co=0x7ff98e9c6660, globals=0x7ff98ea68d40, locals=0x7ff98ea68d40) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:1221
#41 0x0000555b32d7cf75 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ff98ea68d40, locals=0x7ff98ea68d40, flags=<optimized out>, arena=<optimized out>)
    at /opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:1242
#42 0x0000555b32c1d987 in pyrun_file (fp=0x555b34d39340, filename=0x7ff98e9a4750, start=<optimized out>, globals=0x7ff98ea68d40, locals=0x7ff98ea68d40, closeit=1, flags=0x7ffc08274e28)
    at /opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:1140
#43 0x0000555b32d82a2f in pyrun_simple_file (flags=0x7ffc08274e28, closeit=1, filename=0x7ff98e9a4750, fp=0x555b34d39340) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:450
#44 PyRun_SimpleFileExFlags (fp=0x555b34d39340, filename=<optimized out>, closeit=1, flags=0x7ffc08274e28) at /opt/conda/conda-bld/python-split_1649141344976/work/Python/pythonrun.c:483
#45 0x0000555b32d8310b in pymain_run_file (cf=0x7ffc08274e28, config=0x555b34d3c3b0) at /opt/conda/conda-bld/python-split_1649141344976/work/Modules/main.c:379
#46 pymain_run_python (exitcode=0x7ffc08274e20) at /opt/conda/conda-bld/python-split_1649141344976/work/Modules/main.c:604
#47 Py_RunMain () at /opt/conda/conda-bld/python-split_1649141344976/work/Modules/main.c:683
#48 0x0000555b32d83309 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /opt/conda/conda-bld/python-split_1649141344976/work/Modules/main.c:1129
#49 0x00007ff98f9a2d90 in __libc_start_call_main (main=main@entry=0x555b32c2a4a0 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffc08275048) at ../sysdeps/nptl/libc_start_call_main.h:58
#50 0x00007ff98f9a2e40 in __libc_start_main_impl (main=0x555b32c2a4a0 <main>, argc=2, argv=0x7ffc08275048, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc08275038) at ../csu/libc-start.c:392
#51 0x0000555b32d0a0a0 in _start ()
(gdb)

See the attachment for the CUDA log; my code is included below as well. The backtrace above shows the process stuck inside cudaMemcpyAsync, reached from the .cuda() call in the training loop.

import numpy as np
import os
import time
import PIL.Image
import torch
from torch import Tensor
from torch.utils.data import Dataset, DataLoader
import torchvision
from torchvision import transforms
from torchvision.models.resnet import ResNet, BasicBlock

batch_size = 32
max_number_of_epoch = 4
LR = 0.1
image_size = 224
number_of_classes = 10
clipping_value = 512

np.random.seed(2)
torch.manual_seed(2)

n_cpu = int(os.cpu_count()*0.5)
n_cpu = 8

class Average_Meter(object):
    """Computes and stores the average and current value"""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0
        self.avg = 0.0
        self.sum = 0.0
        self.count = 0

    def update(self, val, n):
        if n > 0:
            self.val = val
            self.sum += val * n
            self.count += n
            self.avg = self.sum / self.count


class Sum_Meter(object):
    """Computes and stores the sum and current value"""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0
        self.avg = 0.0
        self.sum = 0.0
        self.count = 0

    def update(self, val, n):
        if n > 0:
            self.val = val
            self.sum += val
            self.count += n
            self.avg = self.sum / self.count


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(1.0 / batch_size))
        return res


transform = transforms.Compose([
    transforms.Resize(size=(image_size, image_size),
                      interpolation=PIL.Image.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])])

trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True,  download=False, transform=transform)
train_loader = torch.utils.data.DataLoader(
    trainset, batch_size=batch_size, shuffle=True, num_workers=n_cpu)

testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=False, transform=transform)
val_loader = torch.utils.data.DataLoader(
    testset, batch_size=batch_size, shuffle=False, num_workers=n_cpu)


class MyResNet18(ResNet):
    def __init__(self, num_classes):
        super(MyResNet18, self).__init__(BasicBlock,
                                         [2, 2, 2, 2], num_classes=num_classes)

    def _forward_impl(self, x: Tensor) -> Tensor:
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        FV = torch.flatten(x, 1)
        Logit = self.fc(FV)
        return FV, Logit


model = MyResNet18(num_classes=number_of_classes)

best_val = 0.0

optimizer = torch.optim.Adam(model.parameters(), lr=LR)

device = torch.device('cuda')

flag_multi_gpu = True

if flag_multi_gpu and torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = torch.nn.DataParallel(model)
model.to(device)

loss_CE = torch.nn.CrossEntropyLoss(reduction='sum').cuda()

torch.backends.cudnn.benchmark = True

losses_train_total = Sum_Meter()
top1_train = Average_Meter()
top5_train = Average_Meter()
losses_val_total = Sum_Meter()
top1_val = Average_Meter()
top5_val = Average_Meter()

for epoch in range(max_number_of_epoch):
    print("\nepoch = ", epoch + 1)
    losses_train_total.reset()
    top1_train.reset()
    top5_train.reset()
    losses_val_total.reset()
    top1_val.reset()
    top5_val.reset()
    t1 = time.time()
    model.train()
    for i, (x, y) in enumerate(train_loader, 0):
        print("iter %05d: -------DEBUG0------" % i)
        optimizer.zero_grad()
        print("iter %05d: -------DEBUG1------" % i)
        x = x.cuda()
        print("iter %05d: -------DEBUG2------" % i)
        y = y.detach().clone().long().cuda()
        print("iter %05d: -------DEBUG3------" % i)
        FV, Logit = model(x)
        print("iter %05d: -------DEBUG4------" % i)
        prec1, prec5 = accuracy(Logit.data, y, topk=(1, 5))
        loss = loss_CE(Logit, y)
        losses_train_total.update(loss.item(), y.size(0))
        top1_train.update(prec1.item(), y.size(0))
        top5_train.update(prec5.item(), y.size(0))
        assert not torch.isnan(loss)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
        optimizer.step()
        # print("iter %05d: -------DEBUG6------" % i)
    model.eval()
    # print("iter %05d: -------DEBUG4------" % i)
    with torch.no_grad():
        for i, (x, y) in enumerate(val_loader, 0):
            x = x.cuda()
            y = y.detach().clone().long().cuda()
            # print("iter %05d: -------DEBUG5------" % i)
            FV, Logit = model(x)
            # print("iter %05d: -------DEBUG6------" % i)
            prec1, prec5 = accuracy(Logit.data, y, topk=(1, 5))
            loss = loss_CE(Logit, y)
            # print("iter %05d: -------DEBUG7------" % i)
            losses_val_total.update(loss.item(), y.size(0))
            top1_val.update(prec1.item(), y.size(0))
            top5_val.update(prec5.item(), y.size(0))
            # print("iter %05d: -------DEBUG8------" % i)
    t2 = time.time()
    print("iter %05d: -------DEBUG9------" % i)
    print('train average_loss_total', losses_train_total.avg)
    print('train top1 accuracy', top1_train.avg)
    print('train top5 accuracy ', top5_train.avg)
    print('validation average_loss_total', losses_val_total.avg)
    print('validation top1 accuracy', top1_val.avg)
    print('validation top5 accuracy ', top5_val.avg)
    print("epoch time = ", t2-t1)
    if top1_val.avg > best_val:
        best_val = top1_val.avg
        print("model saved with vallidation top-1 accuracy  =  ", best_val)
        torch.save(model.state_dict(),
                   f"resnet_18_cifar_10_vall_acc_{best_val}.pth.tar")

print('Finished Training')

Some information I have found suggests that this problem does not occur with Intel CPUs. The solution given in this issue is to turn off IOMMU in the motherboard BIOS, but I need IOMMU to stay enabled, so I would like to know whether there are other solutions.
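For reference, below is a rough sketch (an assumption on my side, not something from this thread) of how the IOMMU state can be inspected from Python on Linux; it assumes /sys/class/iommu is populated when AMD-Vi/VT-d is active and that /proc/cmdline exposes the kernel boot parameters:

import os

# Sketch: report IOMMU status on Linux (assumes /sys/class/iommu and /proc/cmdline exist)
iommu_dir = "/sys/class/iommu"
entries = os.listdir(iommu_dir) if os.path.isdir(iommu_dir) else []
print("IOMMU devices:", entries if entries else "none found (IOMMU likely disabled)")

with open("/proc/cmdline") as f:
    print("Kernel cmdline:", f.read().strip())  # look for iommu= / amd_iommu= options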

I know this is probably not a PyTorch problem; I’m just not sure whether I can get a solution from the PyTorch community.

I’m not sure I understand the issue completely, but IOMMU is a common culprit, as you’ve pointed out, and disabling it could resolve the hang. Did you try it, and does it work?

I haven’t tried disabling IOMMU yet, but keeping it enabled is exactly what I need, because I have to deploy a virtualization platform on this machine.