Error with model = torchvision.models.resnet50(pretrained=True) in a multi-node, multi-GPU setting with DistributedDataParallel

I get the error below on a STANDARD_NC24 cluster with 4 nodes, each with 4 K80 GPUs, on Microsoft Azure (enterprise). I am using DistributedDataParallel with the NCCL backend over eth0. The job is pipelined, so I use env:// as the init_method for init_process_group.

5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.5<0>
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO P2P plugin IBext
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NET/IB : No device found.
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.5<0>
5225797bbbfa41b58e9cac81360abc94000001:47:47 [3] NCCL INFO Using network Socket
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bd7e-4468-6045-bd7e-44686045bd7e is not a PCI device (vmbus). Attaching to first CPU
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Attribute coll of node net not found
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO === System : maxWidth 5.0 totalWidth 12.0 ===
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO CPU/0 (1/1/1)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[5000.0] - NIC/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/100000 (4)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/200000 (5)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/300000 (6)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO + PCI[12.0] - GPU/400000 (7)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO ==========================================
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC)
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 5.000000/5.000000, type PHB/PHB, sameChannels 1
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, speed 6.000000/5.000000, type PHB/PHB, sameChannels 1
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Ring 00 : 6 -> 7 -> 8
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Ring 01 : 6 -> 7 -> 8
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Setting affinity for GPU 3 to 0fff
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 00 : 7[400000] -> 8[100000] [send] via NET/Socket/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 01 : 7[400000] -> 8[100000] [send] via NET/Socket/0
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Connected all rings
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 00 : 7[400000] -> 6[300000] via direct shared memory
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Channel 01 : 7[400000] -> 6[300000] via direct shared memory
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO Connected all trees
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
5225797bbbfa41b58e9cac81360abc94000001:47:213 [3] NCCL INFO comm 0x14ec70001240 rank 7 nranks 16 cudaDev 3 busId 400000 - Init COMPLETE
NCCL version is:  (2, 10, 3)
MLflow version: 2.3.2
Tracking URI: azureml:URI
Artifact URI: azureml:URI
World size: 16
local rank is 3 and world rank is 7
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_7/cifar-10-python.tar.gz
Failed download. Trying https -> http instead. Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data_7/cifar-10-python.tar.gz

  0%|          | 0/170498071 [00:00<?, ?it/s]
170499072it [00:03, 51834695.76it/s]                              
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
Extracting ./data_7/cifar-10-python.tar.gz to ./data_7
Files already downloaded and verified
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/opt/conda/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/opt/conda/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/opt/conda/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/opt/conda/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/opt/conda/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 93, in <module>
    model = torchvision.models.resnet50(pretrained=True)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 331, in resnet50
    return _resnet("resnet50", Bottleneck, [3, 4, 6, 3], pretrained, progress, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 296, in _resnet
    state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 591, in load_state_dict_from_url
    download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 457, in download_url_to_file
    u = urlopen(req)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/opt/conda/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/opt/conda/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
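
The root symptom looks like DNS resolution failing on the compute node rather than anything PyTorch-specific. A minimal stdlib-only check I can run on a node (the hostname is taken from the download URL in the traceback above):

```python
import socket

def can_resolve(host, port=443):
    """Return True if DNS resolution for `host` succeeds from this process."""
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        return False

# On the affected node this is expected to print False for
# download.pytorch.org, matching the "Temporary failure in name
# resolution" gaierror in the traceback.
print(can_resolve("download.pytorch.org"))
```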

The code is:

import time
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import mlflow
import os
import datetime

import configparser
import logging
import argparse

from PIL import Image

import ssl
ssl._create_default_https_context = ssl._create_unverified_context


start_time = time.time()

torch.backends.cudnn.benchmark=False
torch.backends.cudnn.deterministic=True


print("NCCL version is: ", torch.cuda.nccl.version())
print("MLflow version:", mlflow.__version__)
print("Tracking URI:", mlflow.get_tracking_uri())
print("Artifact URI:", mlflow.get_artifact_uri())

# Set the seed for reproducibility
torch.manual_seed(42)

# Set up the data loading parameters
batch_size = 128
num_epochs = 10
num_workers = 4
pin_memory = True

# Get the world size and rank to determine the process group
world_size = int(os.environ['WORLD_SIZE'])
world_rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])

print("World size:", world_size)
print("local rank is {} and world rank is {}".format(local_rank, world_rank))

is_distributed = world_size > 1

if is_distributed:
    batch_size = batch_size // world_size
    batch_size = max(batch_size, 1)

# Set the backend to NCCL for distributed training
dist.init_process_group(backend="nccl",
                        init_method="env://",
                        world_size=world_size,
                        rank=world_rank)

# Set the device to the current local rank
torch.cuda.set_device(local_rank)
device = torch.device('cuda', local_rank)

dist.barrier()

# Define the transforms for the dataset
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
])

# Load the CIFAR-10 dataset

data_root = './data_' + str(world_rank)
train_dataset = torchvision.datasets.CIFAR10(root=data_root, train=True, download=True, transform=transform_train)
train_sampler = torch.utils.data.distributed.DistributedSampler(dataset=train_dataset, num_replicas=world_size, rank=world_rank, shuffle=True) if is_distributed else None
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=(train_sampler is None), num_workers=num_workers, pin_memory=pin_memory, sampler=train_sampler)

test_dataset = torchvision.datasets.CIFAR10(root=data_root, train=False, download=True, transform=transform_test)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=pin_memory)

# Define the ResNet50 model
model = torchvision.models.resnet50(pretrained=True)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)

# Move the model to the GPU
model = model.to(device)

# Wrap the model with DistributedDataParallel
if is_distributed:
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model for the specified number of epochs
for epoch in range(num_epochs):
    running_loss = 0.0
    if train_sampler is not None:
        train_sampler.set_epoch(epoch)  # reseed the sampler so each epoch gets a new shuffle; without this every epoch replays the same order
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

        optimizer.step()

        running_loss += loss.item()

    print('[Epoch %d] loss: %.3f' % (epoch + 1, running_loss))
    if world_rank == 0:
        # Log the loss and running loss as MLFlow metrics
        mlflow.log_metric("loss", loss.item())
        mlflow.log_metric("running loss", running_loss)

dist.barrier()
# Save the trained model
if world_rank == 0:
    checkpoints_path = "train_checkpoints"
    os.makedirs(checkpoints_path, exist_ok=True)
    # Unwrap DDP before saving so the checkpoint keys are not prefixed with "module."
    model_to_save = model.module if is_distributed else model
    torch.save(model_to_save.state_dict(), '{}/{}-{}.pth'.format(checkpoints_path, 'resnet50_cifar10', world_rank))
    mlflow.pytorch.log_model(model_to_save, "resnet50_cifar10_{}.pth".format(world_rank))
    # mlflow.log_artifact('{}/{}-{}.pth'.format(checkpoints_path, 'resnet50_cifar10', world_rank), artifact_path="model_state_dict")

# Evaluate the model on the test set and save inference on 6 random images
model.eval()  # switch BatchNorm/Dropout layers to inference mode
correct = 0
total = 0
with torch.no_grad():
    fig, axs = plt.subplots(2, 3, figsize=(8, 6), dpi=100)
    axs = axs.flatten()
    count = 0
    for data in test_loader:
        if count == 6:
            break
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

        # Save the inference on the 6 random images
        if count < 6:
            image = np.transpose(inputs[0].cpu().numpy(), (1, 2, 0))
            confidence = torch.softmax(outputs, dim=1)[0][predicted[0]].cpu().numpy()
            class_name = test_dataset.classes[predicted[0]]
            axs[count].imshow(image)
            axs[count].set_title(f'Class: {class_name}\nConfidence: {confidence:.2f}')
            axs[count].axis('off')
            count += 1

# Aggregate the test accuracy across all processes
# (int64, not int8: int8 would overflow for counts above 127)
correct = torch.tensor(correct, dtype=torch.int64, device=device)
torch.distributed.all_reduce(correct, op=torch.distributed.ReduceOp.SUM)
total = torch.tensor(total, dtype=torch.int64, device=device)
torch.distributed.all_reduce(total, op=torch.distributed.ReduceOp.SUM)
# correct and total are both global sums, so their ratio is already the
# global accuracy; dividing by world_size again would be wrong
test_accuracy = (100.0 * correct / total).item()

print('Test accuracy: %.2f %%' % test_accuracy)

# Save the plot with the 6 random images and their predicted classes and prediction confidence
test_img_file_name = 'test_images_' + str(world_rank) + '.png'
plt.savefig(test_img_file_name)

# Log the test accuracy and elapsed time to MLflow
if world_rank == 0:
    mlflow.log_metric("test accuracy", test_accuracy)

end_time = time.time()
elapsed_time = end_time - start_time
print('Elapsed time: ', elapsed_time)
if world_rank == 0:
    mlflow.log_metric("elapsed time", elapsed_time)

# Save the plot with the 6 random images and their predicted classes and prediction confidence as an artifact in MLflow
image = Image.open(test_img_file_name)
image = image.convert('RGBA')
image_buffer = np.array(image)
image_buffer = image_buffer[:, :, [2, 1, 0, 3]]
image_buffer = np.ascontiguousarray(image_buffer)
artifact_file_name = "inference_on_test_images_" + str(world_rank) + ".png"
mlflow.log_image(image_buffer, artifact_file=artifact_file_name)

# End the MLflow run
if mlflow.active_run():
    mlflow.end_run()

dist.destroy_process_group()
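
Regarding the train_sampler.set_epoch(epoch) call in the training loop: as far as I understand, DistributedSampler seeds its shuffle from the epoch number, so without set_epoch every epoch replays the same order. A small self-contained illustration (no process group is needed when num_replicas and rank are passed explicitly; the toy dataset is just for demonstration):

```python
import torch
from torch.utils.data.distributed import DistributedSampler

# DistributedSampler only needs len(dataset), so a plain list works here.
dataset = list(range(16))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)

# set_epoch reseeds the underlying randperm, so each epoch yields a
# different shuffle of this rank's shard.
sampler.set_epoch(0)
order_epoch0 = list(sampler)
sampler.set_epoch(1)
order_epoch1 = list(sampler)

print(order_epoch0)
print(order_epoch1)
```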

Could you please guide me on how to fix this error?
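
In case it helps frame an answer: one workaround I am considering is staging the ImageNet checkpoint (the resnet50-0676ba61.pth file named in the log above) on storage the nodes can reach, and only falling back to the download when no local copy exists. This is just a sketch under the assumption that a weights_dir can be mounted on every node; resolve_weights_path and build_resnet50 are hypothetical helpers of mine, not an existing API:

```python
import os
from pathlib import Path

def resolve_weights_path(weights_dir, filename="resnet50-0676ba61.pth"):
    """Return the path to a locally staged checkpoint, or None if absent."""
    candidate = Path(weights_dir) / filename
    return candidate if candidate.is_file() else None

def build_resnet50(weights_dir):
    # Lazy imports so the path helper above stays usable without torch.
    import torch
    import torchvision

    local = resolve_weights_path(weights_dir)
    if local is not None:
        # Load from the staged file: no network access needed.
        model = torchvision.models.resnet50(pretrained=False)
        model.load_state_dict(torch.load(local, map_location="cpu"))
    else:
        # Falls back to the download path, which fails on nodes
        # without outbound DNS (as in the traceback above).
        model = torchvision.models.resnet50(pretrained=True)
    return model
```

Would this be a reasonable approach, or is there a cleaner way to point torch.hub at a pre-populated cache?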