socket.gaierror: [Errno -3] Temporary failure in name resolution at vgg_full = models.vgg19(pretrained=pretrained).features

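Training fails on an Azure ML compute node when torchvision tries to download the pretrained VGG19 weights. Is there a way to fix this?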

The error is related to the following lines:

  File "train.py", line 1465, in <module>
    net = DopeNetwork(pretrained=opt.model_arch_pretrained).cuda(local_rank)
  File "train.py", line 137, in __init__
    vgg_full = models.vgg19(pretrained=pretrained).features
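
Since the failing call only needs the network to fetch the checkpoint, one thing I could check is whether the weights file is already present in the torch hub cache on each node. The following is a minimal sketch, not a confirmed fix: the helper names `expected_checkpoint_path` and `weights_are_cached` are hypothetical, and the cache layout (`$TORCH_HOME/hub/checkpoints/<filename>`, with `TORCH_HOME` defaulting to `~/.cache/torch`) is my assumption about how torch.hub places files, although it matches the path printed in the "Downloading:" line of the log.

```python
import os
from urllib.parse import urlparse

# URL copied from the "Downloading:" line in the log.
VGG19_URL = "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth"


def expected_checkpoint_path(url, torch_home=None):
    """Return the path where torch.hub is assumed to cache the file for `url`."""
    if torch_home is None:
        # Assumed default: $TORCH_HOME, else $XDG_CACHE_HOME/torch, else ~/.cache/torch.
        cache_root = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
        torch_home = os.environ.get("TORCH_HOME", os.path.join(cache_root, "torch"))
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(torch_home, "hub", "checkpoints", filename)


def weights_are_cached(url, torch_home=None):
    """Return True if the checkpoint file already exists in the assumed cache path."""
    return os.path.isfile(expected_checkpoint_path(url, torch_home))


if __name__ == "__main__":
    path = expected_checkpoint_path(VGG19_URL)
    print(path, "cached" if weights_are_cached(VGG19_URL) else "missing")
```

If the file can be staged at that path before training starts (for example, baked into the image or copied from mounted storage), my understanding is that `models.vgg19(pretrained=True)` would load it from the cache instead of opening a connection.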

Here are the packages I am using:

albumentations==1.3.0
ConfigParser==5.3.0
horovod==0.27.0
matplotlib==3.7.0
numpy==1.24.2
nvisii==1.1.72
Pillow==9.4.0
profiling==0.1.3
psutil==5.9.0
pyquaternion==0.9.9
pyrealsense2==2.53.1.4623
pyrender==0.1.45
pyrr==0.10.3
PyYAML==6.0
scipy==1.10.1
seaborn==0.12.2
simplejson==3.18.4
tensorboardX==2.6
tqdm==4.64.1
opencv-python-headless==4.1.2.30
torch==1.13.0
torchvision==0.14.0
mlflow==2.3.2
azureml-mlflow==1.50.0
CPython
3.8.10
uname_result(system='Linux', node='number', release='5.15.0-1029-azure', version='#36~20.04.1-Ubuntu SMP Tue Dec 6 17:00:26 UTC 2022', machine='x86_64', processor='x86_64')
System information: Linux #36~20.04.1-Ubuntu SMP Tue Dec 6 17:00:26 UTC 2022
Python version: 3.8.10
MLflow version: 2.3.2
MLflow module location: /usr/local/lib/python3.8/dist-packages/mlflow/__init__.py
Tracking URI: URI
Registry URI: URI
MLflow environment variables:
  MLFLOW_DISABLE_ENV_MANAGER_CONDA_WARNING: True
  MLFLOW_EXPERIMENT_ID: 021dec90-c7d3-4233-813e-799d15e43f9a
  MLFLOW_EXPERIMENT_NAME: dev_DOPE_FAT_test2
  MLFLOW_RUN_ID: 54be70c6-c1a9-4433-8943-e1678667884b
  MLFLOW_TRACKING_TOKEN: token
  MLFLOW_TRACKING_URI: URI
MLflow dependencies:
  Flask: 2.3.2
  Jinja2: 3.1.2
  alembic: 1.11.1
  click: 8.1.3
  cloudpickle: 2.2.0
  databricks-cli: 0.17.7
  docker: 6.1.2
  entrypoints: 0.4
  gitpython: 3.1.31
  gunicorn: 20.1.0
  importlib-metadata: 5.1.0
  markdown: 3.4.1
  matplotlib: 3.7.0
  numpy: 1.24.2
  packaging: 22.0
  pandas: 1.5.2
  protobuf: 3.20.1
  pyarrow: 9.0.0
  pytz: 2022.6
  pyyaml: 6.0
  querystring-parser: 1.2.4
  requests: 2.28.1
  scikit-learn: 0.24.2
  scipy: 1.10.1
  sqlalchemy: 2.0.15
  sqlparse: 0.4.4
2023/05/23 15:48:00 WARNING mlflow.tracking.fluent: Exception raised while enabling autologging for sklearn: No module named 'sklearn.utils.testing'
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO cudaDriverVersion 11040
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.6<0>
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO P2P plugin IBext
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NET/IB : No device found.
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.6<0>
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Using network Socket
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531303244/pci0001:00/0001:00:00.0/../max_link_width, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531303244/pci0002:00/0002:00:00.0/../max_link_width, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_speed, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531303244/pci0003:00/0003:00:00.0/../max_link_width, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_speed, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0004-0000-3130-444531303244/pci0004:00/0004:00:00.0/../max_link_width, ignoring
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/6045bd7e-620d-6045-bd7e-620d6045bd7e is not a PCI device (vmbus). Attaching to first CPU
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO === System : maxBw 5.0 totalBw 12.0 ===
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO CPU/0 (1/1/1)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO + PCI[5000.0] - NIC/0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO                 + NET[5.0] - NET/0 (0/0/5.000000)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO + PCI[12.0] - GPU/100000 (4)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO + PCI[12.0] - GPU/200000 (5)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO + PCI[12.0] - GPU/300000 (6)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO + PCI[12.0] - GPU/400000 (7)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO ==========================================
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO GPU/100000 :GPU/100000 (0/5000.000000/LOC) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO GPU/200000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (0/5000.000000/LOC) GPU/300000 (2/12.000000/PHB) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO GPU/300000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (0/5000.000000/LOC) GPU/400000 (2/12.000000/PHB) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO GPU/400000 :GPU/100000 (2/12.000000/PHB) GPU/200000 (2/12.000000/PHB) GPU/300000 (2/12.000000/PHB) GPU/400000 (0/5000.000000/LOC) CPU/0 (1/12.000000/PHB) NET/0 (3/5.000000/PHB)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NET/0 :GPU/100000 (3/5.000000/PHB) GPU/200000 (3/5.000000/PHB) GPU/300000 (3/5.000000/PHB) GPU/400000 (3/5.000000/PHB) CPU/0 (2/5.000000/PHB) NET/0 (0/5000.000000/LOC)
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Setting affinity for GPU 3 to 0fff
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 5.000000/5.000000, type PHB/PHB, sameChannels 1
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 6.000000/5.000000, type PHB/PHB, sameChannels 1
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type NVL/PIX, sameChannels 1
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Ring 00 : 6 -> 7 -> 8
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Ring 01 : 6 -> 7 -> 8
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Channel 00/0 : 7[400000] -> 8[100000] [send] via NET/Socket/0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Channel 01/0 : 7[400000] -> 8[100000] [send] via NET/Socket/0
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Connected all rings
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Channel 00 : 7[400000] -> 6[300000] via SHM/direct/direct
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Channel 01 : 7[400000] -> 6[300000] via SHM/direct/direct
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO Connected all trees
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
training script path:  /mnt/azureml/cr/j/aa54f8a6f177420ea912cde147fec47b/exe/wd
start: 15:48:00.595235
object of interest is: cracker
manual seed set to 2236
opt.checkpoints = /mnt/azureml/cr/j/aa54f8a6f177420ea912cde147fec47b/cap/data-capability/wd/checkpoints
world size is:  16
global rank is 7 and local_rank is 3
is_distributed is True and batch_size is 1
os.getpid() is 48 and initializing process group with {'MASTER_ADDR': '10.0.0.4', 'MASTER_PORT': '6105', 'LOCAL_RANK': '3', 'RANK': '7', 'WORLD_SIZE': '16'}
device is cuda:3
load data
train data size:  246000
training data len:  246000
batch size is:  1
training data: 15375 batches
load models
torch.cuda.device_count():  4
type opt.gpuids: <class 'list'>
gpuids are: [0, 1, 2, 3]
Training network pretrained on imagenet.
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/usr/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/usr/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 1465, in <module>
    net = DopeNetwork(pretrained=opt.model_arch_pretrained).cuda(local_rank)
  File "train.py", line 137, in __init__
    vgg_full = models.vgg19(pretrained=pretrained).features
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py", line 142, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/_utils.py", line 228, in inner_wrapper
    return builder(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/vgg.py", line 467, in vgg19
    return _vgg("E", False, weights, progress, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/vgg.py", line 105, in _vgg
    model.load_state_dict(weights.get_state_dict(progress=progress))
  File "/usr/local/lib/python3.8/dist-packages/torchvision/models/_api.py", line 66, in get_state_dict
    return load_state_dict_from_url(self.url, progress=progress)
  File "/usr/local/lib/python3.8/dist-packages/torch/hub.py", line 731, in load_state_dict_from_url
    download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/usr/local/lib/python3.8/dist-packages/torch/hub.py", line 597, in download_url_to_file
    u = urlopen(req)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO NCCL_P2P_PXN_LEVEL set by environment to 0.
517cc60b90264a01ac05aafd5e6ca61e000002:48:296 [3] NCCL INFO comm 0x2be8a6b0 rank 7 nranks 16 cudaDev 3 busId 400000 - Init COMPLETE
517cc60b90264a01ac05aafd5e6ca61e000002:48:300 [3] NCCL INFO [Service thread] Connection closed by localRank 3
517cc60b90264a01ac05aafd5e6ca61e000002:48:48 [3] NCCL INFO comm 0x2be8a6b0 rank 7 nranks 16 cudaDev 3 busId 400000 - Abort COMPLETE
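
To narrow down whether this is a DNS problem on the node rather than something in torchvision, I could run a small check that makes the same `getaddrinfo` call that fails in the traceback. This is only a diagnostic sketch using the Python standard library; the function name `can_resolve` is hypothetical and the default port 443 is my assumption for an HTTPS download.

```python
import socket


def can_resolve(host, port=443):
    """Return True if `host` resolves to at least one address, else False."""
    try:
        results = socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
        return len(results) > 0
    except socket.gaierror:
        # Same exception class that is raised in the traceback.
        return False
```

Calling `can_resolve("download.pytorch.org")` on every node before the model is built would show whether only some workers (here, global rank 7) lack name resolution or whether the whole cluster has no outbound DNS.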