I’m trying to run a PyTorch job through AWS ECS (essentially just a Docker container running on an EC2 instance), but I receive the following error:
```
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```
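For reference, the CPU fallback the error message suggests would look like this (a minimal sketch; the checkpoint path is a placeholder, not my actual file):

```python
import torch

# Fallback suggested by the error message: map all stored tensors to the CPU.
# "model.pth" is a placeholder path for illustration only.
model = torch.load("model.pth", map_location=torch.device("cpu"))
```

But falling back to the CPU defeats the purpose of the GPU instance, so I’d like to understand why CUDA isn’t visible in the first place.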
I use the GPU ECS AMI (ami-0180e79579e32b7e6) together with the 19.09 NVIDIA PyTorch Docker image.
What throws me off is that the nvidia-smi output in the container log says everything is fine with CUDA:
```
=============
== PyTorch ==
=============

NVIDIA Release 19.09 (build 7911588)
PyTorch Version 1.2.0a0+afb7a16

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
      insufficient for PyTorch. NVIDIA recommends the use of the following flags:
      nvidia-docker run --ipc=host ...

Fri Sep 27 09:59:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    41W / 300W |      0MiB / 16130MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
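Yet PyTorch itself disagrees. A minimal way to see the mismatch from a Python shell inside the container (nothing assumed beyond the stock image):

```python
import torch

# The PyTorch build in the image was compiled against CUDA
# (expected to print "10.1" for the 19.09 image)...
print(torch.version.cuda)

# ...but no CUDA device is visible at runtime, which is exactly
# the check that torch.load trips over.
print(torch.cuda.is_available())  # False in my case
```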
Plus, the same container runs fine locally on my GTX 1070. The issue can’t be that the Docker image ships CUDA 10.1, right? It would be weird for NVIDIA to publish a PyTorch Docker image that doesn’t actually work…
Any help is greatly appreciated.