CUDA not available in PyTorch (but is in nvidia-smi)

I'm trying to run a PyTorch job through AWS ECS (just running a Docker container inside EC2), but I receive the following error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
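For reference, the workaround the error message suggests looks like this (a minimal sketch; the save step is only there to make the snippet self-contained, and `example.pt` is a placeholder path):

```python
import torch

# Save a tensor to disk, then load it back while forcing all storages onto
# the CPU, so the load also succeeds on a machine without a usable GPU.
torch.save(torch.arange(4.0), "example.pt")  # "example.pt" is a placeholder
t = torch.load("example.pt", map_location=torch.device("cpu"))
print(t.device)  # cpu
```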

I use the GPU ECS AMI (ami-0180e79579e32b7e6) together with the 19.09 NVIDIA PyTorch Docker image.
The weird thing that throws me off is that nvidia-smi reports that everything is fine with CUDA:

=============
== PyTorch ==
=============

NVIDIA Release 19.09 (build 7911588)
PyTorch Version 1.2.0a0+afb7a16

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
      insufficient for PyTorch. NVIDIA recommends the use of the following flags:
      nvidia-docker run --ipc=host ...

Fri Sep 27 09:59:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    41W / 300W |      0MiB / 16130MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Plus, it also runs fine locally with my GTX 1070. The issue can't be that the Docker image ships with CUDA 10.1, right? It would be strange if NVIDIA published a PyTorch Docker image that doesn't actually work...
Any help is greatly appreciated :smiley:

Could you try to create a simple dummy CUDA tensor inside the container on the AWS machine and check if it's working?
If so, the issue might come from the deserialization, although I'm not sure why that would be the case.
Do you mean the 19.09 container is working fine on your local machine with a GTX 1070?
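A quick check along those lines could look like this (a minimal sketch; it only touches the GPU if PyTorch actually sees one):

```python
import torch

# Does PyTorch see a CUDA device at all?
print(torch.__version__)
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # Create a small dummy tensor directly on the GPU.
    x = torch.randn(2, 2, device="cuda")
    print(x.device)  # e.g. cuda:0
```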

Actually, it was the AWS AMI's fault: I had used the one displayed on https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html instead of the one in the AWS Marketplace.