I am using GPUs on Google Cloud (GCP), and the only machine image they seem to provide ships CUDA 11.0 (!). Only PyTorch <= 1.7 has prebuilt binaries for CUDA 11.0.
I am creating a Dockerfile for my project. However, some of my library’s dependencies want PyTorch 1.9, so pip upgrades my install from the PyTorch 1.7 GPU build to the PyTorch 1.9 CPU-only build.
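For what it’s worth, the silent GPU-to-CPU swap can usually be prevented with a pip constraints file. This is a hypothetical sketch (the base image tag and `requirements.txt` path are assumptions about my setup), shown only to illustrate the pinning trick:

```dockerfile
# Hypothetical sketch: pin the GPU build of PyTorch so a later
# `pip install` of the library's dependencies cannot replace it
# with the CPU-only 1.9 wheel.
FROM nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04

RUN apt-get update && apt-get install -y python3 python3-pip

# Install the CUDA 11.0 build of PyTorch 1.7 first...
RUN pip3 install torch==1.7.1+cu110 \
    -f https://download.pytorch.org/whl/torch_stable.html

# ...then install everything else under a constraints file that pins torch,
# so the resolver fails loudly instead of silently swapping in the CPU wheel.
COPY requirements.txt /tmp/requirements.txt
RUN echo "torch==1.7.1+cu110" > /tmp/constraints.txt && \
    pip3 install -c /tmp/constraints.txt -r /tmp/requirements.txt
```

Of course this only works if the dependencies actually tolerate torch 1.7; in my case they want 1.9, which is the whole problem.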
I think PyTorch 1.9 is a must, but I am not sure which workaround is the least painful:
- Can I use CUDA 10.2 inside Docker even though the bare-metal system has the CUDA 11.0 driver? Or will that cause problems? I have seen conflicting advice on this.
- I could try to build my own Google machine image. This seems very painful, though.
- I could try to build PyTorch 1.9 from source inside my Docker image and use CUDA 11.0 there. I haven’t found good Dockerfiles explaining how to do this.
Any other suggestions?
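In case it helps discussion, here is the rough (untested) shape I had in mind for the third option. The base image tag, branch, and `TORCH_CUDA_ARCH_LIST` values are assumptions; the arch list in particular must match the actual GPUs:

```dockerfile
# Rough, untested sketch of option 3: build PyTorch 1.9 from source
# against the CUDA 11.0 toolkit inside the image.
FROM nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04

RUN apt-get update && apt-get install -y git python3 python3-pip
RUN pip3 install numpy ninja pyyaml cmake typing_extensions

# Clone the tagged 1.9.0 release with its submodules.
RUN git clone --recursive --branch v1.9.0 \
    https://github.com/pytorch/pytorch /pytorch
WORKDIR /pytorch

# Assumption: compute capabilities for the target GPUs
# (e.g. 7.0 for V100, 7.5 for T4). Adjust to your hardware.
ENV TORCH_CUDA_ARCH_LIST="7.0;7.5"
RUN python3 setup.py install
```

The build takes a long time and needs a lot of RAM, which is part of why I was hoping someone had a better-trodden path.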
I don’t have an exact answer to your question, but I tried something similar to your third option here: Torch CUDA unknown error, but CUDA and nvidia-smi properly installed on Azure K8s Service - PyTorch Forums. I’m using Microsoft Azure instead of Google Cloud, but the same principle applies when building a Docker image. The example there correctly installs CUDA and nvidia-smi, so CUDA 11.4 is detected, but unfortunately PyTorch can’t initialize CUDA properly for some reason, so I’d also be interested if you find a solution via your third option.
- Yes, you should be able to use a CUDA 10.2 Docker container, since its driver requirement is already met by the newer CUDA 11 driver on the host.
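  A quick sanity check for this, assuming the NVIDIA Container Toolkit is set up on the host (the exact base image tag is an assumption):

  ```shell
  # Run nvidia-smi inside a CUDA 10.2 base image. The driver/CUDA version
  # it reports comes from the *host* driver, which only needs to be new
  # enough for the 10.2 toolkit inside the container - the CUDA 11.0
  # driver satisfies that.
  docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi
  ```

  If this prints the GPU table, a CUDA 10.2 userspace inside the container should work on that host.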
- I don’t know what a Google machine image is and how hard it would be to build it.
- You could reuse the Dockerfile from the PyTorch repository.
Also, did you try installing the CUDA 11.1 PyTorch 1.9.0 binaries? I don’t know which driver is installed on the node, but CUDA’s enhanced compatibility might make them work.
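For reference, installing those binaries inside the Dockerfile would look roughly like this (torchvision is included only as an example of keeping the `+cu111` builds consistent):

```dockerfile
# Install the CUDA 11.1 builds of PyTorch 1.9.0 from the PyTorch wheel
# index. Thanks to CUDA 11 minor-version (enhanced) compatibility, these
# may run on a host with a CUDA 11.0 driver.
RUN pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
```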