V1.6: Cuda compatibility issue on Google Colab

haseeb33 · July 29, 2020, 1:52am

Up until 2020-07-28T15:00:00Z, compatibility issues:
I want to use torchvision.models.detection.maskrcnn_resnet50_fpn() with argument trainable_backbone_layers which is only available in v1.6 (latest version).

The minimum CUDA version for v1.6 is 10.2, but Google Colab has CUDA 10.1 installed by default. If I upgrade CUDA to the latest version, 11.0, I run into issues with the mxnet library.

How can I solve this issue?
If I somehow manage to run CUDA 11.0 without breaking mxnet, do I have to install these packages every time I start my Google Colab notebook? Isn’t there a way to do this step only once?

P.S.: The default torch version in Google Colab is 1.5, so I also have to install the latest one every time.

terekita · October 31, 2020, 6:43am

I’m having a similar issue trying to use the NVIDIA Imaginaire package. I need to have CUDA 10.2 installed on Google Colab, but after many hours of attempting various things and extensive googling, I can’t figure out how to get CUDA 10.2 installed on Colab. Were you ever able to resolve this?

ptrblck · October 31, 2020, 9:29am

Do you have to install CUDA10.2 directly or only the PyTorch binaries with the CUDA10.2 runtime?
I’m not sure if the former use case is even possible, since you are only using a Python notebook and would need to install the CUDA libraries on the server/VM.
The latter should be possible by using the install instructions in the notebook.

terekita · October 31, 2020, 8:24pm

Thanks very much for your reply!

I don’t know if 10.2 has to be installed directly for Imaginaire to run on Colab. Therefore, I’m attempting to use the PyTorch binaries with the CUDA 10.2 runtime. However, I can’t get the latter working, either.

To try to install the 10.2 runtime, I ran the instructions from the notebook you linked, i.e., I ran this command:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

Unfortunately, when I check with:
!nvcc --version

or with:
!cat /usr/local/cuda/version.txt
or
!dpkg -l | grep cuda

I only see reference to versions below 10.2. Should I perhaps try to install a different way, and/or should I be checking the runtime version differently?

Thanks very much.

ptrblck · November 1, 2020, 5:16am

The conda binary will only install the CUDA runtime in the current conda environment, not a full CUDA toolkit in /usr/local/cuda.
After installing it, you can check the CUDA version used in the PyTorch binaries via:

print(torch.version.cuda)
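A minimal sketch that gathers these checks into one notebook cell, so the runtime bundled with PyTorch and the toolkit found on the PATH can be compared side by side. The helper name `cuda_report` is hypothetical, and it reports `None` for anything it cannot find (PyTorch not installed, CPU-only build, or no `nvcc` on the PATH).

```python
import shutil
import subprocess


def cuda_report():
    """Collect the CUDA versions seen by PyTorch and by the local toolkit."""
    report = {
        "torch": None,
        "torch_cuda_runtime": None,
        "cuda_available": None,
        "nvcc": None,
    }

    try:
        import torch
    except ImportError:
        torch = None  # PyTorch is not installed in this environment

    if torch is not None:
        report["torch"] = torch.__version__
        # CUDA runtime the binary was built with (None for CPU-only builds)
        report["torch_cuda_runtime"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()

    # nvcc belongs to a full local toolkit, not to the PyTorch binaries
    nvcc_path = shutil.which("nvcc")
    if nvcc_path is not None:
        result = subprocess.run(
            [nvcc_path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        report["nvcc"] = result.stdout.strip()

    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

The two numbers may legitimately differ: `torch_cuda_runtime` describes what ships inside the PyTorch binaries, while `nvcc` describes the toolkit in /usr/local/cuda.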

terekita · November 1, 2020, 6:55am

Thank you very much for that information. When I run print(torch.version.cuda) after executing the conda command, it looks like I’m still on 10.1.

For what it’s worth, this is the notebook I’m working in to try to get this running:

ptrblck · November 1, 2020, 7:14am

I can install PyTorch 1.7.0 with the CUDA10.2 runtime in a new Colab notebook.
After creating a new notebook, it seems 1.6 with CUDA10.1 is installed.
Uninstalling this wheel and installing the new one via:

!pip uninstall torch torchvision -y
!pip install torch torchvision

results in:

import torch
print(torch.__version__)
> 1.7.0
print(torch.version.cuda)
> 10.2

terekita · November 1, 2020, 5:43pm

That’s awesome (and I verified this on my end, too)! Thanks so much for patiently helping me through this, I really appreciate it! Now, on to installing Imaginaire…

terekita · November 1, 2020, 6:26pm

Keeping this in the current thread because it still relates to cuda 10.2

Was not able to install Imaginaire on Colab successfully yet. I executed

!pip uninstall torch torchvision -y
!pip install torch torchvision

and verified torch.version.cuda results in 10.2.

Running the test scripts from Imaginaire (after installation) results in errors that show that 10.2 is not being used by the installation. Any ideas about how to get the Imaginaire build to use the runtime cuda? Errors below:

    writing /tmp/pip-req-build-j0qic0rq/pip-egg-info/apex.egg-info/PKG-INFO
    writing dependency_links to /tmp/pip-req-build-j0qic0rq/pip-egg-info/apex.egg-info/dependency_links.txt
    writing top-level names to /tmp/pip-req-build-j0qic0rq/pip-egg-info/apex.egg-info/top_level.txt
    writing manifest file '/tmp/pip-req-build-j0qic0rq/pip-egg-info/apex.egg-info/SOURCES.txt'
    writing manifest file '/tmp/pip-req-build-j0qic0rq/pip-egg-info/apex.egg-info/SOURCES.txt'
    /usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    /tmp/pip-req-build-j0qic0rq/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
  Source in /tmp/pip-req-build-j0qic0rq has version 0.1, which satisfies requirement apex==0.1 from file:///tmp/apex
  Removed apex==0.1 from file:///tmp/apex from build tracker '/tmp/pip-req-tracker-ps1dpr20'
Skipping wheel build for apex, due to binaries being disabled for it.
Installing collected packages: apex
  Created temporary directory: /tmp/pip-record-4vkocx4g
    Running command /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-j0qic0rq/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-j0qic0rq/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-4vkocx4g/install-record.txt --single-version-externally-managed --compile
    /usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

ptrblck · November 1, 2020, 10:08pm

If you need to compile the package, you would still need a local CUDA installation and I doubt you will be able to install it in Colab through the notebook (I might be wrong).
Thus I think you would need to use a proper server and either install CUDA directly on it or via docker etc.
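For context on the `found version 10010` in the warning: CUDA reports its version as a single integer, 1000 × major + 10 × minor, so 10010 stands for 10.1. A minimal sketch of the decoding follows; the function name `decode_cuda_version` is hypothetical.

```python
def decode_cuda_version(encoded):
    """Split an integer such as 10010 into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


print(decode_cuda_version(10010))  # (10, 1)
print(decode_cuda_version(10020))  # (10, 2)
```

Read this way, the warning says the driver supports CUDA up to 10.1, which is older than the 10.2 runtime bundled with the newly installed wheel.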

terekita · November 1, 2020, 10:40pm

Thanks so much for your replies on this thread!

I was afraid it might come down to that (spent a lot of money on server time already this year, and was really happy to be getting so much work done on colab pro for thousands less…) If I happen to somehow discover otherwise at any point, I’ll post back on this thread. Thanks again!

ptrblck · November 2, 2020, 6:43am

Sure, let me know if you find a workaround.
Btw. I skimmed again through the repository and cannot find any hard requirement for CUDA10.2.
Since it seems the nodes running Colab notebooks are preinstalled with CUDA10.1, wouldn’t it work to install the matching PyTorch CUDA10.1 binary and build it?
What kind of error are you seeing?
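A minimal sketch of how one might check that the installed wheel matches the local toolkit before building extensions. It assumes the wheel's version string carries a `+cuXYZ` local tag whose last digit is the minor version (for example `1.7.0+cu101`), and that `nvcc --version` prints a `release X.Y` line. Both helper names are hypothetical.

```python
import re


def wheel_cuda_version(torch_version):
    """Return '10.1' for '1.7.0+cu101', or None when there is no +cuXYZ tag."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def nvcc_release(nvcc_output):
    """Return '10.1' from a line such as 'release 10.1, V10.1.243'."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample strings; in a notebook these would come from torch.__version__
# and from the output of `nvcc --version`.
sample_torch = "1.7.0+cu101"
sample_nvcc = "Cuda compilation tools, release 10.1, V10.1.243"

print(wheel_cuda_version(sample_torch) == nvcc_release(sample_nvcc))  # True
```

A wheel without a local tag returns `None` here, so the comparison would need a fallback such as `torch.version.cuda` in that case.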

terekita · November 2, 2020, 5:45pm

Thank you for taking another look through the repository, and I’m excited to try installing the matching 10.1 binary!

I’d only gotten as far as the errors that I posted earlier in this thread (generated while running the test script after installation).

terekita · November 2, 2020, 8:47pm

I’m not sure I understand how to install the matching PyTorch / Cuda 10.1 binary.

Does this look right?:

!pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

The notebook where I’m attempting this is here. The test script fails, as do attempts to run the MUNIT training.

ptrblck · November 3, 2020, 6:35am

The install command looks alright. What kind of error message are you seeing?

terekita · November 3, 2020, 7:31am

After running

!bash scripts/install.sh
!bash scripts/test_training.sh

there’s a ton of text (an order of magnitude more than I’m allowed to paste in here), but the early CUDA-specific parts are:

Skipping wheel build for apex, due to binaries being disabled for it.
Installing collected packages: apex
  Created temporary directory: /tmp/pip-record-rmsoy7_d
    Running command /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-_mr01qdh/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-_mr01qdh/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rmsoy7_d/install-record.txt --single-version-externally-managed --compile


    torch.__version__  = 1.7.0+cu101


    /tmp/pip-req-build-_mr01qdh/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
      warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
    from /usr/local/cuda/bin

also many lines like this:

csrc/layer_norm_cuda.cpp:117:23: note: in expansion of macro ‘TORCH_CHECK’
     #define CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
                           ^~~~~~~~~~~
    csrc/layer_norm_cuda.cpp:119:24: note: in expansion of macro ‘CHECK_CUDA’
     #define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
                            ^~~~~~~~~~
    csrc/layer_norm_cuda.cpp:194:3: note: in expansion of macro ‘CHECK_INPUT’
       CHECK_INPUT(mean);

The end of the output is:

python scripts/build_lmdb.py --config configs/unit_test/spade.yaml --paired --data_root dataset/unit_test/raw/spade/ --output_root dataset/unit_test/lmdb/spade --overwrite >> /tmp/unit_test.log  [Success] 
2020-11-03 07:18:53.054483: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100% 548M/548M [00:06<00:00, 83.8MB/s]
Traceback (most recent call last):
  File "train.py", line 93, in <module>
    main()
  File "train.py", line 72, in main
    for it, data in enumerate(train_data_loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/datasets/paired_videos.py", line 302, in __getitem__
    return self._getitem(index, concat=True)
  File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/datasets/paired_videos.py", line 249, in _getitem
    data, is_flipped = self.perform_augmentation(data, paired=True)
  File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/datasets/base.py", line 318, in perform_augmentation
    aug_inputs, paired=paired)
  File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/utils/data.py", line 383, in perform_augmentation
    return self._perform_paired_augmentation(inputs)
  File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/utils/data.py", line 314, in _perform_paired_augmentation
    augmented = alb.ReplayCompose(
AttributeError: module 'albumentations' has no attribute 'ReplayCompose'

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--config', 'configs/unit_test/spade.yaml']' returned non-zero exit status 1.
 python -m torch.distributed.launch --nproc_per_node=1 train.py  --config configs/unit_test/spade.yaml >> /tmp/unit_test.log  [Failure]

ptrblck · November 3, 2020, 9:25am

The first outputs are just warnings and you can ignore them.
The code is crashing, as albumentations fails with ReplayCompose, so it seems to be unrelated to CUDA10.1 vs. CUDA10.2.
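A minimal sketch for confirming this kind of failure directly, without rerunning the whole test script. One plausible cause (an assumption, not confirmed in this thread) is that the installed albumentations release differs from the one the code expects. The helper names are hypothetical, and the `math` call is only there to show the behaviour on a module that is always available.

```python
import importlib


def missing_attributes(module_name, names):
    """Return the entries of `names` that the imported module does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


def installed_version(module_name):
    """Return the module's __version__ string, or None if it has none."""
    module = importlib.import_module(module_name)
    return getattr(module, "__version__", None)


# Demonstration on a standard-library module
print(missing_attributes("math", ["sqrt", "ReplayCompose"]))  # ['ReplayCompose']

# The check relevant to the traceback; guarded in case the package is absent
try:
    print(installed_version("albumentations"))
    print(missing_attributes("albumentations", ["ReplayCompose"]))
except ImportError:
    print("albumentations is not installed")
```

An empty list from the second call would mean `ReplayCompose` is present and the problem lies elsewhere.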

terekita · November 3, 2020, 5:01pm

So helpful to know it’s not a 10.1 vs. 10.2 thing, thank you very much. I’ll start trying to hunt down the crash. Much appreciated.