After running
!bash scripts/install.sh
!bash scripts/test_training.sh
there’s a lot of output (about an order of magnitude more than I’m allowed to paste here), but the early CUDA-specific lines are:
Skipping wheel build for apex, due to binaries being disabled for it.
Installing collected packages: apex
Created temporary directory: /tmp/pip-record-rmsoy7_d
Running command /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-_mr01qdh/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-_mr01qdh/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rmsoy7_d/install-record.txt --single-version-externally-managed --compile
torch.__version__ = 1.7.0+cu101
/tmp/pip-req-build-_mr01qdh/setup.py:67: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
from /usr/local/cuda/bin
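Both the torch build (1.7.0+cu101) and nvcc (release 10.1) report CUDA 10.1, so the apex CUDA extension build does not look like the source of the problem. Here is a minimal sketch of that comparison, parsing the two strings exactly as they appear in this log. The function name is hypothetical and is not part of apex or PyTorch.

```python
import re


def cuda_versions_match(torch_version: str, nvcc_output: str) -> bool:
    """Compare the CUDA tag in a torch build string (e.g. '1.7.0+cu101')
    with the release reported by `nvcc --version` (e.g. 'release 10.1')."""
    tag = re.search(r"\+cu(\d+)", torch_version)
    release = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if tag is None or release is None:
        # CPU-only torch build or output in an unexpected format.
        return False
    major, minor = release.groups()
    # 'cu101' corresponds to CUDA 10.1, so the digits are concatenated.
    return tag.group(1) == major + minor


if __name__ == "__main__":
    nvcc = "Cuda compilation tools, release 10.1, V10.1.243"
    print(cuda_versions_match("1.7.0+cu101", nvcc))
```

This is a string comparison only. It does not check the driver version or whether the compiled extensions actually load.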
There are also many lines like this:
csrc/layer_norm_cuda.cpp:117:23: note: in expansion of macro ‘TORCH_CHECK’
#define CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
^~~~~~~~~~~
csrc/layer_norm_cuda.cpp:119:24: note: in expansion of macro ‘CHECK_CUDA’
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
^~~~~~~~~~
csrc/layer_norm_cuda.cpp:194:3: note: in expansion of macro ‘CHECK_INPUT’
CHECK_INPUT(mean);
The end of the output is:
python scripts/build_lmdb.py --config configs/unit_test/spade.yaml --paired --data_root dataset/unit_test/raw/spade/ --output_root dataset/unit_test/lmdb/spade --overwrite >> /tmp/unit_test.log [Success]
2020-11-03 07:18:53.054483: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100% 548M/548M [00:06<00:00, 83.8MB/s]
Traceback (most recent call last):
File "train.py", line 93, in <module>
main()
File "train.py", line 72, in main
for it, data in enumerate(train_data_loader):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
return self._process_data(data)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
data.reraise()
File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 428, in reraise
raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/datasets/paired_videos.py", line 302, in __getitem__
return self._getitem(index, concat=True)
File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/datasets/paired_videos.py", line 249, in _getitem
data, is_flipped = self.perform_augmentation(data, paired=True)
File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/datasets/base.py", line 318, in perform_augmentation
aug_inputs, paired=paired)
File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/utils/data.py", line 383, in perform_augmentation
return self._perform_paired_augmentation(inputs)
File "/content/gdrive/My Drive/IMAGINAIRE-colab/imaginaire/imaginaire/utils/data.py", line 314, in _perform_paired_augmentation
augmented = alb.ReplayCompose(
AttributeError: module 'albumentations' has no attribute 'ReplayCompose'
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--config', 'configs/unit_test/spade.yaml']' returned non-zero exit status 1.
python -m torch.distributed.launch --nproc_per_node=1 train.py --config configs/unit_test/spade.yaml >> /tmp/unit_test.log [Failure]
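The actual failure is not CUDA-related. imaginaire/utils/data.py calls alb.ReplayCompose, and the installed albumentations module has no such attribute. My assumption is that an older albumentations version is installed in the Colab runtime, but I have not confirmed which one. Here is a small sketch for checking the installed version and whether the attribute exists. The helper name is hypothetical.

```python
import importlib
from typing import Optional, Tuple


def inspect_module(module_name: str, attr: str) -> Tuple[Optional[str], bool]:
    """Return (version, has_attr) for an installed module,
    or (None, False) if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, False
    version = getattr(module, "__version__", None)
    return version, hasattr(module, attr)


if __name__ == "__main__":
    # Run in the same runtime that produced the traceback above.
    version, ok = inspect_module("albumentations", "ReplayCompose")
    print(f"albumentations {version}: ReplayCompose available = {ok}")
```

If this prints False, compare the reported version against the albumentations version that imaginaire's install script is meant to provide.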