Hello,
I have been trying to use either PyTorch 1.6.0 RC or a PyTorch nightly (currently torch-1.7.0.dev20200702+cu101) in order to get access to native 16-bit precision support.
On the implementation side I am using pytorch-lightning (0.8.4) to handle all the training boilerplate code.
Whenever I run a training loop, I get the following error as soon as it reaches validation_epoch_end (code below):
File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 979, in fit
self.single_gpu_train(model)
File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 185, in single_gpu_train
self.run_pretrain_routine(model)
File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1139, in run_pretrain_routine
False)
File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 342, in _evaluate
eval_results = model.validation_epoch_end(outputs)
File "/mnt/shared/home/mvera/redacted.py", line 365, in validation_epoch_end
losses[loss] += value
RuntimeError: Trying to pass too many CPU scalars to CUDA kernel!
I can find absolutely no threads/issues on this error anywhere. Could you provide me with some pointers on how to solve this?
Extra information:
I am basically running a HuggingFace Transformers model with a feed-forward (FF) network on top for sentence classification. Which HuggingFace model I use doesn’t seem to matter.
The code for validation_epoch_end is:
losses = defaultdict(lambda: torch.tensor(0.0))
for output in outputs:
    for loss, value in output['val_losses'].items():
        losses[loss] += value
for loss in losses:
    losses[loss] /= len(outputs)
Any pointer for places/things to look for would be great.
Thanks in advance!
albanD (Alban D), July 2, 2020, 6:24pm, #2
Hey,
This is interesting.
Can you run with the nightly build and set the TORCH_SHOW_CPP_STACKTRACES=1 environment variable to get more information?
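For anyone unsure how to set the variable, here is a minimal sketch of one way to do it from Python, assuming the training script is launched as a separate Python process. The helper name `run_with_cpp_stacktraces` is hypothetical, and the child command below is only a stand-in that echoes the variable; in practice you would pass your own script path instead.

```python
import os
import subprocess
import sys


def run_with_cpp_stacktraces(args):
    """Run `python <args>` with TORCH_SHOW_CPP_STACKTRACES=1 in its environment."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, TORCH_SHOW_CPP_STACKTRACES="1")
    return subprocess.run(
        [sys.executable, *args], env=env, capture_output=True, text=True
    )


# Stand-in child command (assumption): it just prints the variable so the
# effect is visible. Replace the arguments with your own training script.
result = run_with_cpp_stacktraces(
    ["-c", "import os; print(os.environ['TORCH_SHOW_CPP_STACKTRACES'])"]
)
print(result.stdout.strip())
```

Setting the variable in the shell before starting Python should have the same effect; the point is that it is present in the environment of the process that imports torch.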
Hey,
Running on PyTorch 1.7.0.dev20200702+cu101 with TORCH_SHOW_CPP_STACKTRACES=1, I get this:
RuntimeError: Trying to pass too many CPU scalars to CUDA kernel!
Exception raised from compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:225 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f5e1f0601e2 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::TensorIterator::compute_types(at::TensorIteratorConfig const&) + 0xe12 (0x7f5e638d7382 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::build(at::TensorIteratorConfig&) + 0x6b (0x7f5e638d9f6b in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 0xdd (0x7f5e638da5dd in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x14a (0x7f5e638da78a in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::add_out(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar) + 0x33 (0x7f5e63611213 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xf3efc2 (0x7f5e20407fc2 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x2f8cbe8 (0x7f5e659d4be8 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x240308 (0x7f5e732d3308 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x240cb6 (0x7f5e732d3cb6 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: _PyMethodDef_RawFastCallDict + 0x24d (0x557b0e058bfd in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #11: _PyCFunction_FastCallDict + 0x21 (0x557b0e058d81 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #12: <unknown function> + 0x17d029 (0x557b0e09e029 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #13: <unknown function> + 0x204e4c (0x557b0e125e4c in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #14: PyNumber_InPlaceAdd + 0x2e4 (0x557b0e06a1c4 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x13ca (0x557b0e0eda7a in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #16: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #17: _PyEval_EvalFrameDefault + 0x4b39 (0x557b0e0f11e9 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #19: _PyFunction_FastCallKeywords + 0x325 (0x557b0e08a255 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #20: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #21: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #22: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #23: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #24: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #25: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #26: _PyFunction_FastCallKeywords + 0x325 (0x557b0e08a255 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #27: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #28: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #29: _PyFunction_FastCallKeywords + 0x325 (0x557b0e08a255 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x416 (0x557b0e0ecac6 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #31: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #32: _PyEval_EvalFrameDefault + 0x4b39 (0x557b0e0f11e9 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #33: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #34: _PyEval_EvalFrameDefault + 0x4b39 (0x557b0e0f11e9 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #35: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x557b0e0ecac6 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #37: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #38: PyEval_EvalCodeEx + 0x44 (0x557b0e0372b4 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #39: PyEval_EvalCode + 0x1c (0x557b0e0372dc in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #40: <unknown function> + 0x22c664 (0x557b0e14d664 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #41: PyRun_FileExFlags + 0xa1 (0x557b0e157a91 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #42: PyRun_SimpleFileExFlags + 0x1c3 (0x557b0e157c83 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #43: <unknown function> + 0x237db5 (0x557b0e158db5 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #44: _Py_UnixMain + 0x3c (0x557b0e158edc in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #45: __libc_start_main + 0xe7 (0x7f5e76bcab97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: <unknown function> + 0x1db3e0 (0x557b0e0fc3e0 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
I hope this tells you more than it does me.
I assume the error is raised in:
losses[loss] += value
Could you please post the shapes of all the tensors needed to reproduce this issue?
albanD (Alban D), July 4, 2020, 7:59pm, #5
I would guess two 0-dim Tensors, so the kernel tries to pass both as direct arguments, hence the issue.
I’m sure @ngimel will know?
ngimel (ngimel), July 5, 2020, 4:01am, #6
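@ptrblck is right, it’s coming from accumulating the losses. A simple repro is:
import torch
a = torch.tensor(2.)                 # 0-dim tensor on the CPU
b = torch.tensor(2., device="cuda")  # 0-dim tensor on the GPU
a += b                               # in-place add raises the RuntimeError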
The workaround is:
a = a + b
(a would be on the GPU in this case.)
I’ve opened an upstream issue to track this.
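Applied to the validation_epoch_end loop from the first post, the same idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name `average_losses` is hypothetical, each accumulator is seeded from the first value seen rather than from a CPU `torch.tensor(0.0)`, and plain floats stand in for the 0-dim loss tensors so the snippet runs without a GPU.

```python
def average_losses(outputs):
    """Average each named loss across validation steps without in-place ops."""
    totals = {}
    for output in outputs:
        for name, value in output["val_losses"].items():
            # Out-of-place add: `total + value` builds a new result instead of
            # mutating a CPU accumulator, which is what `+=` did originally.
            totals[name] = value if name not in totals else totals[name] + value
    # Out-of-place division as well, for the same reason.
    return {name: total / len(outputs) for name, total in totals.items()}


# Plain floats stand in for 0-dim loss tensors (assumption for this demo).
outputs = [
    {"val_losses": {"ce": 1.0, "mse": 4.0}},
    {"val_losses": {"ce": 3.0, "mse": 2.0}},
]
averaged = average_losses(outputs)
print(averaged)
```

Because nothing here is framework-specific, the function should also accept tensors as values; in that case the averaged results stay on whatever device the step losses were on.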
Sorry for my late reply.
This completely solved my problems and I’ve been using pytorch nightly for days with no hiccups.
Thanks a lot!