Trying to pass too many CPU scalars to CUDA kernel!

captainvera · July 2, 2020, 4:16pm

Hello,

I have been trying to use either the PyTorch 1.6.0 RC or the PyTorch nightly (currently torch-1.7.0.dev20200702+cu101) to get access to native 16-bit precision support.

On the implementation side, I am using pytorch-lightning (0.8.4) to handle all the training boilerplate code.

Whenever I try to run any training loop, I get the following error as soon as execution reaches validation_epoch_end (code below):

  File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 979, in fit
    self.single_gpu_train(model)
  File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 185, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1139, in run_pretrain_routine
    False)
  File "/mnt/shared/home/mvera/.virtualenvs/VMQE/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 342, in _evaluate
    eval_results = model.validation_epoch_end(outputs)
  File "/mnt/shared/home/mvera/redacted.py", line 365, in validation_epoch_end
    losses[loss] += value
RuntimeError: Trying to pass too many CPU scalars to CUDA kernel!

I can find absolutely no threads/issues on this error anywhere. Could you provide me with some pointers on how to solve this?

Extra information:
I am basically running a HuggingFace Transformers model with a feed-forward (FF) layer on top for sentence classification. Which HuggingFace model I use doesn’t seem to matter.

Code for validation_epoch_end is:

        losses = defaultdict(lambda: torch.tensor(0.0))
        for output in outputs:
            for loss, value in output['val_losses'].items():
                losses[loss] += value
        for loss in losses:
            losses[loss] /= len(outputs)

Any pointer for places/things to look for would be great.

Thanks in advance!

albanD · July 2, 2020, 6:24pm

Hey,

This is interesting.
Can you run with the nightly build and set the TORCH_SHOW_CPP_STACKTRACES=1 environment variable to get more information?

captainvera · July 3, 2020, 11:47am

Hey,

Running on pytorch 1.7.0.dev20200702+cu101 with TORCH_SHOW_CPP_STACKTRACES=1.

I get this:

RuntimeError: Trying to pass too many CPU scalars to CUDA kernel!
Exception raised from compute_types at /pytorch/aten/src/ATen/native/TensorIterator.cpp:225 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f5e1f0601e2 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::TensorIterator::compute_types(at::TensorIteratorConfig const&) + 0xe12 (0x7f5e638d7382 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::build(at::TensorIteratorConfig&) + 0x6b (0x7f5e638d9f6b in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 0xdd (0x7f5e638da5dd in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x14a (0x7f5e638da78a in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::add_out(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar) + 0x33 (0x7f5e63611213 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xf3efc2 (0x7f5e20407fc2 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x2f8cbe8 (0x7f5e659d4be8 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x240308 (0x7f5e732d3308 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x240cb6 (0x7f5e732d3cb6 in /home/ubuntu/.virtualenvs/pytorch1.6/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: _PyMethodDef_RawFastCallDict + 0x24d (0x557b0e058bfd in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #11: _PyCFunction_FastCallDict + 0x21 (0x557b0e058d81 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #12: <unknown function> + 0x17d029 (0x557b0e09e029 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #13: <unknown function> + 0x204e4c (0x557b0e125e4c in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #14: PyNumber_InPlaceAdd + 0x2e4 (0x557b0e06a1c4 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x13ca (0x557b0e0eda7a in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #16: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #17: _PyEval_EvalFrameDefault + 0x4b39 (0x557b0e0f11e9 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #19: _PyFunction_FastCallKeywords + 0x325 (0x557b0e08a255 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #20: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #21: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #22: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #23: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #24: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #25: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #26: _PyFunction_FastCallKeywords + 0x325 (0x557b0e08a255 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #27: _PyEval_EvalFrameDefault + 0x690 (0x557b0e0ecd40 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #28: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #29: _PyFunction_FastCallKeywords + 0x325 (0x557b0e08a255 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x416 (0x557b0e0ecac6 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #31: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #32: _PyEval_EvalFrameDefault + 0x4b39 (0x557b0e0f11e9 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #33: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #34: _PyEval_EvalFrameDefault + 0x4b39 (0x557b0e0f11e9 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #35: _PyFunction_FastCallKeywords + 0xfb (0x557b0e08a02b in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x557b0e0ecac6 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #37: _PyEval_EvalCodeWithName + 0x2f9 (0x557b0e036389 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #38: PyEval_EvalCodeEx + 0x44 (0x557b0e0372b4 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #39: PyEval_EvalCode + 0x1c (0x557b0e0372dc in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #40: <unknown function> + 0x22c664 (0x557b0e14d664 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #41: PyRun_FileExFlags + 0xa1 (0x557b0e157a91 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #42: PyRun_SimpleFileExFlags + 0x1c3 (0x557b0e157c83 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #43: <unknown function> + 0x237db5 (0x557b0e158db5 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #44: _Py_UnixMain + 0x3c (0x557b0e158edc in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)
frame #45: __libc_start_main + 0xe7 (0x7f5e76bcab97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #46: <unknown function> + 0x1db3e0 (0x557b0e0fc3e0 in /home/ubuntu/.virtualenvs/pytorch1.6/bin/python3.7)

I hope this tells you more than it tells me.

ptrblck · July 4, 2020, 7:06am

I assume the error is raised in:

losses[loss] += value

Could you post the shapes of all tensors, which would reproduce this issue, please?

albanD · July 4, 2020, 7:59pm

I would guess that both operands are 0-dim tensors, so the kernel tries to pass both of them as direct arguments, hence the error.
I’m sure @ngimel will know.

ngimel · July 5, 2020, 4:01am

@ptrblck is right, it’s coming from accumulating the losses. A simple repro is:

import torch

a = torch.tensor(2.)                  # 0-dim CPU tensor
b = torch.tensor(2., device="cuda")   # 0-dim CUDA tensor
a += b                                # in-place add raises the RuntimeError

The workaround is:

a = a + b

(a would be on the GPU in this case.)
I’ve opened an upstream issue to track this.
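
Applied to the loop from the original post, the workaround might look like the sketch below. The function name average_val_losses is made up for illustration, and the code assumes that every entry in outputs carries a 'val_losses' dict mapping a loss name to a 0-dim tensor, as in the snippet posted above. It only swaps the in-place += and /= for their out-of-place forms.

```python
from collections import defaultdict

import torch


def average_val_losses(outputs):
    """Average each named loss over all validation step outputs.

    Hypothetical helper: assumes each element of `outputs` has a
    'val_losses' dict of {loss_name: 0-dim tensor}.
    """
    # The accumulator starts as a 0-dim CPU tensor, as in the original code.
    losses = defaultdict(lambda: torch.tensor(0.0))
    for output in outputs:
        for name, value in output["val_losses"].items():
            # Out-of-place add: builds a new tensor instead of writing
            # into the CPU accumulator in place.
            losses[name] = losses[name] + value
    for name in losses:
        # Out-of-place division, for the same reason.
        losses[name] = losses[name] / len(outputs)
    return dict(losses)
```

Inside validation_epoch_end this would be called as losses = average_val_losses(outputs). After the first addition with a CUDA value, the accumulated tensor should live on the GPU, as noted above for a = a + b.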

captainvera · July 8, 2020, 6:24pm

Sorry for my late reply.

This completely solved my problems and I’ve been using pytorch nightly for days with no hiccups.

Thanks a lot!