Expected scalars to be on CPU, got meta instead

I’m sorted and I don’t need help; the point of this post is simply to get the error message indexed by the search engines, so that other people don’t waste time the way I just have.

I have a version of pytorch-eg-wordLM with two optimisers: one for the word embeddings and one for everything else. I compile my model and my train() function separately. torch.compile works wonderfully except in the case below:

model = torch.compile(RNNModel(...))

sparse = [ 'word', '_orig_mod.word' ]
optDense = torch.optim.AdamW([p for n, p in model.named_parameters() if n not in sparse], fused=True)
optEmbed = torch.optim.AdamW([p for n, p in model.named_parameters() if n in sparse])

def train(batch_size):
    model.train()
    hidden = model.init_hidden(batch_size)
    for batch, i in enumerate(range(0, train_data.size(0) - arg.bptt, arg.bptt)):
        data, targets = get_batch(train_data, i, device=device)
        model.zero_grad()
        hidden = repackage_hidden(hidden)
        output, hidden, _ = model(data, hidden, targets=targets)
        loss = criterALL(output, targets.long())
        item = loss.item()
        loss.backward()
        optDense.step()
        optEmbed.step()
        print('working')
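
The train() function itself is then compiled and called separately, roughly like this (a sketch, since I’ve trimmed the real driver loop):

train = torch.compile(train)   # compile the training step as well as the model
train(arg.batch_size)          # assuming the batch size lives on arg, like arg.bptt above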

Which results in the error message:

torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in method _foreach_addcdiv_ of type object at 0x7365f3a9f6c0>(*([Parameter(FakeTensor(..., device='cuda:0', size=(32766, 1920), requires_grad=True))], [FakeTensor(..., device='cuda:0', size=(32766, 1920))], (FakeTensor(..., device='cuda:0', size=(32766, 1920)),), FakeTensor(..., size=(1,))), **{}):
Expected scalars to be on CPU, got meta instead.

from user code:
   File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/optim/adamw.py", line 767, in adamw
    func(
  File "/usr/lib/python3/dist-packages/torch/optim/adamw.py", line 604, in _multi_tensor_adamw
    torch._foreach_addcdiv_(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

The answer, of course, is ‘well, don’t do that then’: I had set fused=True in one optimiser but not in the other.
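
Making the two consistent, either both fused or neither, gets rid of the error; a sketch of the both-fused version:

sparse = [ 'word', '_orig_mod.word' ]
# Both optimisers fused (dropping fused=True from both works too); mixing a
# fused and a non-fused AdamW inside the compiled train() is what blew up.
optDense = torch.optim.AdamW(
    [p for n, p in model.named_parameters() if n not in sparse], fused=True)
optEmbed = torch.optim.AdamW(
    [p for n, p in model.named_parameters() if n in sparse], fused=True)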

[ I can of course provide the full (cut-down) code if that is of interest to anyone ]