I’m sorted and I don’t need help; the point of this post is to get the error message below indexed by the search engines, so that other people don’t waste time the way I just have.
I have a version of pytorch-eg-wordLM with two optimisers: one for the word embeddings and one for everything else. I compile the model and the train() function separately. torch.compile works wonderfully except in the case below:
model = torch.compile(RNNModel(...))
sparse = [ 'word', '_orig_mod.word' ]  # the embedding parameter, with and without the '_orig_mod.' prefix torch.compile adds
optDense = torch.optim.AdamW([p for n, p in model.named_parameters() if n not in sparse], fused=True)
optEmbed = torch.optim.AdamW([p for n, p in model.named_parameters() if n in sparse])
def train(batch_size):
    model.train()
    hidden = model.init_hidden(batch_size)
    # walk over the training data in BPTT-sized chunks
    for batch, i in enumerate(range(0, train_data.size(0) - arg.bptt, arg.bptt)):
        data, targets = get_batch(train_data, i, device=device)
        model.zero_grad()
        hidden = repackage_hidden(hidden)  # detach the hidden state carried over from the previous chunk
        output, hidden, _ = model(data, hidden, targets=targets)
        loss = criterALL(output, targets.long())
        item = loss.item()
        loss.backward()
        optDense.step()  # fused AdamW for everything except the embeddings
        optEmbed.step()  # un-fused AdamW for the word embeddings
        print('working')
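For completeness, the train() function is compiled separately as well, roughly like this (arg.batch_size is just a stand-in here for however the batch size actually arrives):

# compile the training function too; Dynamo then traces the optimiser steps inside it
train = torch.compile(train)
train(arg.batch_size)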
Calling the compiled train() results in the error message:
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in method _foreach_addcdiv_ of type object at 0x7365f3a9f6c0>(*([Parameter(FakeTensor(..., device='cuda:0', size=(32766, 1920), requires_grad=True))], [FakeTensor(..., device='cuda:0', size=(32766, 1920))], (FakeTensor(..., device='cuda:0', size=(32766, 1920)),), FakeTensor(..., size=(1,))), **{}):
Expected scalars to be on CPU, got meta instead.
from user code:
File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 161, in maybe_fallback
return func(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/optim/adamw.py", line 767, in adamw
func(
File "/usr/lib/python3/dist-packages/torch/optim/adamw.py", line 604, in _multi_tensor_adamw
torch._foreach_addcdiv_(
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
The answer, of course, is ‘well, don’t do that then’: I had set fused=True in one optimiser but not in the other.
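For anyone who wants the concrete fix rather than the moral: make the two optimisers agree. A minimal sketch, reusing the parameter split from above; either give both fused=True as below (fused AdamW wants the parameters on CUDA and dense gradients), or drop the flag from both:

# both optimisers are now constructed with the same fused setting
optDense = torch.optim.AdamW([p for n, p in model.named_parameters() if n not in sparse], fused=True)
optEmbed = torch.optim.AdamW([p for n, p in model.named_parameters() if n in sparse], fused=True)

Once the two settings match, torch.compile gets through the optimiser steps without complaint.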
[ I can, of course, provide the full (cut-down) code if that is of interest to anyone ]