Noop flag device error with apex optimizers

I’m getting the following RuntimeError with the apex optimizer FusedSGD (in fact I got it for all apex optimizers). I have no idea what it means; I checked that inputs, targets, loss, and weights all seem to be on the same CUDA device.
Any ideas? Should I open a bug issue on the apex GitHub?

I’m using the EfficientDet model by Ross Wightman (efficientdet-pytorch).
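One way to double-check the device placement beyond inputs and loss is to enumerate every device that holds a model parameter or an optimizer state tensor. A minimal sketch (with a toy model and optimizer standing in for the real ones in the training setup):

```python
import torch

# Hypothetical stand-ins for the real model/optimizer in the training setup
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Collect every device that holds a parameter or optimizer state tensor
devices = {p.device for p in model.parameters()}
for state in optimizer.state.values():
    devices |= {v.device for v in state.values() if torch.is_tensor(v)}

print(devices)  # should contain exactly one device
```

If this set has more than one entry, some tensor was left behind on the wrong device.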

  File "mylib/apps/training/lib/models/detection/effdet/", line 96, in train_one_epoch
  File "mylib/torchenv/lib/python3.6/site-packages/torch/cuda/amp/", line 321, in step
    retval = optimizer.step(*args, **kwargs)
  File "mylib/torchenv/lib/python3.6/site-packages/apex/optimizers/", line 222, in step
  File "mylib/torchenv/lib/python3.6/site-packages/apex/multi_tensor_apply/", line 30, in __call__
RuntimeError: expected noop flag to be on the same device as tensors

Here’s my code to do the forward/backward step:

for images, targets in progress_bar(self.train_loader, parent=mb):
    targets = {k: v.to(device) for k, v in targets.items()}

    with torch.cuda.amp.autocast():
        loss_dict = model(images, targets)

    # Get loss value from dict
    loss = loss_dict["loss"]

    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()

    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)

    # Updates the scale for next iteration
    scaler.update()

and scaler is defined as scaler = torch.cuda.amp.GradScaler().

Mixing native amp (via torch.cuda.amp) and apex/amp can yield these issues, and we recommend sticking to the native implementation. multi_tensor_apply is already implemented in upstream PyTorch, and more optimizers are being worked on to enable it.
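As a sketch of that recommendation: the same GradScaler loop works unchanged once the apex optimizer is swapped for a native one. Below, a toy linear model and torch.optim.SGD stand in for the real EfficientDet model and FusedSGD; the enabled flags just let the snippet also run on a CPU-only machine.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # autocast/GradScaler are no-ops when disabled

# Toy stand-ins for the real model and apex FusedSGD
model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    images = torch.randn(4, 10, device=device)
    targets = torch.randint(0, 2, (4,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.cross_entropy(model(images), targets)

    scaler.scale(loss).backward()  # scaled gradients
    scaler.step(optimizer)         # unscales, then steps (or skips on inf/nan)
    scaler.update()                # adjusts the scale for the next iteration
```

The scaler calls are identical to the native-amp pattern in the question; only the optimizer construction changes.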