Noop flag device error with apex optimizers

I’m getting the following RuntimeError when using the apex optimizer FusedSGD (I get it with all apex optimizers). I have no idea what it means; I checked inputs, targets, loss, and weights, and everything seems to be on the same CUDA device.
Any ideas? Should I open a bug issue on the apex GitHub?

I’m using the EfficientDet model by Ross Wightman (efficientdet-pytorch).

...
...
  File "mylib/apps/training/lib/models/detection/effdet/engine.py", line 96, in train_one_epoch
    self.scaler.step(self.optimizer)
  File "mylib/torchenv/lib/python3.6/site-packages/torch/cuda/amp/grad_scaler.py", line 321, in step
    retval = optimizer.step(*args, **kwargs)
  File "mylib/torchenv/lib/python3.6/site-packages/apex/optimizers/fused_sgd.py", line 222, in step
    1.0/self.most_recent_scale)
  File "mylib/torchenv/lib/python3.6/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: expected noop flag to be on the same device as tensors

Here’s my code to do the forward/backward step:

for images, targets in progress_bar(self.train_loader, parent=mb):
    targets = {k: v.to(self.device) for k, v in targets.items()}

    with torch.cuda.amp.autocast():
        loss_dict = model(images, targets)

    # Get loss values from dict
    loss = loss_dict["loss"]

    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()
    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)
    optimizer.zero_grad()
    # Updates the scale for next iteration
    scaler.update()

and scaler is just scaler = torch.cuda.amp.GradScaler()

Mixing native amp (via torch.cuda.amp) and apex/amp might yield these issues, and we recommend sticking to the native implementation. multi_tensor_apply is already implemented in upstream PyTorch, and more optimizers are being worked on to enable it.
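
If you drop apex entirely, your loop works unchanged with a stock PyTorch optimizer. Here is a minimal sketch, assuming torch.optim.SGD as a drop-in replacement for apex FusedSGD (the model, train_loader, and device names come from your snippet; the lr/momentum values are placeholders):

import torch

# Replaces apex FusedSGD; hyperparameters are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()

for images, targets in train_loader:
    targets = {k: v.to(device) for k, v in targets.items()}

    with torch.cuda.amp.autocast():
        loss_dict = model(images, targets)
        loss = loss_dict["loss"]

    scaler.scale(loss).backward()   # scale the loss and create scaled gradients
    scaler.step(optimizer)          # unscale gradients, then call or skip optimizer.step()
    scaler.update()                 # update the scale factor for the next iteration
    optimizer.zero_grad()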