When using:
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
with torch.cuda.amp.autocast(enabled=use_amp):
...
how should use_amp be determined?
Should it explicitly reflect whether the GPU actually benefits from AMP (e.g. via a compute-capability check), or is torch.cuda.is_available() sufficient on its own?
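To make the question concrete, here is a minimal sketch of the kind of check I mean. The helper name amp_supported and the compute-capability >= 7.0 (Volta) threshold are my assumptions, not anything from the docs:

```python
import torch

def amp_supported() -> bool:
    # Hypothetical helper: enable AMP only when a CUDA device is present
    # and its compute capability is at least 7.0 (Volta), where float16
    # tensor cores make autocast worthwhile. AMP also runs on older GPUs,
    # but often without a meaningful speedup.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 7

use_amp = amp_supported()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
```

On a CPU-only machine this yields use_amp = False, so the GradScaler and autocast become no-ops rather than erroring out.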
Also, when using DistributedDataParallel (DDP), what is the recommended way to handle AMP if the GPUs in the cluster differ in their AMP support?
Is heterogeneous AMP across ranks supported, or should AMP be enabled or disabled globally for all ranks?