How to replace apex.amp with PyTorch AMP?

pvtien96 · June 14, 2023, 1:24pm

Hi, I’m training a ResNet model on a machine with 4x A40 GPUs. The code is from this repository.

When I run the program, it logs:

/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
  warnings.warn(msg, DeprecatedFeatureWarning)
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : 128.0
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
/home/van-tien.pham/anaconda3/lib/python3.9/site-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.parallel.DistributedDataParallel is deprecated and will be removed by the end of February 2023.
  warnings.warn(msg, DeprecatedFeatureWarning)
Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.
RUNNING EPOCHS FROM 0 TO 250
DLL 2023-06-14 15:12:48.267767 - Epoch: 0 Iteration: 100  train.loss : 6.95447  train.total_ips : 1020.75 img/s
DLL 2023-06-14 15:14:16.889414 - Epoch: 0 Iteration: 200  train.loss : 6.85262  train.total_ips : 1042.96 img/s

My understanding is that training still works, but I wonder how I can replace NVIDIA apex.amp with PyTorch AMP, as the warning recommends. In the main program, the current code is:

try:
    from apex.parallel import DistributedDataParallel as DDP
    from apex.fp16_utils import *
    from apex import amp
except ImportError:
    raise ImportError(
        "Please install apex from https://www.github.com/nvidia/apex to run this example."
    )
# other code
#############
    if args.amp:
        model_and_loss, optimizer = amp.initialize(
            model_and_loss,
            optimizer,
            opt_level="O1",
            loss_scale="dynamic" if args.dynamic_loss_scale else args.static_loss_scale,
        )

    if args.distributed:
        model_and_loss.distributed()

    model_and_loss.load_model_state(model_state)

    train_loop(
        model_and_loss,
        optimizer,
        lr_policy,
        train_loader,
        val_loader,
        args.fp16,
        logger,
        should_backup_checkpoint(args),
        use_amp=args.amp,
        batch_size_multiplier=batch_size_multiplier,
        start_epoch=start_epoch,
        end_epoch=(start_epoch + args.run_epochs)
        if args.run_epochs != -1
        else args.epochs,
        best_prec1=best_prec1,
        prof=args.prof,
        skip_training=args.evaluate,
        skip_validation=args.training_only,
        save_checkpoints=args.save_checkpoints and not args.evaluate,
        checkpoint_dir=args.workspace,
        checkpoint_filename=args.checkpoint_filename,
        args=args,
    )

I guess that this line needs to be modified:
model_and_loss, optimizer = amp.initialize(
    model_and_loss,
    optimizer,
    opt_level="O1",
    loss_scale="dynamic" if args.dynamic_loss_scale else args.static_loss_scale,
)

I read some threads about PyTorch AMP (Torch distributed data-parallel vs Apex distributed data-parallel - #5 by c_cj) and looked at another repository that uses the native AMP of torch as follows:

    if config.AMP_OPT_LEVEL != "O0":
        if use_amp == 'apex':
            model, optimizer = amp.initialize(model,
                                              optimizer,
                                              opt_level=config.AMP_OPT_LEVEL)
            loss_scaler = ApexScaler()
            if config.LOCAL_RANK == 0:
                logger.info(
                    'Using NVIDIA APEX AMP. Training in mixed precision.')
        elif use_amp == 'native':
            amp_autocast = torch.cuda.amp.autocast
            loss_scaler = NativeScaler()
            if config.LOCAL_RANK == 0:
                logger.info(
                    'Using native Torch AMP. Training in mixed precision.')
        else:
            if config.LOCAL_RANK == 0:
                logger.info('AMP not enabled. Training in float32.')

But I’m unable to figure out how to use torch AMP to replace the aforementioned call, model_and_loss, optimizer = amp.initialize(...).

Another question: does training with NVIDIA apex amp vs. torch AMP yield different accuracy, or does the choice only affect training speed?

Recommendations are appreciated! Thanks in advance!

ptrblck · June 14, 2023, 2:25pm

Check these examples to see how amp is applied using the native utils.
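
For readers without the link at hand, here is a minimal sketch of the native pattern, not the repository's actual code. With native AMP there is no `amp.initialize` call: the model and optimizer stay as they are, `torch.autocast` wraps the forward pass, and a `GradScaler` handles loss scaling in the training loop. The toy model, loss, optimizer, random data, and the `train_steps` helper are assumptions made up for illustration. The sketch falls back to plain float32 when no GPU is available.

```python
import torch
import torch.nn as nn


def train_steps(num_steps=5, init_scale=2.0**16, seed=0):
    """Run a few native-AMP training steps on a toy model; return the losses."""
    torch.manual_seed(seed)
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # float16 autocast targets CUDA; on CPU autocast is disabled below anyway.
    amp_dtype = torch.float16 if use_cuda else torch.bfloat16

    # Toy stand-ins (assumptions) for the real model, loss and optimizer.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Takes over the loss-scaling role of amp.initialize(..., loss_scale=...).
    # Scaling is dynamic; init_scale only sets the starting value.
    scaler = torch.cuda.amp.GradScaler(init_scale=init_scale, enabled=use_cuda)

    inputs = torch.randn(64, 16, device=device)
    targets = torch.randint(0, 4, (64,), device=device)

    losses = []
    for _ in range(num_steps):
        optimizer.zero_grad(set_to_none=True)
        # Takes over the automatic casting that opt_level="O1" patched in.
        with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
            loss = criterion(model(inputs), targets)
        # Replaces `with amp.scale_loss(loss, optimizer) as scaled_loss: ...`.
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # unscales grads, skips the step on inf/NaN
        scaler.update()         # adjusts the scale for the next iteration
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(train_steps())
```

In the posted script this would mean dropping the `amp.initialize(...)` block and moving the `autocast` and `scaler` calls into `train_loop`, wherever the forward and backward passes happen. Note that `GradScaler` always scales dynamically, so a static `loss_scale` such as 128.0 can only be approximated through `init_scale`. Newer PyTorch releases spell the scaler `torch.amp.GradScaler("cuda", ...)`. The native counterpart of `apex.parallel.DistributedDataParallel` is `torch.nn.parallel.DistributedDataParallel`.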