DataLoader time differs significantly between O0 and O1

    import time

    # in the training loop
    end = time.time()
    for i, data_in in enumerate(train_loader):
        if data_in is None:
            # skip empty batches; unpacking None below would raise a TypeError
            print("empty batch")
            continue
        input, target = data_in
        # measure data loading time
        data_time.update(time.time() - end)
        input_image = input.cuda()
        target = target.cuda()

I am trying to train a ResNet-50 on my own dataset with apex. The total training time with O0 and O1 is almost the same. It seems that O1 always spends a lot more time than O0 in the DataLoader when it encounters large images, which offsets the speedup from the FP16 computation. My data processing includes random resize, flip, jitter, and normalize.

Could you time your data loading as done in the ImageNet example?

If your bottleneck is the data loading, you won’t get any significant speedup using AMP.
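
As a rough sketch of what I mean by timing the data loading: the snippet below splits each iteration into the time spent waiting for the next batch and the total iteration time. `AverageMeter` is a simplified stand-in for the helper of the same name in the ImageNet example, while `profile_loader`, `step_fn`, and `max_batches` are hypothetical names I made up for illustration. `step_fn` is a placeholder for your forward/backward/optimizer step; since CUDA kernels run asynchronously, it should end with `torch.cuda.synchronize()` on a GPU, otherwise GPU time can show up in the wrong bucket.

```python
import time


class AverageMeter:
    """Tracks the running sum, count, and average of a measured value."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


def profile_loader(loader, step_fn, max_batches=50):
    """Return (avg seconds waiting for a batch, avg seconds per iteration)."""
    data_time = AverageMeter()
    batch_time = AverageMeter()
    end = time.perf_counter()
    for i, batch in enumerate(loader):
        if i >= max_batches:
            break
        # Time spent blocked on the loader for this batch.
        data_time.update(time.perf_counter() - end)
        # Placeholder for the training step (assumed to synchronize on GPU).
        step_fn(batch)
        # Total time for this iteration: loading plus compute.
        batch_time.update(time.perf_counter() - end)
        end = time.perf_counter()
    return data_time.avg, batch_time.avg
```

Run it once with O0 and once with O1, passing your `train_loader` as `loader`. If the first number is close to the second in both runs, the loader is the bottleneck, and raising `num_workers` or setting `pin_memory=True` on the `DataLoader` is likely to help more than mixed precision.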