```python
# In the training loop
end = time.time()
for i, data_in in enumerate(train_loader):
    # skip empty batches produced by the dataloader
    if data_in is None:
        print("empty batch")
        continue
    input, target = data_in
    # synchronize so the data-loading time measurement is accurate
    torch.cuda.synchronize()
    data_time.update(time.time() - end)
    input_image = input.cuda()
    target = target.cuda()
```
I am trying to train a ResNet-50 on my own dataset with apex. The epoch time for O0 and O1 is almost the same. It seems that with large images, O1 always spends much more time than O0 in the dataloader, which offsets the speedup from the FP16 computation. My data pipeline includes random resize, flip, color jitter, and normalize.
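To confirm where the time goes, it helps to track data-loading time and total iteration time separately, as the loop above starts to do. Below is a minimal, torch-free sketch of that timing pattern; `AverageMeter` and `fake_loader` are hypothetical stand-ins for the meter class and `DataLoader` used in the real loop, with `time.sleep` simulating augmentation and compute cost:

```python
import time

class AverageMeter:
    """Running average, a stand-in for the meter used in the training loop."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0
    def update(self, val):
        self.sum += val
        self.count += 1
    @property
    def avg(self):
        return self.sum / max(self.count, 1)

def fake_loader(n):
    """Stand-in for a DataLoader; the sleep simulates slow augmentation."""
    for i in range(n):
        time.sleep(0.01)  # pretend resize/jitter costs ~10 ms per batch
        yield i

data_time = AverageMeter()
batch_time = AverageMeter()
end = time.perf_counter()
for batch in fake_loader(5):
    data_time.update(time.perf_counter() - end)   # time spent waiting on data
    time.sleep(0.002)                             # pretend forward/backward
    batch_time.update(time.perf_counter() - end)  # total iteration time
    end = time.perf_counter()

print(f"data {data_time.avg:.3f}s / iter {batch_time.avg:.3f}s")
```

If `data_time` dominates `batch_time` in both O0 and O1 runs, the bottleneck is the CPU-side augmentation pipeline rather than the mixed-precision math, and no opt_level will show a speedup until the dataloader is faster (e.g. more `num_workers`, `pin_memory=True` with non-blocking transfers).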