FP16 on PyTorch 0.4

The examples at https://github.com/csarofeen/examples/tree/dist_fp16/imagenet appear to target PyTorch 0.3. In particular, when I tried to update set_grad in fp16utils for 0.4 by removing .data, I got the following error. Any tips? Thank you!

RuntimeError                              Traceback (most recent call last)
<ipython-input-18-d670cc97fa5f> in <module>()
    174 print("total num params:", np.sum([np.prod(x.shape) for x in conv_model.parameters()]))
    175 # conv_model(data[0][0][None,:,None].cuda()).shape
--> 176 train(conv_model,data,100,lr=1e-3, half=half, cuda=cuda)
    177 # 1.91E+02

<ipython-input-18-d670cc97fa5f> in train(model, data, nepochs, lr, half, cuda)
    142                 model.zero_grad()
    143                 loss.backward()
--> 144                 set_grad(param_copy, list(model.parameters()))
    145             else:
    146                 optimizer.zero_grad()

<ipython-input-15-8c52d7e662cf> in set_grad(params, params_with_grad)
      4         if param.grad is None:
      5             param.grad = torch.nn.Parameter(param.data.new().resize_(*param.data.size()))
----> 6         param.grad.copy_(param_w_grad.grad.data)

RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.

You can look here for a 0.4-based fp16 ImageNet example: https://github.com/NVIDIA/apex/tree/master/examples/imagenet
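The error itself comes from the line `param.grad = torch.nn.Parameter(...)`: a Parameter requires grad by default, and on 0.4 an in-place `copy_` on a leaf that requires grad is disallowed. A minimal sketch of a 0.4-compatible fix (not the apex code itself, just the same master-copy pattern) is to allocate `.grad` as a plain tensor instead, so the in-place copy is legal:

```python
import torch

def set_grad(params, params_with_grad):
    """Copy gradients from the (fp16) model parameters into the fp32
    master copy. Allocating .grad as a plain tensor (which does not
    require grad) avoids the in-place-operation error on 0.4."""
    for param, param_w_grad in zip(params, params_with_grad):
        if param.grad is None:
            # plain tensor, not torch.nn.Parameter: copy_ into it is allowed
            param.grad = param.data.new(*param.data.size())
        param.grad.copy_(param_w_grad.grad.data)

# minimal CPU check: gradient of (2 * x).sum() w.r.t. x is all twos
model_param = torch.nn.Parameter(torch.ones(3))
(model_param * 2).sum().backward()
master_copy = [torch.nn.Parameter(torch.zeros(3))]
set_grad(master_copy, [model_param])
print(master_copy[0].grad)
```

Alternatively, keeping the original `param.grad.data.copy_(...)` (i.e. not removing `.data`) also sidesteps the check, since `.data` detaches from autograd.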
