FP16 on Windows, PyTorch v0.4

Hey,

I’m running into an unusual bug on PyTorch v0.4 for Windows.
I’m training the same model once with .cuda().half() and once with plain .cuda(), and I get different results.
With .cuda().half() the loss is NaN after the second step, but with just .cuda() the loss decreases; the downside is that in FP32 I can’t increase my batch size.
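
For reference, a stripped-down sketch of what I mean (the model, sizes, and data here are placeholders, not my actual network, and it needs a CUDA GPU):

```python
import torch
import torch.nn as nn

# Placeholder model; my real network differs.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# FP32 variant: loss decreases, but memory limits the batch size.
# model = model.cuda()
# FP16 variant: loss is NaN after the second step.
model = model.cuda().half()

x = torch.randn(32, 512).cuda().half()          # .half() only for the FP16 run
y = torch.randint(0, 10, (32,), dtype=torch.long).cuda()

criterion = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(5):
    opt.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    opt.step()
    print(step, loss.item())
```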

I’m using the following versions:

__Python VERSION: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
__pyTorch VERSION: 0.4.0
__CUDA VERSION
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:32_Central_Daylight_Time_2017
Cuda compilation tools, release 9.0, V9.0.176
__CUDNN VERSION: 7005

Is anyone else experiencing this on the Windows version?

Tal

Half precision is touchy with respect to numerical stability. Without further analysis of your network, it seems much more likely that you are running into this inherent instability rather than an actual PyTorch bug. That said, if you can reduce it to “this op with these inputs is NaN when it shouldn’t be”, it could be a bug.
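
One way to narrow it down, as a rough sketch (the helper below is something I’m improvising here, not a library function): register forward hooks that report the first module whose output contains a NaN, then feed that module’s inputs to the same op in float32 and compare.

```python
import torch

def add_nan_checks(model):
    # Hypothetical helper: flag modules whose output contains a NaN.
    # (t != t) is true exactly for NaN entries, and also works on
    # builds that predate torch.isnan.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and (output != output).any():
                print("NaN in output of", name or "<root>", type(module).__name__)
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call once before the forward pass that produces the NaN loss.
# add_nan_checks(model)
```

The first module reported is usually the place to start looking; running just that op on the saved FP16 inputs, and again after casting them to .float(), tells you whether the NaN is inherent to half precision there.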

Best regards

Thomas