torch.optim.SGD results in NaN

I followed this tutorial and tried to modify it a little bit to see if I understand things correctly. However, when I tried to use torch.optim.SGD instead of updating the weights manually, the loss blew up to inf and then to NaN:

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0")
dtype = torch.float
N, D_in, H, D_out = 64, 1000, 100, 10  # batch size, input dim, hidden dim, output dim (same as the tutorial)

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.nn.Parameter(torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True))
w2 = torch.nn.Parameter(torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True))
lr = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=lr)
for t in range(500):
    # forward pass: ReLU(x @ w1) @ w2
    layer_1 = x.matmul(w1)
    layer_1 = F.relu(layer_1)
    y_pred = layer_1.matmul(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    # backward pass and SGD update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
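
In case it helps with diagnosing this, the gradient norms can be printed alongside the loss with one extra print right after backward() (just a sketch; only the print is new, everything else is the loop above unchanged):

# same training loop as above, with one extra print to watch the gradient magnitudes
for t in range(500):
    y_pred = F.relu(x.matmul(w1)).matmul(w2)
    loss = (y_pred - y).pow(2).sum()
    optimizer.zero_grad()
    loss.backward()
    print(t, loss.item(), w1.grad.norm().item(), w2.grad.norm().item())
    optimizer.step()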

My loss blows up to inf at the third iteration and to nan afterwards, which is completely different from what I get when I update the weights manually. The code for the manual update is below (it is also in the tutorial link).

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)


w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # forward pass: clamp(min=0) is the ReLU
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    loss.backward()

    # manual gradient-descent update, then reset the gradients for the next step
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        w1.grad.zero_()
        w2.grad.zero_()
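
As far as I understand, optimizer.step() for plain SGD (lr only, no momentum or weight decay) should do essentially the same thing as this manual update, i.e. something along these lines (just my mental model, not the actual implementation):

# what I expect a bare SGD step to be equivalent to
with torch.no_grad():
    for p in [w1, w2]:
        p -= lr * p.grad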

I wonder what is wrong with my modified version. When I replaced SGD with Adam (roughly the swap sketched below), the results came out fine: the loss decreased after each iteration, with no inf or nan.
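
For reference, the Adam run only differed in the optimizer construction, roughly like this (the learning rate shown is just the same variable as above, not necessarily the best value for Adam):

# swap SGD for Adam; the rest of the training loop stays the same
optimizer = torch.optim.Adam([w1, w2], lr=lr)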