Strangeness in torch.optim

Hi,
I tried to compare the optimization of a neural network model using

  1. Keras
  2. PyTorch (using the weight_decay argument in torch.optim to add the regularizer)
  3. PyTorch (writing a function that computes the L2 norm of the weights, scaled by a coefficient, and adds it to the loss as the regularizer; see the sketch after this list)
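
For concreteness, here is a minimal, self-contained sketch of the difference between methods 2 and 3. The l2_regularization helper used in my code further down is not included in this post, so the version here (along with the toy nn.Linear models, data, and hyperparameters) is only an illustration of the idea:

import torch
import torch.nn as nn

def l2_regularization(coeff, model):
    # Hypothetical version of the helper used in method 3:
    # coeff times the sum of squared parameters.
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    return coeff * penalty

model_a = nn.Linear(4, 1)                      # stand-in for the MLP in method 2
model_b = nn.Linear(4, 1)                      # stand-in for the MLP in method 3
model_b.load_state_dict(model_a.state_dict())  # identical initial weights

x, y = torch.randn(8, 4), torch.randn(8, 1)
loss_func, lr, reg = nn.MSELoss(), 0.1, 0.01

# Method 2: the optimizer adds reg * w to every gradient internally.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=reg)
loss_a = loss_func(model_a(x), y)
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# Method 3: weight_decay stays 0; adding reg/2 * ||w||^2 to the loss
# produces the same reg * w gradient term through autograd.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
loss_b = loss_func(model_b(x), y) + l2_regularization(reg / 2.0, model_b)
opt_b.zero_grad()
loss_b.backward()
opt_b.step()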

I trained the network and looked at the result after one step of stochastic gradient descent to compare the models.

With the regularization coefficient set to zero, all three methods should yield the same result (weight updates, prediction values) given the same training data and initial weights, and this is the case most of the time.
However, sometimes the weight updates differ, along with some strangeness in the program.

I will first explain the strangeness I found. I wrote code that prints something whenever there is a huge difference in predictions between methods 1 vs. 2 and 1 vs. 3. It sometimes reported this:

Gap of output with builtin pytorch  18.459202
Gap of output with manual pytorch  18.459202

However, when I checked the quantity myself, the gap had disappeared:

In[410]: np.average(np.abs(keras_output-builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy()))
Out[410]: 4.091505e-06

It seems as if something was updated (such as the gradient and the projection) after the first print.

I later tried to see what was going on by looking at the gradient computed for one of the weights in methods 2 and 3; both contain the same gradient:

In [412]: manual_model.hidden[1].weight.grad
Out[412]: tensor([[-0.5983]])

In [413]: builtin_model.hidden[1].weight.grad
Out[413]: tensor([[-0.5983]])

I can manually calculate the updated weight to see which one differs from my expectation.
At command [414] I calculated updated weight = old weight - learning rate x gradient, and [415] shows that this matches the value from method 3.

In [414]: temp_model.hidden[1].weight.data-learn_rate*manual_model.hidden[1].weight.grad
Out[414]: tensor([[1.1240]])

In [415]: manual_model.hidden[1].weight.data
Out[415]: tensor([[1.1240]])

The updated weight of method 2 is as follows

In [417]: builtin_model.hidden[1].weight.data
Out[417]: tensor([[1.1257]])

I've included the major components of the code below for reference:

# (User-defined pieces such as MultilayerPerceptron, MultilayerPerceptron_Keras,
#  Data_prep, keras_to_pyt, l2_regularization, to_variable and loss_func are
#  defined elsewhere and omitted here.)
import copy
import time
import numpy as np
import torch
from torch.utils.data import DataLoader
import keras

#Construct Model
temp_n_hidden = copy.deepcopy(n_hidden)
manual_model=MultilayerPerceptron(input_dim, output_dim, temp_n_hidden, dropout)
temp_n_hidden = copy.deepcopy(n_hidden)
builtin_model=MultilayerPerceptron(input_dim, output_dim, temp_n_hidden, dropout)
temp_n_hidden = copy.deepcopy(n_hidden)
temp_model=MultilayerPerceptron(input_dim, output_dim, temp_n_hidden, dropout)
#Construct Keras model
temp_n_hidden = copy.deepcopy(n_hidden)
keras_model=MultilayerPerceptron_Keras(input_dim, output_dim, temp_n_hidden, dropout)
#Copy Weight from Keras to Pytorch
keras_to_pyt(keras_model,builtin_model)
#Copy Weight from Pytorch to Pytorch
manual_model.load_state_dict(builtin_model.state_dict())
temp_model.load_state_dict(builtin_model.state_dict())
#Set up data for training
dataset=Data_prep(X_train, y_train_normalized) 
train_loader=DataLoader(dataset=dataset,
                        batch_size=batch_size,
                        shuffle=True,
                        num_workers=0)

#Training Process
builtin_optimizer = torch.optim.SGD(list(builtin_model.parameters()), lr=learn_rate)
manual_optimizer = torch.optim.SGD(list(manual_model.parameters()), lr=learn_rate)
count=0
for epoch in range(n_epochs):
    for i, data in enumerate(train_loader,0):
        #data for pytorch
        inputs, y=data
        inputs, y=to_variable(var=(inputs, y))
        
        #Manual Training
        manual_outputs = manual_model.forward(inputs)
        MSE=loss_func(manual_outputs, y)
        Weight_Regularization=l2_regularization(reg/2.0,manual_model)
        manual_loss = MSE+Weight_Regularization
        manual_optimizer.zero_grad()
        manual_loss.backward()
        manual_optimizer.step()
        #builtin opt Training
        builtin_outputs = builtin_model.forward(inputs)
        MSE=loss_func(builtin_outputs, y)
        builtin_loss = MSE
        builtin_optimizer.zero_grad()
        builtin_loss.backward()
        builtin_optimizer.step()   
        
        temp= (manual_model.hidden[1].weight.data-builtin_model.hidden[1].weight.data).abs().sum().data.numpy().item()
        
        if count % 100 == 0  :
            print("Weight Regularizaation ",Weight_Regularization.data)
           
            print("Weight difference %.8f" % temp)
            print("Built-in MSE ", builtin_loss.data,"\n")
            print("Manual MSE ", loss_func(manual_outputs.data, y),"\n")
        count+=1

#Keras training
#data for keras
y_train_normalized= np.array(y.numpy(), ndmin = 2).T
adam = keras.optimizers.Adam(lr=learn_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0, amsgrad=False)
sgd = keras.optimizers.SGD(lr=learn_rate)
keras_model.compile(loss='mean_squared_error', optimizer=sgd)
keras_model.fit(X_train, y_train_normalized, batch_size=batch_size, nb_epoch=n_epochs, verbose=0)

#Comparison
keras_output=keras_model.predict(X_train, batch_size=500)
pytorch_builtin_output=builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor))
pytorch_manual_output=manual_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor))
print("Gap of output with manual pytorch ",np.average(np.abs(keras_output-manual_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy())))
print("Gap of output with builtin pytorch ",np.average(np.abs(keras_output-builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy())))
if np.average(np.abs(keras_output-builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy()))>1:
    print("Stop")
    time.sleep(5.5)

Hi,

Could this be due to the fact that the weight_decay term is computed during the optimizer step and the .grad field is updated in place during that? See here for SGD, for example.
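
For reference, a simplified sketch of what the SGD step looks like (momentum, dampening and nesterov left out; this is a paraphrase, not the exact source):

import torch

def sgd_step_sketch(params, lr, weight_decay=0.0):
    # d_p aliases p.grad.data, so the weight-decay add_() below modifies the
    # stored gradient buffer in place; anything inspecting .grad after step()
    # sees grad + weight_decay * w rather than the raw gradient.
    for p in params:
        if p.grad is None:
            continue
        d_p = p.grad.data
        if weight_decay != 0:
            d_p.add_(weight_decay * p.data)   # in-place update of the gradient
        p.data.add_(-lr * d_p)                # parameter update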

But the gradients of methods 2 and 3 are equal, as expected (weight_decay was set to 0), so I am not sure that is the cause of the problem.

Hi,

Sorry, your post is a bit hard to follow.
Is the problem that, after n_epochs of training, PyTorch and Keras yield models that do not predict the same result?

Sorry, it’s a bit difficult to explain. Thanks for your answer.

There are 3 models

  1. Keras
  2. PyTorch (using the weight_decay argument in torch.optim to add the regularizer)
  3. PyTorch (a function that computes the L2 norm of the weights, scaled by a coefficient, and adds it to the loss as the regularizer)

All 3 models showed the same result most of the time, except when the program printed

Gap of output with builtin pytorch  18.459202 (model  1 vs model 2)
Gap of output with manual pytorch  18.459202 (model  1 vs model 3)

which showed a gap between Keras and PyTorch.
However, when the program ended, I still had those variables in the workspace (I work from an IDE), so I printed out the value again and the gap had disappeared (exact same variables as above). It went from 18.459202 to 4.091505e-06, as below.

In[410]: np.average(np.abs(keras_output-builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy()))
Out[410]: 4.091505e-06

There’s no code in between these steps, though, since the program ends right after it prints out a gap.

Your code is a bit confusing, as you don’t actually pass the weight_decay parameter to the optimizer in case 2. Is that just a typo? Or do you actually always have weight_decay at 0, so this has no link to weight decay?

Yeah, after I found this strangeness I dropped the weight decay to zero in all of the models (since a nonzero value might make it harder to find the cause).

Then I would say you should rerun your experiment in a clean environment. Most likely something between the end of your script and your new check changed the value of some things.

Also, for the PyTorch code:

  • Do not call the .forward() method on nn.Modules; simply call the module directly: manual_model(inputs).
  • Do not use .data. If you want to modify a tensor without recording the change in autograd, use with torch.no_grad():. If you want a new tensor with the same values that does not share history, use .detach().
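
A minimal sketch of both points applied to the snippets above (reusing names from your code, e.g. manual_model and X_train):

import torch

# Call the module directly instead of .forward(), and evaluate under
# torch.no_grad() instead of reaching for .data afterwards.
with torch.no_grad():
    pytorch_manual_output = manual_model(
        torch.from_numpy(X_train).float()
    ).numpy()

# If you need a history-free copy of a tensor (e.g. a gradient), use .detach():
grad_copy = manual_model.hidden[1].weight.grad.detach().clone()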