Hi,

I tried to compare the optimization of a neural network model using

- Keras
- PyTorch (using the weight_decay argument in torch.optim to add the regularizer)
- PyTorch (using a custom function that computes the L2 norm of the weights, times a coefficient, and adds it to the loss as the regularizer)

I trained the network and inspected the result after one step of stochastic gradient descent to compare across the models.

With the regularization coefficient set to zero, all three methods should yield the same result (weight updates, predicted values) given the same training data and initial weights, and this is indeed the case most of the time.
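
To make precise what I mean by equivalence, here is a minimal, self-contained sketch (toy model and illustrative names, not my actual code) of why methods 2 and 3 should coincide under plain SGD: weight_decay=lam adds lam * w to every gradient, which is exactly the gradient of an explicit (lam/2) * ||w||^2 term in the loss.

```
import copy
import torch
import torch.nn.functional as F

lam, lr = 0.01, 0.1
x, y = torch.randn(8, 3), torch.randn(8, 1)

model_a = torch.nn.Linear(3, 1)
model_b = copy.deepcopy(model_a)  # identical initial weights
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=lam)  # method 2
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)                    # method 3

# Method 2: L2 handled inside the optimizer via weight_decay
loss_a = F.mse_loss(model_a(x), y)
opt_a.zero_grad(); loss_a.backward(); opt_a.step()

# Method 3: explicit (lam/2)*||w||^2 penalty added to the loss
penalty = (lam / 2.0) * sum((p ** 2).sum() for p in model_b.parameters())
loss_b = F.mse_loss(model_b(x), y) + penalty
opt_b.zero_grad(); loss_b.backward(); opt_b.step()

print(torch.allclose(model_a.weight, model_b.weight))  # expect True
```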

However, the weight updates sometimes differ, together with some strangeness in the program's behavior.

I will first explain the strangeness I found. I wrote code to print a message whenever there is a large difference in predictions between methods 1 and 2, or between 1 and 3. It sometimes reported this:

```
Gap of output with builtin pytorch 18.459202
Gap of output with manual pytorch 18.459202
```

However, when I checked the quantity myself afterwards, the gap had disappeared:

```
In[410]: np.average(np.abs(keras_output-builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy()))
Out[410]: 4.091505e-06
```

It seems as though something was updated in the meantime (such as the gradients and the predictions) after the first print.
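
One thing I plan to check is whether the forward pass itself is deterministic when I re-measure the gap; a minimal sketch (assuming dropout is the only source of randomness in my model):

```
# Sketch: force a deterministic forward pass before re-measuring the gap.
# (Dropout left in training mode would make repeated forward calls differ.)
builtin_model.eval()  # disables dropout for inference
with torch.no_grad():
    pred = builtin_model(torch.from_numpy(X_train).type(torch.FloatTensor))
print(np.average(np.abs(keras_output - pred.numpy())))
```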

I then tried to see what was going on by inspecting the gradient computed for one of the weights in methods 2 and 3; both contain the same gradient:

```
In [412]: manual_model.hidden[1].weight.grad
Out[412]: tensor([[-0.5983]])
In [413]: builtin_model.hidden[1].weight.grad
Out[413]: tensor([[-0.5983]])
```

I can manually compute the updated weight to see which method differs from my expectation.

At command [414], I computed the updated weight = old weight - learning rate x gradient, and [415] shows that this is exactly the value method 3 produced:

```
In [414]: temp_model.hidden[1].weight.data-learn_rate*manual_model.hidden[1].weight.grad
Out[414]: tensor([[1.1240]])
In [415]: manual_model.hidden[1].weight.data
Out[415]: tensor([[1.1240]])
```
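
For reference, the same check can be written programmatically (a sketch using the variables above; temp_model still holds the pre-update weights):

```
# Sketch: the same check as commands [414]-[415], written as an assertion.
expected = temp_model.hidden[1].weight.data \
    - learn_rate * manual_model.hidden[1].weight.grad
print(torch.allclose(expected, manual_model.hidden[1].weight.data))  # True
```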

The updated weight from method 2, however, is different:

```
In [417]: builtin_model.hidden[1].weight.data
Out[417]: tensor([[1.1257]])
```

I have included the main components of the code for reference:

```
import copy
import time
import numpy as np
import torch
import keras
from torch.utils.data import DataLoader

#Construct PyTorch models
temp_n_hidden = copy.deepcopy(n_hidden)
manual_model = MultilayerPerceptron(input_dim, output_dim, temp_n_hidden, dropout)
temp_n_hidden = copy.deepcopy(n_hidden)
builtin_model = MultilayerPerceptron(input_dim, output_dim, temp_n_hidden, dropout)
temp_n_hidden = copy.deepcopy(n_hidden)
temp_model = MultilayerPerceptron(input_dim, output_dim, temp_n_hidden, dropout)
#Construct Keras model
temp_n_hidden = copy.deepcopy(n_hidden)
keras_model = MultilayerPerceptron_Keras(input_dim, output_dim, temp_n_hidden, dropout)
#Copy weights from Keras to PyTorch
keras_to_pyt(keras_model, builtin_model)
#Copy weights from PyTorch to PyTorch
manual_model.load_state_dict(builtin_model.state_dict())
temp_model.load_state_dict(builtin_model.state_dict())
#Set up data for training
dataset = Data_prep(X_train, y_train_normalized)
train_loader = DataLoader(dataset=dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          num_workers=0)
#Training process
builtin_optimizer = torch.optim.SGD(list(builtin_model.parameters()), lr=learn_rate)
manual_optimizer = torch.optim.SGD(list(manual_model.parameters()), lr=learn_rate)
count = 0
for epoch in range(n_epochs):
    for i, data in enumerate(train_loader, 0):
        #data for pytorch
        inputs, y = data
        inputs, y = to_variable(var=(inputs, y))
        #Manual training (method 3): explicit L2 penalty added to the loss
        manual_outputs = manual_model.forward(inputs)
        MSE = loss_func(manual_outputs, y)
        Weight_Regularization = l2_regularization(reg/2.0, manual_model)
        manual_loss = MSE + Weight_Regularization
        manual_optimizer.zero_grad()
        manual_loss.backward()
        manual_optimizer.step()
        #Built-in training (method 2): weight decay left to the optimizer
        builtin_outputs = builtin_model.forward(inputs)
        MSE = loss_func(builtin_outputs, y)
        builtin_loss = MSE
        builtin_optimizer.zero_grad()
        builtin_loss.backward()
        builtin_optimizer.step()
        temp = (manual_model.hidden[1].weight.data
                - builtin_model.hidden[1].weight.data).abs().sum().data.numpy().item()
        if count % 100 == 0:
            print("Weight Regularization ", Weight_Regularization.data)
            print("Weight difference %.8f" % temp)
            print("Built-in MSE ", builtin_loss.data, "\n")
            print("Manual MSE ", loss_func(manual_outputs.data, y), "\n")
        count += 1
#Keras training (method 1)
#data for keras
y_train_normalized = np.array(y.numpy(), ndmin=2).T
adam = keras.optimizers.Adam(lr=learn_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0, amsgrad=False)
sgd = keras.optimizers.SGD(lr=learn_rate)
keras_model.compile(loss='mean_squared_error', optimizer=sgd)
keras_model.fit(X_train, y_train_normalized, batch_size=batch_size, nb_epoch=n_epochs, verbose=0)
#Comparison
keras_output = keras_model.predict(X_train, batch_size=500)
pytorch_builtin_output = builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor))
pytorch_manual_output = manual_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor))
print("Gap of output with manual pytorch ",
      np.average(np.abs(keras_output - manual_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy())))
print("Gap of output with builtin pytorch ",
      np.average(np.abs(keras_output - builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy())))
if np.average(np.abs(keras_output - builtin_model.forward(torch.from_numpy(X_train).type(torch.FloatTensor)).data.numpy())) > 1:
    print("Stop")
    time.sleep(5.5)
```
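
For completeness, the l2_regularization helper called in the training loop is essentially the following (a sketch; my actual helper may differ in minor details):

```
# Sketch of the l2_regularization helper: coefficient times the sum of
# squared parameters, returned as a differentiable tensor.
def l2_regularization(coef, model):
    penalty = 0.0
    for param in model.parameters():
        penalty = penalty + coef * (param ** 2).sum()
    return penalty
```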