Poor results with PyTorch compared to an NN written from scratch

Hi,

I have an NN model written from scratch. It has a 2-neuron input layer, one hidden layer with 5 neurons, and 2 output neurons, with a tanh activation on the hidden layer. It uses cross-entropy loss, L2 weight regularization with lambda = 0.01, Xavier weight initialization, and a learning rate of 0.001. After 1000 epochs, this model gives a loss of 0.093 and an accuracy of 0.970 on the dataset below.

x,y = sklearn.datasets.make_moons(n_samples=200,noise=0.05)

I decided to build the same model using PyTorch. However, I got very poor results, which seemed quite strange.

Can you please check my PyTorch code and comment on why I got such poor results?

import torch
import torch.nn as nn
import torch.nn.functional as F
import sklearn.datasets

input_neurons = 2
hidden_neurons = 5
output_neurons = 2
learning_rate = 0.001
lambda_reg = 0.01

x,y = sklearn.datasets.make_moons(n_samples=200,noise=0.05)

x = torch.FloatTensor(x)
y = torch.LongTensor(y)

class FeedForward(torch.nn.Module):
    def __init__(self, input_neurons, hidden_neurons, output_neurons):
        super(FeedForward, self).__init__()
        self.hidden = nn.Linear(input_neurons, hidden_neurons)
        self.out = nn.Linear(hidden_neurons, output_neurons)
    def forward(self, x):
        x = self.hidden(x)
        x = F.tanh(x)
        x = self.out(x)
        return x

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform(m.weight)
        # m.bias.data.fill_(0.01)

network = FeedForward(input_neurons = input_neurons, hidden_neurons = hidden_neurons, output_neurons = output_neurons)
network.apply(init_weights)

optimizer = torch.optim.SGD(network.parameters(), lr=learning_rate, weight_decay=lambda_reg)
loss_function = torch.nn.CrossEntropyLoss()

for epoch in range(1000):
    out = network(x)
    loss = loss_function(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 50 == 0:
        max_value, prediction = torch.max(out, 1)
        predicted_y = prediction.data.numpy()
        target_y = y.data.numpy()
        accuracy = (predicted_y == target_y).sum() / target_y.size
        print('Accuracy = {:.2f} Loss = {:.2f}'.format(accuracy, loss))

Hi Ömer!

You don’t show your from-scratch code nor your pytorch results, so it’s
hard to know where the difference might lie.

However, I don’t see anything wrong with your code (other than it being
garbled by the forum because you didn’t enclose it in a triple-backtick
code block).

Your code seems to run fine for me, and gives good results if I increase
the learning rate, get rid of weight-decay (l2 regularization), and add
“momentum” to the SGD optimizer.

Note, I am using pytorch version 0.3.0, so I had to wrap the Tensors
in autograd.Variables.

Here is your code, as slightly modified by me:

import torch
import torch.nn as nn
import torch.nn.functional as F
torch.__version__

import numpy as np
import sklearn.datasets

torch.manual_seed (2020)
np.random.seed (2020)

x,y = sklearn.datasets.make_moons(n_samples=200,noise=0.05)

x = torch.autograd.Variable (torch.FloatTensor(x))   # using pytorch version 0.3.0
y = torch.autograd.Variable (torch.LongTensor(y))    # using pytorch version 0.3.0

input_neurons = 2
hidden_neurons = 5
output_neurons = 2
# learning_rate = 0.001
learning_rate = 0.1   # increase learning rate
# lambda_reg = 0.01
lambda_reg = 0.00     # turn off weight decay

momentum = 0.9        # add momentum to SGD

class FeedForward(torch.nn.Module):
    def __init__(self, input_neurons, hidden_neurons, output_neurons):
        super(FeedForward,self).__init__()
        self.hidden = nn.Linear(input_neurons, hidden_neurons)
        self.out = nn.Linear(hidden_neurons,output_neurons)
    def forward(self, x):
        x = self.hidden(x)
        x = F.tanh(x)
        x = self.out(x)
        return x

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform(m.weight)
        # m.bias.data.fill_(0.01)

network = FeedForward(input_neurons = input_neurons, hidden_neurons = hidden_neurons, output_neurons = output_neurons)
network.apply(init_weights)

# optimizer = torch.optim.SGD(network.parameters(), lr = learning_rate,weight_decay=lambda_reg)
optimizer = torch.optim.SGD(network.parameters(), lr = learning_rate, weight_decay = lambda_reg, momentum = momentum)
loss_function = torch.nn.CrossEntropyLoss()

for epoch in range(1000):
    out = network(x)
    loss = loss_function(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        max_value, prediction = torch.max(out, 1)
        predicted_y = prediction.data.numpy()
        target_y = y.data.numpy()
        accuracy = (predicted_y == target_y).sum() / target_y.size
        print('Accuracy = {:.3f} Loss = {:.6f}'.format(accuracy,loss.data[0]))

Here is the output, showing rapid convergence to 100% accuracy:

Accuracy = 0.500 Loss = 0.761427
Accuracy = 0.885 Loss = 0.255683
Accuracy = 0.885 Loss = 0.241214
Accuracy = 0.905 Loss = 0.218220
Accuracy = 0.955 Loss = 0.119875
Accuracy = 0.990 Loss = 0.048679
Accuracy = 1.000 Loss = 0.027161
Accuracy = 1.000 Loss = 0.018272
Accuracy = 1.000 Loss = 0.013567
Accuracy = 1.000 Loss = 0.010696
Accuracy = 1.000 Loss = 0.008776
Accuracy = 1.000 Loss = 0.007410
Accuracy = 1.000 Loss = 0.006392
Accuracy = 1.000 Loss = 0.005607
Accuracy = 1.000 Loss = 0.004983
Accuracy = 1.000 Loss = 0.004478
Accuracy = 1.000 Loss = 0.004060
Accuracy = 1.000 Loss = 0.003710
Accuracy = 1.000 Loss = 0.003412
Accuracy = 1.000 Loss = 0.003156

Good luck.

K. Frank


Thank you Frank! Appreciated…

After setting the weight-decay parameter to 0, raising the learning rate to 0.1, and adding momentum, I got results similar to yours, which is pretty good.

The thing I still don’t understand: in the pure-Python implementation I used a learning rate of 0.001 and a lambda of 0.01 for L2 weight regularization, with no momentum. How come the pure-Python implementation performed so much better than my first PyTorch model? I really don’t know.

By the way, below is the pure-Python implementation code:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets

np.random.seed(3)
x,y = sklearn.datasets.make_moons(n_samples=200,noise=0.05)

input_neurons = 2
hidden_neurons = 5
output_neurons = 2
samples = x.shape[0]
learning_rate = 0.001
lambda_reg = 0.01

# Define the initial weights

def init_network(input_dim, hidden_dim, output_dim):
    model = {}
    # Xavier initialization
    W1 = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
    b1 = np.zeros((1, hidden_dim))
    W2 = np.random.randn(hidden_dim, output_dim) / np.sqrt(hidden_dim)
    b2 = np.zeros((1, output_dim))
    model['W1'] = W1
    model['b1'] = b1
    model['W2'] = W2
    model['b2'] = b2
    return model

def retrieve(model_dict):
    W1 = model_dict['W1']
    b1 = model_dict['b1']
    W2 = model_dict['W2']
    b2 = model_dict['b2']
    return W1, b1, W2, b2

def forward(x, model_dict):
    W1, b1, W2, b2 = retrieve(model_dict)
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    softmax = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return z1, a1, softmax

def loss(softmax, y, model_dict):
    W1, b1, W2, b2 = retrieve(model_dict)
    m = np.zeros(y.shape[0])
    for i, correct_index in enumerate(y):
        predicted = softmax[i][correct_index]
        m[i] = predicted
    log_prob = -np.log(m)
    loss = np.sum(log_prob)
    reg_loss = lambda_reg / 2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    loss += reg_loss
    return float(loss / y.shape[0])

def backpropagation(x, y, model_dict, epochs):
    for i in range(epochs):
        W1, b1, W2, b2 = retrieve(model_dict)
        z1, a1, probs = forward(x, model_dict)  # a1: (200,5), probs: (200,2)
        delta3 = np.copy(probs)
        delta3[range(x.shape[0]), y] -= 1  # softmax + cross-entropy gradient, (200,2)
        dW2 = (a1.T).dot(delta3)  # (5,2)
        db2 = np.sum(delta3, axis=0, keepdims=True)  # (1,2)
        delta2 = delta3.dot(W2.T) * (1 - np.power(np.tanh(z1), 2))
        dW1 = np.dot(x.T, delta2)
        db1 = np.sum(delta2, axis=0)
        # Add the L2 regularization gradients (lambda * W, not lambda * sum(W))
        dW2 += lambda_reg * W2
        dW1 += lambda_reg * W1
        # Note: these gradients are summed over all samples rather than averaged,
        # so each step is effectively n_samples times larger than a step taken
        # against the mean loss with the same learning rate.
        # Update weights: W = W + (-lr * gradient) = W - lr * gradient
        W1 += -learning_rate * dW1
        b1 += -learning_rate * db1
        W2 += -learning_rate * dW2
        b2 += -learning_rate * db2
        # Update the model dictionary
        model_dict = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        # Print the loss and accuracy every 50 epochs
        if i % 50 == 0:
            prediction = np.argmax(probs, 1)
            accuracy = (prediction == y).sum() / y.size
            print("Loss at epoch {} is: {:.3f}, accuracy is {:.3f}".format(i, loss(probs, y, model_dict), accuracy))

    return model_dict

def predict(model_dict, x):
    W1, b1, W2, b2 = retrieve(model_dict)
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    exp_scores = np.exp(z2)
    softmax = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (200,2)
    return np.argmax(softmax, axis=1)  # (200,)

model_dict = init_network(input_dim = input_neurons , hidden_dim = hidden_neurons, output_dim = output_neurons)
model = backpropagation(x, y, model_dict, 1000)

Hi Ömer!

I don’t know either.

But somewhere your pytorch code and numpy code are doing
something different.

I would suggest looking at a single forward and backward pass
and seeing where things first differ. Make sure you feed the same
sklearn.datasets data into both your pytorch and numpy networks,
and make sure both networks are initialized with the same weights
and biases, for example as sketched below.
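
Something along these lines should work for synchronizing the
weights (just a sketch, assuming the model_dict and network objects
from your code above; note that nn.Linear stores its weight with
shape (out_features, in_features), hence the transposes):

    # copy the numpy parameters into the pytorch layers
    # (the numpy (in, out) matrices are transposed to match
    # nn.Linear's (out_features, in_features) layout)
    network.hidden.weight.data = torch.FloatTensor(model_dict['W1'].T)
    network.hidden.bias.data = torch.FloatTensor(model_dict['b1'].ravel())
    network.out.weight.data = torch.FloatTensor(model_dict['W2'].T)
    network.out.bias.data = torch.FloatTensor(model_dict['b2'].ravel())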

Run a batch of inputs through both networks. Do you get the same
predictions? If not, compare intermediate results until you localize
the difference.
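
For example (a sketch only — x_np here stands for the raw numpy
array returned by make_moons, before it was wrapped in a
FloatTensor, and it assumes the weights have been synchronized as
above):

    # run the same batch through both implementations and compare the logits
    z1, a1, _ = forward(x_np, model_dict)                  # numpy forward pass
    z2_np = a1.dot(model_dict['W2']) + model_dict['b2']    # numpy logits (pre-softmax)
    out_torch = network(torch.autograd.Variable(torch.FloatTensor(x_np)))
    print(np.allclose(z2_np, out_torch.data.numpy(), atol=1e-4))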

If the predictions are the same, do you get the same loss value?

Run your backward pass. Do you get the same gradients? If not,
look at intermediate results, etc.
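
Again, only a sketch (dW1 is the gradient computed inside your
backpropagation() function; note that CrossEntropyLoss averages
over the batch, while your numpy code sums the per-sample
gradients, so with regularization turned off you should expect a
factor of n_samples between the two):

    # compare gradients after a single backward pass on a fresh network
    # (assumes lambda_reg = 0 and weight_decay = 0 on both sides)
    out = network(x)
    loss_function(out, y).backward()
    torch_dW1 = network.hidden.weight.grad.data.numpy()
    print(np.allclose(torch_dW1 * 200, dW1.T, atol=1e-4))  # 200 = n_samples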

There is no reason you shouldn’t be able to get the same results
with pytorch and numpy (up to reasonable round-off error, which
isn’t your issue here).

(As an aside, I don’t see your numpy code backpropagating through
the softmax calculation in your model’s forward() function.)

Good luck.

K. Frank