# Can't use BCELoss correctly

Hi,

I am implementing the exercises from Andrew Ng’s Machine Learning course on Coursera.
In ex02, my task is to predict whether someone is admitted to a school.
The dataset gives two exam scores as x1 and x2, and 1 or 0 as the admission result.

I implemented logistic regression correctly for this problem in Octave with the same dataset.
But it doesn’t work at all in PyTorch. After spending the whole day on this simple task, I am asking for your help.

Here is my PyTorch code.
The problems are as follows.

1. The loss goes down somewhat, but it always stays above 0.6, whereas it went down to 0.2 in Octave.
2. As a result, it doesn’t predict well… actually, never.
```python
import pandas as pd
import torch
import matplotlib.pyplot as plt
import torch.nn as nn


class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_class):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size, num_class)

    def forward(self, x):
        out = self.linear(x)
        return torch.sigmoid(out)


data = pd.read_csv('data/ex2data1.txt', header=None)

# Two exam scores as features, admission result (0 or 1) as target.
X = torch.from_numpy(data.iloc[:, :2].to_numpy()).float()
Y = torch.from_numpy(data.iloc[:, 2].to_numpy()).float().view(-1, 1)

input_size = 2
num_class = 1
total_epoch = 1000
learning_rate = 0.0001

model = LogisticRegression(input_size, num_class)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

loss_history = []
for epoch in range(total_epoch):
    output = model(X)

    optimizer.zero_grad()
    loss = criterion(output, Y)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 5 == 0:
        print('Epoch: [%d/%d], Loss: %.4f' % (epoch + 1, total_epoch, loss.item()))
        loss_history.append(loss.item())

# Predicted admission probability for a few score pairs.
with torch.no_grad():
    for scores in [(20, 30), (80, 30), (10, 10), (45, 85)]:
        print("predict", scores, model(torch.tensor(scores, dtype=torch.float32)))


def plotLossHistory(loss_history):
    plt.figure()
    plt.plot(loss_history)
    plt.show()


plotLossHistory(loss_history)
```

I tried your code and it seems to work, but it is unfortunately painfully slow to converge.
I changed the optimizer to `Adam` and initialized the weights with `xavier_uniform`.
This code should give you a loss of approx. `0.204`.

```python
import pandas as pd
import torch
import torch.nn as nn

torch.manual_seed(2809)


class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_class):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size, num_class)
        self.initialize()

    def initialize(self):
        # Xavier-uniform weights, zero bias.
        nn.init.xavier_uniform_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        out = self.linear(x)
        return torch.sigmoid(out)


data = pd.read_csv('~/Downloads/ex2data1.txt', header=None)

X = torch.from_numpy(data.iloc[:, :2].to_numpy()).float()
Y = torch.from_numpy(data.iloc[:, 2].to_numpy()).float().view(-1, 1)

input_size = 2
num_class = 1
total_epoch = 250000
learning_rate = 1e-4

model = LogisticRegression(input_size, num_class)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

loss_history = []

for epoch in range(total_epoch):
    output = model(X)
    loss = criterion(output, Y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10000 == 0:
        print('Epoch: [%d/%d], Loss: %.4f' % (epoch + 1, total_epoch, loss.item()))
        loss_history.append(loss.item())
```

Thank you! You are great.
Your code works well.

I have two questions after this.

1. When I use SGD the loss goes down to 0.5, but it goes down to 0.2 if I use Adam as you did.
I learned that the binary cross-entropy cost function is convex,
so it should go down to the global minimum.
But it gets stuck at some point (0.5). Do you have any idea why?

2. When I implemented it as in the lecture (Coursera Machine Learning ex02), I got a loss of 0.2 very quickly, in about 400 epochs. But even with the Adam optimizer, it takes over 10000 epochs to reach a loss of 0.2. The two versions, in PyTorch and in Octave, look like they have the same logic, though. I’m confused.

Thank you again


Well, I could push it down to `~0.2` using `SGD` as well; it just took an incredible number of epochs.
Could you post the Octave code please? I’m curious what the reason for the speed difference is.
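
To illustrate the "slow rather than stuck" behaviour, here is a minimal sketch on synthetic data. The data (two scores in [30, 100], admitted when they sum to more than 130) and the step count are assumptions made for illustration, not the exercise file. With full-batch `SGD` at `lr=1e-4` the loss never goes up, which is what you would expect for a convex cost with a small enough step; it just falls slowly.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the exam data (an assumption): two scores in
# [30, 100], labelled 1 when the scores sum to more than 130.
n = 100
X = 30 + 70 * torch.rand(n, 2, dtype=torch.float64)
Y = (X.sum(dim=1, keepdim=True) > 130).double()

# Zero-initialized logistic regression parameters.
w = torch.zeros(2, 1, dtype=torch.float64, requires_grad=True)
b = torch.zeros(1, dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=1e-4)
criterion = torch.nn.BCELoss()

losses = []
for step in range(2000):
    optimizer.zero_grad()
    loss = criterion(torch.sigmoid(X @ w + b), Y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Print a few checkpoints to see how slowly the loss moves.
for step in (0, 499, 999, 1999):
    print("step %4d  loss %.4f" % (step + 1, losses[step]))
```

If the printed values are still dropping at the last checkpoint, the optimizer has not converged yet, so a plateau in a short run does not by itself mean it is stuck.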


Oh yeah?
My loss history with SGD never went below 0.5.

My code in Octave has several files, so I uploaded it to GitHub.

ex2.m is the main file and it runs the logic using ex2data1.txt.
In costFunction.m I implemented the BCE function.
And I used the fminunc library function for training.

As you can see in ex2.m line 94, the number of iterations is only 400 and it works well.

I checked your code and compared the gradients, as you did here and here, and they are the same up to floating-point precision.
Unfortunately, I don’t know what fminunc is doing exactly.
Are you sure it’s SGD? Maybe it runs some iterations internally on its own.
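
For anyone who wants to repeat the gradient comparison, a rough sketch follows. It uses a small made-up batch (an assumption, not ex2data1.txt) and checks `nn.BCELoss` and autograd against the cost and the closed-form gradient `X^T (p - y) / m` written out by hand.

```python
import torch

torch.manual_seed(0)

# Small made-up batch (an assumption, not the exercise data).
X = torch.randn(8, 2, dtype=torch.float64)
Y = torch.tensor([[0.], [1.], [1.], [0.], [1.], [0.], [0.], [1.]],
                 dtype=torch.float64)
w = torch.randn(2, 1, dtype=torch.float64, requires_grad=True)
b = torch.zeros(1, dtype=torch.float64, requires_grad=True)

p = torch.sigmoid(X @ w + b)

# Loss from PyTorch and the same cost written out by hand.
loss_torch = torch.nn.BCELoss()(p, Y)
loss_manual = -(Y * torch.log(p) + (1 - Y) * torch.log(1 - p)).mean()

# Autograd gradient versus the closed-form gradient X^T (p - y) / m.
loss_torch.backward()
m = X.shape[0]
grad_w_manual = X.T @ (p.detach() - Y) / m
grad_b_manual = (p.detach() - Y).mean().reshape(1)

print("loss matches:  ", torch.allclose(loss_torch, loss_manual))
print("grad_w matches:", torch.allclose(w.grad, grad_w_manual))
print("grad_b matches:", torch.allclose(b.grad, grad_b_manual))
```

If all three checks pass, the cost and gradient are the same in both implementations, so any remaining speed difference has to come from the optimizer.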

I replaced the SGD optimizer with LBFGS, and with it you’ll get a loss of `~0.2` in approx. 200 epochs.

```python
# Continues from the script above: reuses LogisticRegression, X, Y,
# input_size and num_class.
model = LogisticRegression(input_size, num_class)
criterion = nn.BCELoss()
optimizer = torch.optim.LBFGS(model.parameters(), lr=1e-3)

total_epoch = 400
loss_history = []


def closure():
    # LBFGS may re-evaluate the loss several times within one step.
    optimizer.zero_grad()
    output = model(X)
    loss = criterion(output, Y)
    loss.backward()
    return loss


for epoch in range(total_epoch):
    loss = optimizer.step(closure)

    if (epoch + 1) % 10 == 0:
        print('Epoch: [%d/%d], Loss: %.4f' % (epoch + 1, total_epoch, loss.item()))
        loss_history.append(loss.item())
```
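
One caveat when comparing epoch counts: a single `optimizer.step(closure)` of `torch.optim.LBFGS` can run several inner iterations (its `max_iter` argument defaults to 20), and it re-evaluates the closure along the way. The sketch below counts those evaluations on made-up data; the data and the `calls` counter are assumptions added for illustration.

```python
import torch

torch.manual_seed(0)

# Made-up data (an assumption), roughly on the scale of exam scores.
X = 30 + 70 * torch.rand(50, 2, dtype=torch.float64)
Y = (X.sum(dim=1, keepdim=True) > 130).double()

# Zero-initialized logistic regression parameters.
w = torch.zeros(2, 1, dtype=torch.float64, requires_grad=True)
b = torch.zeros(1, dtype=torch.float64, requires_grad=True)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.LBFGS([w, b], lr=1e-3)

calls = {"closure": 0}  # hypothetical counter, only for this demo


def closure():
    calls["closure"] += 1
    optimizer.zero_grad()
    loss = criterion(torch.sigmoid(X @ w + b), Y)
    loss.backward()
    return loss


n_steps = 10
for _ in range(n_steps):
    optimizer.step(closure)

print("optimizer.step() calls:   ", n_steps)
print("loss/gradient evaluations:", calls["closure"])
```

So 200 LBFGS "epochs" can stand for many more loss and gradient evaluations than 200 SGD updates, which is worth remembering before reading too much into the raw numbers.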

EDIT:
Also, I initialized the weights to zeros to compare both implementations:

```python
class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_class):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size, num_class)
        self.initialize()

    def initialize(self):
        # All-zero start, so both implementations begin from the same point.
        nn.init.zeros_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        out = self.linear(x)
        return torch.sigmoid(out)
```
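
A side effect of the zero initialization is a handy sanity check: every prediction starts at sigmoid(0) = 0.5, so the initial BCE loss is ln 2 ≈ 0.693 whatever the data. The four score pairs and labels below are made up for illustration.

```python
import math

import torch

# Any inputs work here; these score pairs and labels are made up.
X = torch.tensor([[20., 30.], [80., 30.], [10., 10.], [45., 85.]])
Y = torch.tensor([[0.], [1.], [0.], [1.]])

linear = torch.nn.Linear(2, 1)
torch.nn.init.zeros_(linear.weight)
torch.nn.init.zeros_(linear.bias)

# With zero parameters the model outputs 0.5 for every row.
with torch.no_grad():
    initial_loss = torch.nn.BCELoss()(torch.sigmoid(linear(X)), Y).item()

print("initial loss:", initial_loss)
print("ln 2:        ", math.log(2))
```

A run whose loss stays above 0.6 has therefore not moved far from this starting value.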

Oh, applying LBFGS is interesting.
In fminunc, I can set the maximum number of iterations,
so I set it to 400.

Umm… Maybe there are some differences between the stochastic gradient descent (SGD) algorithm and regular gradient descent. I couldn’t find regular gradient descent in PyTorch, which is why I am using SGD, even though I pass the whole batch in each optimization step instead of mini-batches. And the results came out different.
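
On that last point, `torch.optim.SGD` with its default settings (no momentum, no weight decay) just applies `p -= lr * grad`. The "stochastic" part only comes from feeding it mini-batches, so with the whole batch in every step it should behave as regular gradient descent. A minimal sketch on made-up data (an assumption, not the exercise file) compares it with a hand-written update:

```python
import torch

torch.manual_seed(0)

# Made-up batch (an assumption, not the exercise data).
X = torch.randn(16, 2, dtype=torch.float64)
Y = (torch.rand(16, 1) > 0.5).double()
lr = 0.1


def bce(w, b):
    return torch.nn.functional.binary_cross_entropy(torch.sigmoid(X @ w + b), Y)


# Version 1: torch.optim.SGD fed the whole batch in every step.
w1 = torch.zeros(2, 1, dtype=torch.float64, requires_grad=True)
b1 = torch.zeros(1, dtype=torch.float64, requires_grad=True)
opt = torch.optim.SGD([w1, b1], lr=lr)

# Version 2: hand-written gradient descent, theta <- theta - lr * grad.
w2 = torch.zeros(2, 1, dtype=torch.float64, requires_grad=True)
b2 = torch.zeros(1, dtype=torch.float64, requires_grad=True)

for _ in range(100):
    opt.zero_grad()
    bce(w1, b1).backward()
    opt.step()

    grad_w, grad_b = torch.autograd.grad(bce(w2, b2), [w2, b2])
    with torch.no_grad():
        w2 -= lr * grad_w
        b2 -= lr * grad_b

print("weights match:", torch.allclose(w1, w2))
print("bias matches: ", torch.allclose(b1, b2))
```

If the two versions agree, the difference from the Octave run is more likely down to what fminunc does internally than to SGD versus gradient descent.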