Hello, I have created a DataLoader object with batch_size set to 5 and run the following code. I would like some clarification: is this code performing mini-batch gradient descent, or stochastic gradient descent on a mini-batch?
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader


class Data(Dataset):
    def __init__(self):
        self.x = torch.arange(-3, 3, 0.1).view(-1, 1)
        self.y = -3 * self.x + 1 + 0.1 * torch.randn(self.x.size())
        self.len = self.x.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.len


class linear_regression(nn.Module):
    def __init__(self, input_size, output_size):
        super(linear_regression, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        yhat = self.linear(x)
        return yhat
model = linear_regression(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
dataset = Data()
trainloader = DataLoader(dataset=dataset, batch_size=5)
LOSS = []
n = 1
for epoch in range(5):
    for x, y in trainloader:
        yhat = model(x)
        loss = criterion(yhat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        LOSS.append(loss)
I’m not sure what "stochastic gradient descent on a mini-batch" would mean, since as far as I understand, stochastic gradient descent uses only one sample per update by definition.
Because you use a batch size of 5, your code applies mini-batch gradient descent.
Thanks for the response. My confusion comes from the fact that the code included below computes SGD by taking a tensor the size of my training set and performing an update one sample at a time, i.e. for each iteration of the loop, an SGD update is performed for every sample in the tensor.
In the initial code, in the second (nested) loop, the DataLoader provides a tensor the size of my mini-batch. Why does it not follow the same procedure as above, i.e. performing an update on each sample in the tensor at every iteration?
import torch
import numpy as np
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader


class Data(Dataset):
    def __init__(self):
        self.x = torch.arange(-3, 3, 0.1).view(-1, 1)
        self.y = -3 * self.x + 1 + 0.1 * torch.randn(self.x.size())
        self.len = self.x.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.len


class linear_regression(nn.Module):
    def __init__(self, input_size, output_size):
        super(linear_regression, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        yhat = self.linear(x)
        return yhat
model = linear_regression(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
dataset = Data()
x, y = dataset[:]
LOSS = []
n = 1
for epoch in range(5):
    yhat = model(x)
    loss = criterion(yhat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    LOSS.append(loss)
In your current code snippet you assign your complete dataset to x, so each pass through the loop performs a single update computed from all samples, i.e. you are performing batch gradient descent.
In the former code your DataLoader provided batches of size 5, so you used mini-batch gradient descent.
If you use a dataloader with batch_size=1 or slice each sample one by one, you would be applying stochastic gradient descent.
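To make the three cases concrete, here is a minimal sketch that counts how many parameter updates one epoch would produce for each batch size. The toy data (60 samples built with torch.linspace) and the helper name updates_per_epoch are assumptions for illustration, not part of the original code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 60 samples, roughly mirroring the dataset in this thread.
x = torch.linspace(-3, 3, 60).view(-1, 1)
y = -3 * x + 1
dataset = TensorDataset(x, y)


def updates_per_epoch(batch_size):
    """Count the batches (and hence optimizer steps) that one epoch yields."""
    loader = DataLoader(dataset, batch_size=batch_size)
    return sum(1 for _ in loader)


steps = {
    "stochastic (batch_size=1)": updates_per_epoch(1),
    "mini-batch (batch_size=5)": updates_per_epoch(5),
    "full batch (batch_size=len(dataset))": updates_per_epoch(len(dataset)),
}
for name, count in steps.items():
    print(f"{name}: {count} parameter update(s) per epoch")
```

In a training loop that calls optimizer.step() once per batch, the number of batches is the number of updates, which is the only thing that differs between the three regimes here.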
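The averaged or summed loss will be computed based on your batch size. E.g. if your batch size is 5 and you are using your criterion with its default setting size_average=True (reduction='mean' in current PyTorch versions, where size_average is deprecated), the average of the losses for each sample in the batch will be calculated and used to compute the gradients. So a batch of 5 produces a single update from the averaged gradient, not five separate per-sample updates.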
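As a rough check of this, the sketch below compares the gradient from one backward pass over a batch of 5 with the average of the five gradients obtained one sample at a time. The random batch and the nn.Linear(1, 1) model are stand-ins chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
x = torch.randn(5, 1)  # one hypothetical mini-batch of 5 samples
y = torch.randn(5, 1)

criterion = nn.MSELoss()  # default reduction is 'mean'

# Gradient of the batch-averaged loss: one backward pass over the whole batch.
model.zero_grad()
criterion(model(x), y).backward()
batch_grad = model.weight.grad.clone()

# Average of the five per-sample gradients, computed one sample at a time.
per_sample_grads = []
for i in range(5):
    model.zero_grad()
    criterion(model(x[i:i + 1]), y[i:i + 1]).backward()
    per_sample_grads.append(model.weight.grad.clone())
mean_grad = torch.stack(per_sample_grads).mean(dim=0)

grads_match = torch.allclose(batch_grad, mean_grad, atol=1e-6)
print(grads_match)
```

If the two gradients agree, a mini-batch step is one update along the averaged direction, which is different from applying five consecutive single-sample updates.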
Sorry to necro this post, but it bothers me: why is it named a stochastic gradient descent optimizer even if we are doing full-batch gradient descent? Is there any difference between what vanilla gradient descent would do and what the SGD optimizer does when we run it with the full batch?
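Ahh, actually sorry, it’s just a mismatch in terminology. With its default settings (no momentum, no weight decay), the SGD optimizer is vanilla gradient descent (i.e. literally all it does is subtract the gradient * the learning rate from the weight, as expected). Whether the update is "stochastic" depends only on which samples you feed it. See here: How SGD works in pytorch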
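A small sketch to check this claim: compute the plain update w - lr * grad by hand, then let optim.SGD take its step and compare. The full-batch data here is made up for illustration, and the comparison assumes the optimizer's defaults (momentum=0, weight_decay=0).

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
lr = 0.01
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=lr)  # defaults: no momentum, no weight decay

x = torch.randn(60, 1)  # hypothetical full batch
y = -3 * x + 1

loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()

# Manual vanilla gradient descent step, computed before the optimizer runs.
expected_weight = model.weight.detach() - lr * model.weight.grad
expected_bias = model.bias.detach() - lr * model.bias.grad

optimizer.step()

weights_match = torch.allclose(model.weight.detach(), expected_weight)
bias_match = torch.allclose(model.bias.detach(), expected_bias)
print(weights_match, bias_match)
```

With momentum or weight decay enabled the two would no longer coincide, so the equivalence only holds for the plain configuration used in this thread.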
If you want to keep the whole computational graph, then it's okay to use LOSS.append(loss); but if you only want to store the loss value, use LOSS.append(loss.item()).
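The difference can be inspected directly. This is a minimal sketch with a made-up batch: the loss tensor still carries a grad_fn (so it references the graph), while loss.item() is a plain Python float.

```python
import torch
from torch import nn

model = nn.Linear(1, 1)
x = torch.randn(5, 1)  # hypothetical batch
y = torch.randn(5, 1)
loss = nn.MSELoss()(model(x), y)

stored_tensor = loss        # still attached to the graph through grad_fn
stored_value = loss.item()  # plain Python float, no graph attached

print(type(stored_tensor).__name__, stored_tensor.grad_fn is not None)
print(type(stored_value).__name__)

# loss.detach() is another option if a tensor (rather than a float) is wanted.
detached = loss.detach()
print(detached.requires_grad)
```

Appending the float is usually enough when the list is only used for plotting or logging.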
Also, shouldn’t you be accumulating the loss (for each batch) within each epoch and then appending it to the LOSS list, like this (below):
LOSS = []
...
for epoch in range(5):
    running_loss = 0  # accumulates the loss of each batch
    for x, y in trainloader:
        # do something......
        loss.backward()
        optimizer.step()
        running_loss += loss.item()  # accumulate the batch loss
    LOSS.append(running_loss)  # save the summed loss for each epoch
Just a question: if we write running_loss += loss.item() after each update of the parameters, I think we are computing this:
for epoch in range(5):
    running_loss = 0  # accumulates loss of each batch
    for x, y in trainloader:
        running_loss += loss.item()  # <=> F* = F(x, y, w_{0}) + F(x, y, w_{1}) + F(x, y, w_{2}) + …
over an epoch, where F is the objective function.
However, our objective function is defined by:
F(data, w) = F(x, y, w) + F(x, y, w) + F(x, y, w) + … for (x, y) in data
so every term uses the same w.
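One way to see the distinction raised here is to compute both quantities side by side. The sketch below is an illustration under assumed toy data, not the original poster's script: it accumulates the running loss during one training epoch (each term evaluated at the weights current at that step) and then re-evaluates every batch at the fixed, final weights.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
x = torch.linspace(-3, 3, 60).view(-1, 1)  # hypothetical toy data
y = -3 * x + 1 + 0.1 * torch.randn(x.size())
loader = DataLoader(TensorDataset(x, y), batch_size=5)

model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Running loss: each batch loss is evaluated at a different set of weights
# (w_0, w_1, w_2, ...), because the parameters are updated between batches.
running_loss = 0.0
for xb, yb in loader:
    loss = criterion(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    running_loss += loss.item()

# Objective at fixed weights: every batch loss is evaluated with the same
# (final) parameters, with gradients disabled.
fixed_loss = 0.0
with torch.no_grad():
    for xb, yb in loader:
        fixed_loss += criterion(model(xb), yb).item()

print(f"running loss over the epoch: {running_loss:.4f}")
print(f"objective at final weights:  {fixed_loss:.4f}")
```

The running loss is cheap because it reuses values already computed during training, but it is only a proxy; the second loop is what one would run to get the objective for a single set of weights.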