Is it right that optimizer.step() only does a single update, i.e. p_{n+1} = p_n - alpha * gradient(p_n)? It only performs one step?
If so, wouldn't it be better to perform more steps during one training iteration?
For example, if I had a classifier, wouldn't it make sense to do the following:
```python
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        X = X.to(device=0)
        y = y.cuda()
        for i in range(1000):
            pred = model(X)
            loss = loss_fn(pred, y)
        # Backpropagation
        for i in range(1000):
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
Instead of only doing:
```python
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        X = X.to(device=0)
        y = y.cuda()
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
```
A very simple SGD optimizer would use this update rule, yes.
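For plain SGD (no momentum, weight decay, etc.), a single optimizer.step() boils down to something like this minimal sketch (model and lr are assumed from your snippets, not actual internals of torch.optim.SGD):

```python
import torch

# Minimal sketch of what step() does for plain SGD: one in-place update
# per parameter, using the gradients that loss.backward() stored in p.grad.
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p -= lr * p.grad
```

Optimizers with internal state (momentum, Adam, etc.) do more than this single rule, but the "one update per call" behavior is the same.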
No, I don’t think so, since the gradients were computed from the initial parameter set (let’s call it p0) while you would apply them to the already updated parameters p1, p2, p3, etc.
The second approach creates new outputs using the updated parameter set, then calculates the new loss and thus new, matching gradients.
For a very simple example, take a look at the gradient descent illustration on Wikipedia: in your first example you would apply the first "arrow" multiple times and might thus not converge to the local minimum.
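To make that concrete, here is a small 1D sketch (a made-up f(x) = x**2 example, not from your code): recomputing the gradient before every step converges, while reapplying the initial gradient repeatedly overshoots the minimum.

```python
import torch

# Hypothetical 1D example: minimize f(x) = x**2, starting at x = 3.
x = torch.tensor(3.0, requires_grad=True)
lr = 0.4

# Recomputing the gradient before every step converges towards x = 0.
for _ in range(5):
    loss = x ** 2
    loss.backward()
    with torch.no_grad():
        x -= lr * x.grad  # x shrinks by a factor (1 - 2 * lr) each step
    x.grad = None
print(x)  # ~0.001

# Reapplying the *initial* gradient (f'(3) = 6) five times instead
# shoots far past the minimum:
print(3.0 - 5 * lr * 6.0)  # -9.0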
```python
pred = model(X)
loss = loss_fn(pred, y)
for i in range(1000):
    optimizer.zero_grad()
    loss.backward()  # fails on the second iteration
    optimizer.step()
```
This should throw an error, because step() modifies the leaf nodes (your model parameters) in-place, and those are required for the backward call.
Doing something like this (when using plain SGD)

```python
pred = model(X)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
for i in range(1000):
    optimizer.step()
```

is just a very inefficient way of multiplying your learning rate by a factor of 1000.
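A quick sanity check of that equivalence, with made-up numbers and a fixed gradient g (this only holds for plain SGD without momentum):

```python
# With a fixed gradient, N plain-SGD steps with learning rate lr
# equal one step with learning rate lr * N.
lr, g, p = 0.01, 2.0, 5.0

p_many = p
for _ in range(1000):
    p_many -= lr * g

p_once = p - (lr * 1000) * g
print(p_many, p_once)  # both -15.0 (up to float rounding)
```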
As a side note: increasing the step size / learning rate is not always a bad idea. It can speed up training and even improve final model performance, but initializing your optimizer with a higher learning rate is the preferred way to do it. Which exact learning rate / step size to go for depends a lot on the problem at hand, and finding a good value usually requires some trial and error.
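For example (a sketch; model as in the snippets above, and lr=0.1 is a made-up value):

```python
import torch

# Choose the higher learning rate up front when creating the optimizer,
# rather than emulating it with repeated step() calls.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```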
Let me get this straight: so you are saying that in my first approach I would apply the vector from x0 to x1 all the time and thus miss the local minimum, is that right?
Yes, assuming you call optimizer.step() inside the loop and move the backward() out of it, since keeping backward() inside the loop will already raise a runtime error due to an invalid gradient calculation, as pointed out by @Joschka.