I am quite new to pytorch, my background is more mathematical.
In the tutorials I’ve been following we use gradient descent as our optimization method. Recently we began using the DataLoader class, and from what I can tell, after taking one batch of observations and differentiating the resulting cost (a weighted sum over that batch), we call .step() on the optimizer, then immediately loop back around for a new batch of observations.
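Here is a minimal sketch of the kind of loop I mean (the linear model, data, and hyperparameters are made up just for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(100, 3)                                  # 100 observations, 3 features
y = (X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(100)).unsqueeze(1)

model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

for epoch in range(20):
    for xb, yb in loader:             # one mini-batch of observations
        optimizer.zero_grad()         # clear gradients from the previous batch
        loss = loss_fn(model(xb), yb) # cost function built from THIS batch only
        loss.backward()               # differentiate it w.r.t. the weights
        optimizer.step()              # a single gradient step, then on to the next batch

final_loss = loss_fn(model(X), y).item()
```

So each call to .step() sees gradients computed from one batch, and the very next iteration replaces those observations with new ones.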
I am confused because it feels like we are taking only one step along the gradient vector per batch of observations.
Is it the case that .step() is actually taking many gradient steps?
Or is it actually good practice to take one step with a given batch and then immediately receive a new cost function with different terms (different observations)?