Standard way to go through 1 epoch in SGD with a Model (especially NN)

I was looking at some code for logistic regression and I came across the following:

    for i in range(100):
        cost = 0.
        num_batches = n_examples // batch_size
        for k in range(num_batches):
            start, end = k * batch_size, (k + 1) * batch_size
            cost += train(model, loss, optimizer,
                          trX[start:end], trY[start:end])
        predY = predict(model, teX)
        print("Epoch %d, cost = %f, acc = %.2f%%"
              % (i + 1, cost / num_batches, 100. * np.mean(predY == teY)))

Is this the standard way to process one epoch? It seemed odd to me because it extracts contiguous chunks of data rather than random samples…



One epoch is usually defined to be one complete run through all of the training data. The easiest way of doing that is to go through the data in contiguous chunks.

If you pick the samples for each batch at random then you will face two problems.

  1. You will have to copy the batch data before running it through the model, because models generally need contiguous data in each batch (see the sketch just after this list).
  2. You will have to keep track of which samples the model has seen in order to be able to tell when the model has seen all the training samples.
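
To make point 1 concrete, here is a minimal sketch (the tensors here are made up for illustration, not taken from the posted code) of the difference between slicing a contiguous block and picking random rows:

    import torch

    data = torch.arange(12, dtype=torch.float32).reshape(6, 2)

    # A contiguous slice is a view: it shares memory with the original tensor.
    block = data[0:3]
    print(block.data_ptr() == data.data_ptr())    # True - no copy was made

    # Indexing with a tensor of (possibly random) indices materialises a new tensor.
    idx = torch.tensor([4, 1, 3])
    picked = data[idx]
    print(picked.data_ptr() == data.data_ptr())   # False - the selected rows were copied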

Responding to 2): I always thought it didn’t really matter if one went exactly through all the data (unless there is some empirical study or theorem I am not aware of that shows why a contiguous pass over the data is desirable). As far as I know, the strength of SGD comes from it actually being stochastic (e.g. https://cbmm.mit.edu/publications/musings-deep-learning-properties-sgd). I am not sure whether going through the data in such a deterministic way (i.e. contiguously) breaks the benefits of the stochasticity.

Responding to 1): Models need contiguous data in each batch? What do you mean they “need” it? Do they need it because of the code or because of some statistical benefit? I’ve never heard that before.

Non-stochastic gradient descent involves making exactly one update per epoch. True stochastic gradient descent makes progress considerably faster because it makes one update per input sample. A common compromise is to split the data into batches and make one update per batch. The simple fact that you split the data up into small batches provides some level of stochasticity even if the small batches are identical from one epoch to the next. Random sampling would provide more stochasticity, but incurs the penalty of having to copy the data. Shuffling the batches is frequently used to add some more stochasticity without incurring the penalty inherent in copying the data.
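
As a minimal runnable sketch of those three update frequencies (the tiny model and data below are made up purely for illustration, not the code from this thread):

    import torch

    # Toy data and model, just to show how often the parameters get updated.
    n_examples, batch_size = 32, 8
    trX, trY = torch.randn(n_examples, 3), torch.randn(n_examples, 1)
    model = torch.nn.Linear(3, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    def step(x, y):
        # One parameter update on whatever chunk of data it is given.
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    # Non-stochastic (full-batch) gradient descent: one update per epoch.
    step(trX, trY)

    # True SGD: one update per sample, n_examples updates per epoch.
    for i in range(n_examples):
        step(trX[i:i + 1], trY[i:i + 1])

    # Mini-batch SGD: one update per contiguous batch.
    for k in range(n_examples // batch_size):
        start, end = k * batch_size, (k + 1) * batch_size
        step(trX[start:end], trY[start:end])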

Models need contiguous data in each batch because the operations they perform are designed to work on contiguous memory, which allows the code to be significantly optimised. Matrix operations can be parallelised efficiently when the batch is one contiguous block of memory.
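
A small illustration of what contiguous means in practice (again with made-up tensors): a batch sliced out of the training tensor is a single dense block of memory, whereas something like a transposed view is not, and .contiguous() copies only in the latter case:

    import torch

    trX = torch.randn(1000, 32)

    batch = trX[0:64]               # contiguous slice of rows
    print(batch.is_contiguous())    # True - one dense block, ready for fast matrix ops

    t = trX.t()                     # transposed view: same storage, strided access
    print(t.is_contiguous())        # False
    print(t.contiguous().is_contiguous())  # True - .contiguous() copied it into a dense block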

There is no requirement to go through the data exactly once, and as far as I know, there is no statistical reason why it would be preferable to random sampling. In fact, if you read the code you posted carefully, you will see that if n_examples is not exactly divisible by batch_size, a few of the training samples are never used.
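
As a quick worked example of that last point (the numbers are made up): with 1050 examples and a batch size of 128, integer division gives 8 batches, so 26 samples are never used:

    n_examples, batch_size = 1050, 128        # hypothetical values
    num_batches = n_examples // batch_size    # 8 full batches per epoch
    used = num_batches * batch_size           # 1024 samples actually seen
    unused = n_examples % batch_size          # 26 samples never used
    print(num_batches, used, unused)          # 8 1024 26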

An epoch is usually defined as a single pass through the training data, but really it is just a fixed length of training that we use for evaluating training progress.

OK, it seems there was some confusion. When I said SGD I meant what is normally (informally) implied: mini-batch stochastic gradient descent. I didn’t actually expect one point at a time with each data point seen only once (true SGD).

Why does the additional stochasticity mean we have to copy data? Can’t we just do it in place? I don’t understand what part of the code would copy the data without the coder’s permission. Is the line you are referring to:

            cost += train(model, loss, optimizer,
                          trX[start:end], trY[start:end])

Does indexing the FloatTensor copy the data unnecessarily?

To be clear, the code you pasted doesn’t copy the data unnecessarily. In particular, that line doesn’t copy the data. It creates a batch from an already contiguous block of samples - this doesn’t require the data to be copied.

If you want to make a batch using a random selection of samples instead of using a block of contiguous samples, then you would have to explicitly make sure the data was contiguous before feeding it to the model and that would generally require copying. Something like this should work.

# sample batch_size random indices from the whole training set (with replacement)
random_indices = torch.LongTensor(np.random.randint(0, n_examples, size=batch_size))
batchX = trX[random_indices].contiguous()
batchY = trY[random_indices].contiguous()
cost += train(model, loss, optimizer, batchX, batchY)

tensor.contiguous() only copies the data if it has to - if the tensor is already stored contiguously it simply returns the same tensor without copying anything. In this case the indexing with random_indices already produces a new contiguous tensor (the selected rows get copied out), so the .contiguous() calls above are really just a safeguard.

I hope that I am not adding to your confusion.

What I wanted to do is scramble the data set every epoch so that the algorithm chooses different batches each epoch when it trains. However, I am unsure how to do that without unnecessary copying of the data. Do you know how to do that?

Thanks for the discussions 🙂

I guess I now realize this is very related to:

I don’t think you can produce batches of random samples without needing to copy the data. However, there are some things you can do to improve the stochasticity of the training.

The code you posted takes batches in order, starting from the first sample available and maybe leaving a few samples unused at the end when there aren’t enough left to fill a batch.

for i in range(100):
    num_batches = n_examples // batch_size
    for k in range(num_batches):
        start, end = k * batch_size, (k + 1) * batch_size

You could improve this by randomly skipping a few samples at the beginning of the list instead of discarding samples only at the end of the list…

for i in range(100):
    num_batches = n_examples // batch_size
    unused_examples = n_examples % batch_size
    # skip between 0 and unused_examples samples at the start of the data
    random_start = np.random.randint(unused_examples + 1)
    for k in range(num_batches):
        start, end = random_start + k * batch_size, random_start + (k + 1) * batch_size

Another thing you can do is to randomise the order of the batches…

for i in range(100):
    num_batches = n_examples // batch_size
    unused_examples = n_examples % batch_size
    random_start = np.random.randint(unused_examples + 1)
    for k in np.random.permutation(range(num_batches)):
        start, end = random_start + k * batch_size, random_start + (k + 1) * batch_size

Yet another idea would be to randomise the size of the batches within a reasonable range…

for i in range(100):
    epoch_batch_size = np.random.randint(int(batch_size * 0.9), int(batch_size * 1.1) + 1)
    num_batches = n_examples // epoch_batch_size
    unused_examples = n_examples % epoch_batch_size
    random_start = np.random.randint(unused_examples + 1)
    for k in np.random.permutation(range(num_batches)):
        start, end = random_start + k * epoch_batch_size, random_start + (k + 1) * epoch_batch_size

If you really need random batches, tensor.index_select(0, indices) might be more efficient, though it still copies the data.
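
For the per-epoch scrambling asked about above, a common middle ground is to pay for one copy of the data per epoch by shuffling it up front, after which every batch is again a cheap contiguous slice. A minimal sketch, assuming the same trX, trY, train, model, loss and optimizer as in the code you posted:

    for i in range(100):
        cost = 0.
        # One shuffle (and therefore one copy of the data) per epoch.
        perm = torch.randperm(n_examples)
        shufX, shufY = trX[perm], trY[perm]
        num_batches = n_examples // batch_size
        for k in range(num_batches):
            # Each batch is a contiguous slice (a view) of the shuffled copy.
            start, end = k * batch_size, (k + 1) * batch_size
            cost += train(model, loss, optimizer,
                          shufX[start:end], shufY[start:end])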
