Storing batches from DataLoader in a Buffer/List for later training

Hello, I have a problem and I cannot figure out what I am doing wrong. I want to get samples from a PyTorch DataLoader and store them in a buffer, so that I can use them for training later. The normal way to write the training loop (without the buffer) would look something like this:

for epoch in range(epochs):
  for data, target in data_loader:
      ... # loss, backprop etc

But instead, I want to do something like this:

import random

buffer = []
for data, target in data_loader:
  buffer.append((data, target))

for epoch in range(epochs):
  for data, target in buffer:
      ... # loss, backprop etc
  random.shuffle(buffer)

As far as I understand, both methods should produce approximately the same results. The second method converges, but it performs significantly worse than the first, without changing anything else in the code (except for adding the buffer, of course). I already experimented with cutting the batches apart and shuffling the samples around, but nothing has worked so far. I would greatly appreciate any help, because I have been struggling with this for a week. Thanks!

(PS: my real code is more complex, but it all boils down to this)

The main difference is that the second approach disables data augmentation.
I.e. if you are using random transformations in your Dataset, the first approach applies them to each sample on-the-fly while the data is loaded and processed in for data, target in data_loader, so every epoch sees freshly augmented samples.
In your second approach you have applied all (random) transformations once and stored the processed data. The model therefore sees identical samples in every epoch and no data augmentation is used, which can make the convergence worse.
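To make the difference concrete, here is a minimal sketch. It assumes your Dataset applies a random transformation inside __getitem__; the RandomFlipDataset below is a made-up stand-in for whatever augmentation you actually use. Iterating the DataLoader twice yields differently augmented tensors, while the stored buffer returns exactly the same tensors in every epoch.

import torch
from torch.utils.data import Dataset, DataLoader

class RandomFlipDataset(Dataset):
    # Toy dataset that applies a random horizontal flip on every __getitem__
    # call, mimicking on-the-fly augmentation in a real Dataset.
    def __init__(self, n=256):
        self.images = torch.randn(n, 1, 8, 8)
        self.targets = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if torch.rand(1).item() < 0.5:  # random augmentation
            img = torch.flip(img, dims=[-1])
        return img, self.targets[idx]

loader = DataLoader(RandomFlipDataset(), batch_size=32, shuffle=False)

# First approach: the random flips are re-drawn on every pass over the
# DataLoader, so each epoch trains on differently augmented samples.
first_pass = next(iter(loader))[0]
second_pass = next(iter(loader))[0]
print(torch.equal(first_pass, second_pass))  # usually False

# Second approach: the flips were drawn once while the buffer was filled,
# so every epoch reuses exactly these tensors and sees no new augmentation.
buffer = [(data, target) for data, target in loader]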
