Is batching only once legitimate?

Hi,

My data is really slow to batch (I’m working with graphs and sparse matrices).

I was wondering whether it would be okay, and wouldn’t hurt performance, if I batched my data once before training and then fed the model those same batches in a random order, as shown below:

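    import random

    from torch.utils.data import DataLoader

    # Build the (slow) batches once, before training.
    loader = DataLoader(my_data, batch_size=batch_size, shuffle=True)
    batches = list(loader)
    for epoch in range(num_epochs):
        random.shuffle(batches)  # new batch order each epoch, same batch contents
        for batch in batches:
            batch = batch.clone()  # copy so the cached batch is never modified
            batch = batch.to(device)
            ...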

Is this an okay thing to do?

Hi Viktor!

This should be fine. Because you select your batches randomly (even though
you only do this once) and in each epoch you iterate through your batches in
a different order, you should be almost as well off as if you had selected new
random batches for every epoch.
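
To make the difference concrete, here is a minimal sketch on toy data (the ten-sample `TensorDataset` and the helper `batch_membership` are made up for illustration, not taken from your pipeline). It checks that with cached batches every epoch still visits every sample exactly once, but the samples that share a batch never change:

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
random.seed(0)

# Toy dataset (assumption): ten samples whose only feature is their own index.
dataset = TensorDataset(torch.arange(10))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Batch once; every sample lands in exactly one cached batch.
batches = [batch[0] for batch in loader]


def batch_membership(batch_list):
    """Return the set of batches, each as a frozenset of sample indices."""
    return {frozenset(b.tolist()) for b in batch_list}


membership_before = batch_membership(batches)
seen_per_epoch = []
for epoch in range(3):
    random.shuffle(batches)  # only the order of the batches changes
    seen_per_epoch.append(sorted(torch.cat(batches).tolist()))

membership_after = batch_membership(batches)
```

So the only randomness you give up is which samples get grouped together, which is why I would expect the effect on training to be small.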

As an aside, why is batching your data so slow? Do you have an expensive
collate_fn() in your pipeline? Even if preparing individual samples is
expensive, you could do that part just once and then randomly batch the
already-prepared samples together for each epoch using, for example,
torch.stack(). That should be pretty cheap overall.
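
Something along these lines is what I have in mind. Treat it as a rough sketch under assumptions: `expensive_prepare` is a hypothetical stand-in for whatever makes your samples slow, and `torch.stack()` only applies if the prepared samples are dense tensors of the same shape. Graphs or sparse matrices would need their own collation step in its place.

```python
import torch


def expensive_prepare(index):
    """Hypothetical stand-in for slow per-sample preprocessing."""
    return torch.full((3,), float(index))


num_samples, batch_size, num_epochs = 10, 4, 2

# Pay the per-sample cost once, up front.
prepared = [expensive_prepare(i) for i in range(num_samples)]

epoch_batches = []
for epoch in range(num_epochs):
    # Draw a fresh random order of sample indices for this epoch.
    order = torch.randperm(num_samples).tolist()
    batches = []
    for start in range(0, num_samples, batch_size):
        chunk = order[start:start + batch_size]
        # Cheap step: stack already-prepared, same-shaped samples into a batch.
        batch = torch.stack([prepared[i] for i in chunk])
        batches.append(batch)
        # The training step on `batch` would go here.
    epoch_batches.append(batches)
```

Here the batch composition is redrawn every epoch, so you get the usual shuffling behaviour while the expensive preparation still runs only once.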

Best.

K. Frank
