CUDA: feed multiple batches in parallel

Hey,

I have a large dataset X of size N that I want to feed into a network NN(X). I'm trying to split it into batches and run them in parallel on CUDA.

Right now, my code is:

loss = 0
for j in range(0, X.shape[0], batch_size):
    log_qx = Q.log_prob(X[j:j + batch_size])

    loss += loss_func(trueVal=trueVal[j:j + batch_size],
                      samples=X[j:j + batch_size],
                      log_qx=log_qx,
                      log_px=log_px)

loss.backward(retain_graph=True)
optimizer.step()

This is very slow, and I'm wondering if there are any tricks to improve the efficiency. Thanks!

Use nvidia-smi to see if the GPU is working at near 100% utilization.
If it is not, there is probably a bottleneck somewhere; profile your code.
If it is, your only options are to use a smaller model so that it's faster, or to buy more (or better) GPUs.
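For more detail, torch.profiler is the right tool, but even a crude timing wrapper can narrow things down. A minimal sketch (the names in the usage comment are placeholders from the question, not a tested API):

```python
import time

def timed(label, fn, *args, **kwargs):
    # Time one stage of the training step. For CUDA code you would call
    # torch.cuda.synchronize() before reading the clock, since kernel
    # launches are asynchronous and you would otherwise measure only
    # the launch overhead, not the actual compute.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

# Hypothetical usage inside the loop above:
#   log_qx = timed("log_prob", Q.log_prob, X[j:j + batch_size])
total = timed("sum", sum, range(100_000))
```

Wrapping each stage this way (log_prob, loss, backward) usually makes it obvious which one dominates.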

I think it's the for loop that makes training slow, because if I feed the full dataset instead of batches, the training speed is fine.

Then I'd guess your batch size is too small; make it bigger. Or just feed the full dataset like you already did.
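For what it's worth, the loop-overhead argument is easy to see from the slicing pattern alone: the number of Python-level iterations (and with them kernel launches and indexing work) scales inversely with batch_size. A minimal sketch, using a helper name of my own:

```python
def batch_slices(n, batch_size):
    """Yield (start, stop) index pairs covering n samples,
    mirroring range(0, X.shape[0], batch_size) from the question."""
    for j in range(0, n, batch_size):
        yield j, min(j + batch_size, n)

# Larger batches -> far fewer iterations of the slow Python loop.
small = list(batch_slices(10_000, 10))    # 1000 iterations
large = list(batch_slices(10_000, 2500))  # 4 iterations
print(len(small), len(large))
```

Each GPU operation on a tiny batch launches roughly the same fixed overhead, so 1000 small launches cost far more wall time than 4 large ones doing the same total work.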