How to implement full-batch GD without running out of memory

I’m trying to implement GD (not SGD), meaning that I want to run the forward pass over all the data and only then run the backward pass.
The data obviously has to be split into batches to fit in memory, so I tried this:

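    loss_array = []  # collects one scalar loss per batch
    for batch_idx, (data, target) in enumerate(train_loader):
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        output = model(data)
        print("One iter is out", output.shape)
        loss = criterion(output, target)
        loss_array.append(loss)
    # average the per-batch losses, then do a single backward pass
    loss = torch.stack(loss_array).mean()
    loss.backward()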

As you can see, I split the data into batches for the forward pass, collect the losses in a small loss array, and only then run the backward pass. The problem is that I run out of memory, which is weird, because the only difference from a normal run is that I keep a small loss tensor, which should be negligible.

I’d be happy to hear any ideas or advice.

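The stored losses are not negligible: each loss tensor also keeps the entire computation graph of its batch (including the intermediate activations) alive until backward is called, so in the end you would use the same memory as a single forward pass over the whole dataset.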
You could simulate the larger batch size using one of the approaches described in this post.
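
One common way to simulate the larger batch is gradient accumulation: call backward on each chunk right away, so its graph is freed, and step the optimizer only once at the end. Whether this matches the approaches in the linked post is an assumption on my part. Below is a minimal sketch on toy CPU data; `full_batch_step`, the `nn.Linear` model and the random tensors are hypothetical stand-ins, and it assumes the criterion averages over the chunk and that `loader.dataset` has a length.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def full_batch_step(model, criterion, optimizer, loader, device="cpu"):
    """One full-batch GD step via gradient accumulation (hypothetical helper)."""
    n_total = len(loader.dataset)
    optimizer.zero_grad()
    total_loss = 0.0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        # Assumes criterion returns the mean over the chunk. Reweighting by
        # chunk_size / n_total makes the accumulated gradient equal to the
        # gradient of the mean loss over the whole dataset.
        loss = criterion(output, target) * (data.size(0) / n_total)
        # backward() frees this chunk's graph; gradients add up in .grad.
        loss.backward()
        total_loss += loss.item()
    optimizer.step()  # a single parameter update for the whole dataset
    return total_loss


# Toy setup (all names and shapes here are illustrative assumptions).
torch.manual_seed(0)
features = torch.randn(10, 4)
labels = torch.randint(0, 3, (10,))
loader = DataLoader(TensorDataset(features, labels), batch_size=4)
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_loss = full_batch_step(model, criterion, optimizer, loader)
print("full-batch loss:", epoch_loss)
```

With this pattern only one chunk's graph is alive at a time, so peak memory should be close to that of a normal mini-batch run, while the update itself corresponds to the full-dataset gradient.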