Finding the maximal batch size that fits in GPU memory

Hello everyone,

I have been working on code to train a neural network,
and right now I'm writing a feature that finds the maximum batch size that fits into GPU memory
for a given model and training set.

So here is my code:

import os

import GPUtil
import numpy as np
import torch
import torch.optim as optim

# PytorchModel, get_first_sample, insert_sample, X, Y and the config_* dicts
# come from the rest of my project.


def get_free_memory():
    # Sum the free memory (in MiB, as reported by the driver) over the GPUs
    # visible to this process.
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    visible_ids = None if visible is None else visible.split(',')
    memory = 0
    for gpu in GPUtil.getGPUs():
        if visible_ids is None or str(gpu.id) in visible_ids:
            memory += gpu.memoryFree
    return memory


def no_free_mem(mem_per_sample, available):
    # Stop if less than five times the largest per-sample cost is still free.
    return 5 * np.array(mem_per_sample).max() > available


def main():
    model = PytorchModel(config_network, config_inputs, config_outputs, "")
    model = model.cuda()

    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)

    max_len = config_network["max_sequence_length"]

    x = get_first_sample(X, max_len, model.inputs_cfg)
    y = get_first_sample(Y, max_len, model.outputs_cfg, output=True)
    
    optimizer = optim.Adam(filter(lambda param: param.requires_grad, model.parameters()))

    moremem = True
    batch_size = 1
    prev_freemem = get_free_memory()
    mem_per_sample = [0]
    optimizer.zero_grad()
    
    while moremem:
        # Forward pass on the current batch.
        y_pred, _, _ = model(x)
        freemem = get_free_memory()
        if no_free_mem(mem_per_sample, freemem): break

        # Loss computation.
        loss, _ = model.loss(y_pred, y)
        freemem = min(freemem, get_free_memory())
        if no_free_mem(mem_per_sample, freemem): break

        # Backward pass.
        loss.backward()
        freemem = min(freemem, get_free_memory())
        if no_free_mem(mem_per_sample, freemem): break

        # Optimizer step.
        optimizer.step()
        freemem = min(freemem, get_free_memory())
        if no_free_mem(mem_per_sample, freemem): break

        # Record how much memory this extra sample appears to have cost.
        if prev_freemem - freemem > 0:
            mem_per_sample.append(prev_freemem - freemem)

        if no_free_mem(mem_per_sample, freemem): break

        # Grow the batch by one sample and try again.
        batch_size += 1
        prev_freemem = min(prev_freemem, freemem)

        x = insert_sample(x)
        y = insert_sample(y)

    print("GUESSING batch_size, ", batch_size)

I measure how much GPU memory is available at each step of the forward and backward passes,
and I grow the batch iteratively until memory saturates. I also keep track of how much memory each sample requires, so I can predict whether there is enough memory left to insert one more sample into the batch.
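
For reference, the per-sample numbers above come from GPUtil, i.e. free memory as the driver reports it. Here is a small sketch of the counters I could read from inside PyTorch instead (they only track PyTorch's own caching allocator, so they can differ from the driver-level numbers; the helper name report_memory is just for illustration):

import torch

def report_memory(tag=""):
    # Bytes currently held by live tensors on the default CUDA device.
    allocated = torch.cuda.memory_allocated() / 1024**2
    # Bytes held by PyTorch's caching allocator (includes freed-but-cached blocks).
    reserved = torch.cuda.memory_reserved() / 1024**2
    # High-water mark of allocated bytes since the last reset_peak_memory_stats().
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB, peak={peak:.1f} MiB")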
In my head this should work, but in practice the behaviour is unstable:
sometimes it runs fine and sometimes it crashes with out-of-memory errors.
I have tried increasing the safety margin in the stop condition; that makes it crash less often, but nothing works 100% of the time.
So I guess I must be missing something important here.

One pattern I have observed is that crashes are especially frequent when I do multi-task learning,
and most of the time they occur in the loss.backward() step.

Could someone help me fix this problem, or suggest another way to estimate a memory-optimal batch size?
I have found some equations on this forum, but none of them worked for all the architectures I tested.
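
To give an idea of what I mean by "another way": something along these lines, where I would simply try a batch and back off when CUDA raises an out-of-memory error (just a sketch; probe_batch_size and run_one_step are placeholder names, and run_one_step would wrap my forward/backward/step code):

import torch

def probe_batch_size(run_one_step, start=1, max_batch_size=1024):
    # run_one_step(batch_size) is a placeholder: it should build a batch of the
    # given size and do one forward/backward/optimizer step.
    batch_size = start
    while batch_size <= max_batch_size:
        try:
            run_one_step(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise                    # not an OOM error, re-raise it
            torch.cuda.empty_cache()     # drop cached blocks before stopping
            break
        batch_size *= 2                  # grow geometrically while it still fits
    return max(1, batch_size // 2)       # last size that ran without OOM (at least 1)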

Thank you very much!
Gabriel M


Hi Gabriel (@gabriel),

Another issue to consider when implementing something like this is that for many neural network models, batch_size is a very sensitive parameter that affects performance. It would be one thing to find the best batch size once for the whole training run and then keep it constant; but since you are changing it at every step, it might also cause instability in the learning itself. I would advise you to test this hypothesis as well.

An adaptive batch size is also useful in the inference phase.

I have tried to implement a feature similar to yours, and I ran into similar problems.

I guess the major problems are as follows:

  1. In the forward (inference) or backward pass, some tensors/gradients can be freed as soon as the computations that need them have finished. However, in the first passes PyTorch does not yet know when it can free which tensor/gradient, since it is a dynamic-graph framework, so the memory cost is high. After several passes it has seen the architecture and frees tensors/gradients as early as possible, so the memory cost drops.
  2. PyTorch chooses the underlying computation algorithm based on the batch size and other conditions, so the memory cost is not determined by the batch size alone (see the sketch below).
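
Regarding point 2: if the varying memory cost comes from cuDNN auto-tuning the convolution algorithm for each input shape, one thing that may make memory use more repeatable (I have not verified this for your model, so treat it as a guess) is to turn the auto-tuner off:

import torch

# Disable cuDNN's per-shape algorithm benchmarking; this usually trades some speed
# for a more predictable algorithm choice, and hence more predictable memory use.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True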

Hi @YichengWang, regarding your point 1: it sounds like after a few passes PyTorch will do the deletion automatically. Is this confirmed anywhere in the PyTorch documentation?

Also, what is the motivation for tuning the batch size when training a DNN model?
Is it that we want to use 100% of the GPU memory in order to speed up training?
If I use only, say, 80% of the GPU instead of 100%, my model would still converge, but it would take more batches to finish training, right?

@ecolss I think it is to speed up the training process, yes.

Why would you want to use only 80% of the GPU’s memory?

@YichengWang But can PyTorch really know whether a tensor will still be used before it is destroyed? Depending on how the Python code is written, a tensor may sit unused for a long time before its destruction in almost all passes, but then in one pass it is suddenly used right before being destroyed. If PyTorch had decided it could free that tensor's GPU memory early, it would have made a mistake. Right?