Interactions between Mixed Precision Training and Memory when using CUDA

My understanding of mixed precision training is that there is a tensor of master weights stored in FP32. On each training iteration, a local FP16 copy of the weights is made and used for the forward and backward passes.
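
Concretely, my mental model of one iteration looks roughly like this toy sketch (not what any library actually does; the names, shapes, and loss scale are all made up):

import torch

device = torch.device("cuda")
master_w = 0.01 * torch.randn(256, 256, device=device)   # FP32 master weights
x = torch.randn(64, 256, device=device).half()           # FP16 input batch
scale = 1024.0                                           # static loss scale

for _ in range(3):
    w16 = master_w.half().requires_grad_()               # per-iteration FP16 working copy
    loss = (x @ w16).pow(2).mean()                       # forward pass in FP16
    (loss * scale).backward()                            # FP16 gradients land in w16.grad
    with torch.no_grad():
        master_w -= 1e-3 * (w16.grad.float() / scale)    # unscale, update the FP32 master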

This is the first interaction I’ve got a question about: If all the propagation and matrix multiplications (done on the GPU) only use this local copy of FP16 weights, does that mean that the FP32 master weights can be stored in CPU memory, with the FP16 weights being a copy from CPU memory -> GPU memory (that also halves precision) performed each iteration? I believe this is an implementation detail in PyTorch (all I do on my end is set precision=16 on the Trainer along with an amp_level), but an answer would help me understand the pipeline as a whole.

Moving from weights-side of things to the data side: I’ve defined this dataset:

import torch

# LabeledAudioDataset and SIZED_SCORES are defined elsewhere in my code base
class InMemoryLabeledAudioDataset(LabeledAudioDataset):
    def __init__(self, labeled_journal_entry_ids, feature_len, max_steps, model_category):
        super().__init__(labeled_journal_entry_ids, feature_len, max_steps)

        self.model_category = model_category
        expected_y_len = SIZED_SCORES[model_category]

        # Preload the entire dataset into CPU memory as FP32 tensors
        self.x_batch = torch.empty((len(self), self.feature_len, self.max_steps))
        self.y_batch = torch.empty((len(self), expected_y_len))

        for index in range(len(self)):
            # The parent __getitem__ reads the audio from disk
            self.x_batch[index], self.y_batch[index] = super().__getitem__(index)

    def __getitem__(self, index):
        return self.x_batch[index], self.y_batch[index]

LabeledAudioDataset (specifically its __getitem__) reads audio from disk; that part is abstracted away. All of my data fits in CPU memory as FP32, but not in GPU memory. As FP16, it fits in both CPU and GPU memory. I can think of three different ways to feed this data into my network (rough sketches of each follow the list):

  1. Store all data in CPU memory as FP32 and perform a one-time, precision-halving copy to GPU memory. Every mini-batch then lives in GPU memory as FP16 and can be read from there each iteration.

  2. Store all data in GPU memory as FP16 only (no FP32 copy of the data exists anywhere, and the CPU memory is freed).

  3. Store all data in CPU memory as FP32 and copy each mini-batch to GPU memory at full precision. Every mini-batch has to be copied from the CPU as FP32 each iteration, because not all FP32 mini-batches fit in GPU memory at once.
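
To make the options concrete, here are rough sketches of each (the tensors and shapes are just stand-ins for my real data):

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
x_cpu = torch.randn(10000, 64, 100)                # stand-in for my FP32 data in CPU memory
y_cpu = torch.randn(10000, 4)

# Option 1: one-time copy to the GPU that also halves precision
x_gpu = x_cpu.to(device, dtype=torch.float16)
y_gpu = y_cpu.to(device, dtype=torch.float16)

# Option 2: as option 1, but the FP32 CPU copy is freed afterwards
del x_cpu, y_cpu

# Option 3: keep FP32 on the CPU; copy one FP32 mini-batch per iteration
x_cpu32, y_cpu32 = torch.randn(10000, 64, 100), torch.randn(10000, 4)
loader = DataLoader(TensorDataset(x_cpu32, y_cpu32), batch_size=32)
for xb, yb in loader:
    xb, yb = xb.to(device), yb.to(device)          # host-to-device copy every step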

This is the second interaction I’ve got a question about: which of these options does the DataLoader actually execute (it could also be something outside these three), and how explicit am I supposed to be in LabeledAudioDataset? By explicit, I mean whether I should be calling half() on self.x_batch and self.y_batch (even though my Trainer already has precision=16), and whether I should call to(my_cuda_device), even though my Trainer was given gpus=1. In other words, what will be taken care of for me vs. what am I supposed to do explicitly on my end?

I know that the goal of mixed precision training is to only have some specific tensors in FP16, but is the data one of those tensors? Will options 1 and 2 make the network perform sub-optimally loss-wise compared to option 3 because of the data’s lower precision? Finally, how do the pin_memory and num_workers DataLoader arguments change things here?

The questions I have are all over the place, but I’ve read discussions about how to store data in GPU memory and what is done in Dataset vs. DataLoader, and haven’t been able to piece together the big picture.

Master parameters and gradients are used in apex/amp with opt_level='O2'.
The current PyTorch master (and the nightly binaries) contains the native mixed-precision implementation, which we recommend using now; you can have a look at the documentation here.
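
A minimal sketch of the native API, with a toy model and random data just for illustration:

import torch
from torch import nn

device = torch.device("cuda")
model = nn.Linear(64, 4).to(device)                      # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    data = torch.randn(32, 64, device=device)            # toy batch
    target = torch.randn(32, 4, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                      # forward pass runs in mixed precision
        loss = nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()                        # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                               # unscales the gradients, skips the step on inf/NaN
    scaler.update()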

While that might be possible, you would most likely lose the performance benefits from using TensorCores due to the data transfer, so the master params are stored on the GPU.

You should not call half() on the data or the model; apex/amp or torch.cuda.amp.autocast takes care of the casting for you.
I’m not sure where the Trainer class comes from, but I assume it’s from a high-level API?
If so, I would expect the same behavior, i.e. no half() calls.

These arguments to the DataLoader are completely independent of mixed-precision training.
num_workers will use multiprocessing, where each worker loads a batch in the background while your training is running.
pin_memory=True will use page-locked memory to speed up the data transfer between the host and the device and allows you to call tensor.to(non_blocking=True) for an asynchronous transfer.
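
For example (toy dataset, just to show where the arguments go):

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
dataset = TensorDataset(torch.randn(1024, 64), torch.randn(1024, 4))   # toy data
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

for data, target in loader:
    # pinned (page-locked) source memory lets these copies overlap with computation
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)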

Thank you for the great answer, the whole pipeline is a lot clearer now.

My bad, Trainer comes from PyTorch Lightning, but it still sounds like the casting will be taken care of, because it uses apex/amp under the hood.
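
For reference, this is roughly how I construct it (the exact argument names depend on the Lightning version; the LightningModule itself isn't shown):

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=1, precision=16, amp_level='O2')
# trainer.fit(model)  # model is my LightningModule (not shown here)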

I have one followup question:

All my data, when in FP16, can fit entirely on the GPU, which to my understanding would be more performant than transferring data from host to device every iteration, even with pin_memory=True and multiple workers.

The problem is that I can never store all the data on the GPU in FP32, even as an intermediate step, because it doesn’t fit. I need the half() call to be applied before the to(cuda_device) / cuda() call. Is there a way for me to move the data to the GPU after the half() call that amp/apex or torch.cuda.amp does for me? In other words, is there a way for half() to be applied to the CPU tensor for the entire data batch, and for the tensor to be copied to the GPU once afterwards?

You could transform the data to half on the CPU, push it to the device, and transform the batch back to float before calling the forward function.
Note that your training also needs to store intermediate tensors (activations) to be able to calculate the gradients, but I assume you’ve already checked that.
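
A minimal sketch of that idea, with a stand-in tensor for your data:

import torch

device = torch.device("cuda")
x_cpu = torch.randn(10000, 64, 100)      # stand-in for your FP32 data in CPU memory
x_gpu = x_cpu.half().to(device)          # cast to FP16 on the CPU, then a single copy to the GPU
del x_cpu                                # the FP32 copy is no longer needed

# inside the training loop / training_step:
batch = x_gpu[:32].float()               # cast the mini-batch back to FP32 before the forward pass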