My understanding of mixed precision training is that there is a master copy of the weights kept in FP32. Each training iteration, a local FP16 copy of the weights is made and used for the forward and backward passes.
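To make that mental model concrete, here's a minimal hand-rolled sketch of the loop I have in mind. This is just my understanding, not what torch.cuda.amp (or Lightning) actually does internally; loss scaling is omitted, and the tiny model and data are stand-ins:

import torch

model = torch.nn.Linear(16, 4).cuda()  # stand-in model, FP32 on the GPU
lr = 1e-3
master_weights = [p.detach().clone() for p in model.parameters()]  # FP32 masters

for _ in range(10):
    x = torch.randn(8, 16, device="cuda")  # stand-in mini-batch
    y = torch.randn(8, 4, device="cuda")

    # 1. Refresh the local FP16 working copy of the weights.
    for p, master in zip(model.parameters(), master_weights):
        p.data = master.half()

    # 2. Forward and backward passes run in FP16.
    loss = torch.nn.functional.mse_loss(model(x.half()), y.half())
    loss.backward()

    # 3. Upcast the FP16 gradients and apply them to the FP32 master weights.
    with torch.no_grad():
        for p, master in zip(model.parameters(), master_weights):
            master -= lr * p.grad.float()
            p.grad = None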
This is the first interaction I've got a question about: if all the propagation and matrix multiplications (done on the GPU) only use this local copy of FP16 weights, does that mean the FP32 master weights can be stored in CPU memory, with the FP16 weights being a copy from CPU memory to GPU memory (one that also halves precision) performed each iteration? I believe this is an implementation detail in PyTorch (all I do on my end is set precision=16 on the Trainer along with an amp_level), but an answer would help me understand the pipeline as a whole.
Moving from the weights side of things to the data side: I've defined this dataset:
class InMemoryLabeledAudioDataset(LabeledAudioDataset):
    def __init__(self, labeled_journal_entry_ids, feature_len, max_steps, model_category):
        super().__init__(labeled_journal_entry_ids, feature_len, max_steps)
        self.model_category = model_category
        expected_y_len = SIZED_SCORES[model_category]

        # Preload the entire dataset into CPU memory as FP32 tensors.
        self.x_batch = torch.empty((len(self), self.feature_len, self.max_steps))
        self.y_batch = torch.empty((len(self), expected_y_len))
        for index in range(len(self)):
            self.x_batch[index], self.y_batch[index] = super().__getitem__(index)

    def __getitem__(self, index):
        return self.x_batch[index], self.y_batch[index]
LabeledAudioDataset (specifically its __getitem__) reads audio from disk; that part is just abstracted away. All the data I have fits in CPU memory as FP32, but not in GPU memory; as FP16 it would fit in both. I can think of three different ways to go about feeding this data into my network (sketched in code after the list):
- Store all data in CPU memory as FP32 and perform a one-time half-precision copy to GPU memory. All mini-batches then live in GPU memory as FP16 and can be read from there each iteration.
- Store all data in GPU memory as FP16 (no FP32 instance of the data will exist, and the CPU memory will be freed).
- Store all data in CPU memory as FP32 and perform a full-precision local copy to GPU memory each iteration. Each mini-batch is copied from the CPU as FP32, and because not all FP32 mini-batches can exist in GPU memory at the same time, this local copy has to happen every iteration.
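For concreteness, here are hypothetical sketches of the three options (the tensor name and shapes are made up; data_f32 stands in for my whole dataset):

import torch

data_f32 = torch.randn(1000, 40, 200)  # entire dataset as FP32, in CPU memory

# Option 1: one-time halving copy; FP16 mini-batches are then read from GPU memory.
data_f16_gpu = data_f32.to("cuda", dtype=torch.float16)

# Option 2: the same copy, but the FP32 original is freed so only FP16 exists:
#     data_f16_gpu = data_f32.to("cuda", dtype=torch.float16)
#     del data_f32

# Option 3: FP32 stays on the CPU; each iteration copies one mini-batch over:
#     for indices in batch_sampler:
#         x = data_f32[indices].to("cuda")  # full-precision per-iteration copy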
This is the second interaction I've got a question about: which of these options is what DataLoader actually executes (it could also be an option outside the three), and how explicit am I supposed to be in LabeledAudioDataset? By explicit, I mean whether I should be calling half() on self.x_batch and self.y_batch (even though my Trainer already has precision=16), and whether I should call to(my_cuda_device) or not, even though my Trainer was provided with gpus=1. In other words, what will be taken care of for me vs. what am I supposed to do explicitly on my end?
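By way of illustration, the explicit version I'm asking about would look something like this at the end of InMemoryLabeledAudioDataset.__init__ (my_cuda_device and the shapes are hypothetical stand-ins):

import torch

my_cuda_device = torch.device("cuda:0")  # hypothetical handle to the Trainer's GPU

x_batch = torch.empty(64, 40, 200)  # stand-ins for self.x_batch / self.y_batch
y_batch = torch.empty(64, 5)

# Explicitly halve precision and move the tensors to the GPU myself:
x_batch = x_batch.half().to(my_cuda_device)
y_batch = y_batch.half().to(my_cuda_device)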
I know that the goal of mixed precision training is to have only some specific tensors in FP16, but is the data one of those tensors? Will options 1 and 2 make the network perform sub-optimally loss-wise compared to option 3 because of the data's lower precision? Finally, how do the pin_memory and num_workers DataLoader arguments change things here?
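For reference, this is roughly how I construct the loader; the TensorDataset stand-in just makes the snippet self-contained where I'd pass my InMemoryLabeledAudioDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 40, 200), torch.randn(64, 5))  # stand-in

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    pin_memory=True,  # page-locked host memory, relevant for CPU -> GPU copies
    num_workers=4,    # worker subprocesses that call __getitem__
)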
The questions I have are all over the place, but I've read discussions about how to store data in GPU memory and about what is done in Dataset vs. DataLoader, and I haven't been able to piece together the big picture.