# How to perform mixed precision on a single F.linear?

Dear Community,

I was wondering whether I can further reduce memory consumption on a specific operation that I identified as using a lot of memory. I noticed that 1) the A and B matrices are both float32 and 2) the resulting coefficients c are also float32, which leads to rather large memory usage.

While I can manually cast A and B to float16 so that the resulting coefficients c are also float16, which reduces the memory footprint considerably, I was wondering whether it is possible to use mixed precision instead of manually casting the tensors to float16?

I am not sure whether this is necessary, since we don't need gradients on A and B, only on the coefficients c.

This is my code:

```python
self.A = torch.from_numpy(GBTA).to(device, dtype=torch.float16).requires_grad_(False)
# self.A is already float16, so the extra .half() on self.A[t] is redundant
c_t = F.linear(c_t.half(), self.A[t].half()) + self.B[t].squeeze(-1) * input
```
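For context, the savings come purely from the element size: float16 stores 2 bytes per element instead of 4, and `F.linear` propagates the input dtype to its output, so casting the inputs is enough to get float16 coefficients. A minimal sketch (the shapes are made up for illustration; `A` and `x` stand in for the real `GBTA` matrix and `c_t`):

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for A (from GBTA) and c_t; shapes are illustrative.
A = torch.randn(256, 256)   # float32 weight: 4 bytes per element
x = torch.randn(32, 256)    # float32 input

c32 = F.linear(x, A)        # float32 coefficients

A16, x16 = A.half(), x.half()
# Half precision halves the storage of every tensor involved,
# including the coefficients c and any activations autograd saves.
assert x.element_size() == 4 and x16.element_size() == 2
assert A.element_size() == 4 and A16.element_size() == 2
```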

I am not sure how to wrap `GradScaler` into this, as we already use autocast mixed precision and pass the scaler from `GradScaler` into the training function:

```python
scaler = GradScaler()
train_loss, train_class_acc, train_noobj_acc, train_obj_acc = (
    trainholov4_enas_vid_bptt(
        device,
        model,
        optimizer,
        scheduler,
        loss_f,
        scaled_anchors,
        scaler,
        conf_thresh=0.8,
        mode="ciou",
        target_batch_size=args.target_batch_size,
        ngpus=args.ngpus,
    )
)
```
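For reference, here is a minimal sketch of the standard `GradScaler` pattern the training function presumably follows internally (the model, optimizer, and shapes are made up; `enabled=False` turns every call into a pass-through, so this also runs on a CPU-only machine):

```python
import torch
from torch.cuda.amp import GradScaler

# Toy model/optimizer standing in for the real training setup.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler(enabled=torch.cuda.is_available())

x = torch.randn(8, 4)
loss = model(x).sum()

optimizer.zero_grad()
scaler.scale(loss).backward()  # scale the loss to avoid float16 gradient underflow
scaler.step(optimizer)         # unscale gradients, skip the step on inf/nan
scaler.update()                # adapt the scale factor for the next iteration
```

Since the scaling only touches the loss and the gradients, ops inside the forward pass (such as the `F.linear` above) don't need to interact with the scaler directly.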

So it may not be needed to add `GradScaler`, and maybe something like this would just work?

```python
self.A = torch.from_numpy(GBTA).to(device, dtype=torch.float16).requires_grad_(False)
with autocast():
    c_t = F.linear(c_t.half(), self.A[t].half()) + self.B[t].squeeze(-1) * input
```

The casts should not be needed when `autocast` is used. Do you see a different behavior and memory usage between `autocast` and the explicit transformations?
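A quick way to verify this on any machine is CPU autocast, which casts eligible ops to bfloat16 (the CUDA version, `torch.autocast(device_type="cuda")`, casts to float16 instead). A sketch with made-up shapes standing in for `self.A[t]` and `c_t`:

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for self.A[t] and c_t; both start as float32.
A = torch.randn(64, 64)
c = torch.randn(8, 64)

# CPU autocast lowers matmul-like ops to bfloat16; CUDA autocast uses float16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = F.linear(c, A)  # no manual .half()/.bfloat16() casts needed

assert out.dtype == torch.bfloat16  # autocast handled the casting
```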