Hey,
here I try to describe my out-of-memory error in as much detail as possible.
I am currently trying to implement a large scale super-resolution model that incorporates residual blocks and a subpixel upscaling layer.
My model has 42 million parameters and operates on an input image of size (3, 180, 320) with a batch size of 1, since the model is so large that only a single image can pass through at once. Evaluation works fine, but training produces an out-of-memory error during the second epoch (i.e. on the second image passed through the network). I am trying to understand exactly why this happens; I tried to check which tensors remain in memory, to no avail.
I won’t post the whole network code since it is long and complex and I don’t think the error lies in the network code (correct me if I’m wrong and I can send the network code too).
Here is the training loop:
while True:
    #this function loads an image from a video and in this case this is a list of a list
    #of a single numpy array of shape (1080, 1920, 3)
    img = video_loader.load_batch(1,1,1, rpn_model.num_input_imgs, 0.0, 0.0, seed = 19)
    #here is the training function, the name comes from this script being
    #a test of how large the model can be during training
    test_scale(img, rpn_model, downscale_factor, bicubic_scale_factor, net_scale_factor, optimizer)
and the content of the training function (I tried to comment it as clearly as possible; if you have questions, please ask):
def test_scale(img, rpn_model, downscale_factor, bicubic_scale_factor, net_scale_factor, optimizer):
    #img[0][0] is a full hd, rgb color image as a numpy array with shape (1080, 1920, 3)
    #rpn_model is a 43M parameter resnet subpixel super resolution model
    #downscale_factor = 2, bicubic_scale_factor = 1.5, net_scale_factor = 3, optimizer = Adam
    print(torch.cuda.memory_allocated() / 1000 / 1000)
    print(torch.cuda.max_memory_allocated() / 1000 / 1000)
    print(torch.cuda.memory_reserved() / 1000 / 1000)
    print(torch.cuda.max_memory_reserved() / 1000 / 1000)
    print(torch.cuda.memory_summary())
    optimizer.zero_grad()
    #crop the input image by reducing height and width by factor of 2
    new_width = int(img[0][0].shape[1] / downscale_factor)
    new_height = int(img[0][0].shape[0] / downscale_factor)
    offset_x = int((img[0][0].shape[1] - new_width) / 2)
    offset_y = int((img[0][0].shape[0] - new_height) / 2)
    img_crop = img[0][0][offset_y:offset_y + new_height,
                         offset_x:offset_x + new_width,
                         :]
    #convert numpy array to tensor, this is the target HR image, shape is (1, 3, 540, 960)
    img_crop_tens = tensorfy_img(img_crop,img_crop.shape[1], img_crop.shape[0])
    #scale down image and scale it up again to introduce information loss
    smallest_img = cv2.resize(img_crop,
                              dsize=(int(img_crop.shape[1] / net_scale_factor / bicubic_scale_factor),
                                     int(img_crop.shape[0] / net_scale_factor / bicubic_scale_factor)))
    input_img = cv2.resize(img_crop, dsize=(int(img_crop.shape[1] / net_scale_factor),
                                            int(img_crop.shape[0] / net_scale_factor)))
    #convert input image to tensor, shape is (1, 3, 180, 320)
    input_img = tensorfy_img(input_img, input_img.shape[1], input_img.shape[0])
    #create model output, output is single tensor
    model_output = rpn_model(input_img)
    #Simple mean absolute error, function inputs are swapped compared to torch loss order
    loss = network.SimpleRLoss(img_crop_tens, model_output)
    print("Loss during batch training:", loss.item())
    loss.backward()
    optimizer.step()
    #these deletions and cache empty calls do not fix the error, even when detaching all tensors beforehand
    #del img_crop_tens
    #del input_img
    #del model_output
    #del loss
    #del rpn_model
    #del img_crop, smallest_img, img
    #del optimizer
    #torch.cuda.empty_cache()
    #this function shows that the model weights (or gradients?) are apparently not cleared from memory
    #dump_tensors
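To double-check the shapes quoted in the comments, here is a small sketch that replays only the integer arithmetic of the crop and resize steps, without touching any image data. The helper name `trace_shapes` is made up for this illustration; the factors are the ones listed in the comment at the top of `test_scale`.

```python
def trace_shapes(height, width, downscale_factor, net_scale_factor, bicubic_scale_factor):
    """Replay the size arithmetic of test_scale and return (height, width) pairs."""
    # centre crop: both sides shrink by downscale_factor
    crop_h = int(height / downscale_factor)
    crop_w = int(width / downscale_factor)
    # network input: the crop reduced by net_scale_factor
    in_h = int(crop_h / net_scale_factor)
    in_w = int(crop_w / net_scale_factor)
    # smallest_img: the crop reduced by both factors
    small_h = int(crop_h / net_scale_factor / bicubic_scale_factor)
    small_w = int(crop_w / net_scale_factor / bicubic_scale_factor)
    return {"crop": (crop_h, crop_w), "input": (in_h, in_w), "smallest": (small_h, small_w)}


shapes = trace_shapes(1080, 1920, 2, 3, 1.5)
print(shapes)
```

With the values from the post this gives a 540 × 960 target and a 180 × 320 network input, matching the comments, and `smallest_img` comes out at 120 × 213. Note that, as posted, `smallest_img` is not used again after it is computed.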
As for helper functions in this script, we have the numpy array to tensor function:
def tensorfy_img(img, width, height):
    img_tensor = torch.from_numpy(np.transpose(img, (2,0,1))).float().to(device=comp_device)
    img_tensor = img_tensor.reshape(1,3,height, width)
    return img_tensor
and the loss function
def SimpleRLoss(in_hr, out_hr):
    #recon_loss = nn.MSELoss()
    mae_loss = nn.L1Loss()
    loss = mae_loss(out_hr, in_hr)
    return loss
and the dump tensors function, taken from another forum post (sadly I no longer have the link, sorry):
def dump_tensors(gpu_only=True):
    """Prints a list of the Tensors being tracked by the garbage collector."""
    import gc
    total_size = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if not gpu_only or obj.is_cuda:
                    #note: is_pinned is a method, so without parentheses this test is
                    #always true and every line gets labelled "pinned"
                    print("%s:%s%s %s" % (type(obj).__name__,
                                          " GPU" if obj.is_cuda else "",
                                          " pinned" if obj.is_pinned else "",
                                          pretty_size(obj.size())))
                    total_size += obj.numel()
            elif hasattr(obj, "data") and torch.is_tensor(obj.data):
                if not gpu_only or obj.is_cuda:
                    print("%s → %s:%s%s%s%s %s" % (type(obj).__name__,
                                                   type(obj.data).__name__,
                                                   " GPU" if obj.is_cuda else "",
                                                   " pinned" if obj.data.is_pinned else "",
                                                   " grad" if obj.requires_grad else "",
                                                   " volatile" if obj.volatile else "",
                                                   pretty_size(obj.data.size())))
                    total_size += obj.data.numel()
        except Exception as e:
            pass
    print("Total size:", total_size)
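`dump_tensors` calls a `pretty_size` helper that is not shown here. Judging from the output format further down, it presumably just joins the dimensions with " × ". The version below is a guess along those lines, not necessarily the original helper, and it accepts any iterable of ints so it can be tried without a GPU.

```python
def pretty_size(size):
    """Format a shape such as torch.Size([1, 3, 540, 960]) as '1 × 3 × 540 × 960'."""
    return " × ".join(str(dim) for dim in size)


image_shape = pretty_size((1, 3, 540, 960))
scalar_shape = pretty_size(())  # a 0-dim tensor (e.g. the loss) has an empty shape
print(image_shape)
print(repr(scalar_shape))
```

An empty shape yields an empty string, which would explain a dump line that shows no dimensions at all.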
The individual pieces work fine in another script (arranged slightly differently) with a smaller network, even for a large number of epochs. This makes me wonder whether there is some base level of memory allocation that can’t be cleared, which would force me to downsize my network.
Here are some memory stats before the first epoch:
168.154624 #memory allocated in MB
168.154624 #max memory allocated in MB
190.840832 #memory reserved in MB
190.840832 #max memory reserved in MB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  164213 KB |  164213 KB |  164213 KB |       0 B  |
|       from large pool |  163872 KB |  163872 KB |  163872 KB |       0 B  |
|       from small pool |     341 KB |     341 KB |     341 KB |       0 B  |
|---------------------------------------------------------------------------|
Everything looks fine so far.
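One thing to keep in mind when comparing the numbers: the printed values are divided by 1000 twice (decimal MB), while the table figures only line up with them if the table's KB and MB are read as multiples of 1024. A quick check of that reading with the numbers from this post (the function name is only for illustration):

```python
def summary_to_decimal_mb(value, unit):
    """Convert a memory summary figure (assumed 1024-based units) to decimal megabytes."""
    factor = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}[unit]
    return value * factor / 1000 / 1000


table_mb = summary_to_decimal_mb(164213, "KB")  # "Allocated memory" row of the table
printed_mb = 168.154624                          # value printed by memory_allocated()
print(round(table_mb, 3), printed_mb)
```

The two agree to within a kilobyte, which is the rounding of the table itself.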
Here are the same stats before the second epoch:
673.896448 #memory allocated in MB
6103.84896 #max memory allocated in MB
6316.621824 #memory reserved in MB
6320.8161279999995 #max memory reserved in MB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 3            |        cudaMalloc retries: 3         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  658102 KB |    5821 MB |   22377 MB |   21734 MB |
|       from large pool |  656736 KB |    5819 MB |   22373 MB |   21732 MB |
|       from small pool |    1366 KB |       2 MB |       3 MB |       2 MB |
|---------------------------------------------------------------------------|
As you can see, there is now approx. 660 MB of memory still allocated even after deleting all variables/tensors and emptying the cache. And when I dump the tensors I can see that there are still some weights/gradients(?) in memory:
Tensor: GPU pinned 1 × 3 × 540 × 960
Tensor: GPU pinned 1 × 3 × 180 × 320
Tensor: GPU pinned 1 × 3 × 540 × 960
Tensor: GPU pinned
Tensor: GPU pinned 256 × 3 × 9 × 9
Tensor: GPU pinned 256 × 3 × 9 × 9
Tensor: GPU pinned 256
Tensor: GPU pinned 256
Tensor: GPU pinned 256 × 256 × 3 × 3
Tensor: GPU pinned 256 × 256 × 3 × 3
Tensor: ...
Parameter: GPU pinned 256 × 3 × 9 × 9
Parameter: GPU pinned 256
Parameter: GPU pinned 256 × 256 × 3 × 3
Parameter: GPU pinned 256
Parameter: GPU pinned 256 × 256 × 3 × 3
Parameter: GPU pinned 256
Parameter: ...
So it seems that something is left over where nothing should be, and these roughly 500 MB are exactly what I need to train this network.
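A rough sanity check on these numbers, under assumptions that are not confirmed anywhere above: the weights are float32, the 168 MB allocated before the first step is essentially just the weights, and Adam keeps two extra buffers per parameter (running averages of the gradient and of its square) that are created on the first `optimizer.step()`. Gradients also stay allocated between iterations unless they are set to `None`. The sketch below only does the arithmetic; `persistent_mb` is a made-up name.

```python
def persistent_mb(weights_mb, keep_grads=True, adam_buffers=2):
    """Estimate memory that stays allocated between training steps, in MB."""
    copies = 1                  # the weights themselves
    if keep_grads:
        copies += 1             # one .grad tensor per parameter
    copies += adam_buffers      # assumed: Adam's exp_avg and exp_avg_sq per parameter
    return weights_mb * copies


weights_mb = 168.154624         # allocated before the first step (from the post)
observed_mb = 673.896448        # allocated before the second step (from the post)
estimate = persistent_mb(weights_mb)
print(round(estimate, 1), "MB estimated vs", round(observed_mb, 1), "MB observed")
```

Under those assumptions the estimate lands within about 1.3 MB of the observed value, so the leftover may simply be gradients plus optimizer state rather than a leak.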
I read a lot of these kinds of posts and tried to debug more but I am at the limit of my knowledge. Is this standard behaviour and I just need to downsize my model or am I doing something wrong?
Any help is appreciated! Thank you.