Apex: memory gains in detectron2

Hi, I modified the detectron2 Mask X152 model for Apex training. Besides the standard “three lines of code” change in the trainer class, I’ve decorated the RoIAlign and deconv custom layers to force FP32 for them (a rough sketch of that decoration is below).
My main goal was to reduce memory usage, but unfortunately that didn’t happen. After reading the brief description of the recommended opt_level='O1', it isn’t clear to me whether I should expect memory savings at all. It seems that only the computations are done in FP16 (after casting), while the “storage” stays mostly FP32.
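
Roughly, the decoration I mean looks like this (a sketch only: amp.register_float_function is part of Apex’s documented API, but the concrete target shown here, torchvision.ops.roi_align, is an assumption and differs from the actual detectron2 layers I wrapped):

import torchvision
from apex import amp

# Sketch only: register the torchvision roi_align op so Apex amp casts its
# inputs to FP32. The target (torchvision.ops, 'roi_align') is an assumption;
# the registration has to happen before amp.initialize is called.
amp.register_float_function(torchvision.ops, 'roi_align')

# ... later, the usual initialization:
# model, optimizer = amp.initialize(model, optimizer, opt_level='O1')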

Should I try to adapt the code for opt_level='O2'?

You could still save memory with opt_level='O1', as activations might be stored in FP16, as shown in this dummy code snippet:

import torch
import torch.nn as nn
import torchvision.models as models

from apex import amp

use_amp = True

model = models.resnet50()
model.cuda()

# Print the input/output dtypes of the avgpool layer during the forward pass
model.avgpool.register_forward_hook(lambda m, x, y: print(x[0].type(), y[0].type()))
x = torch.randn(1, 3, 224, 224).cuda()

torch.cuda.synchronize()
print(torch.cuda.max_memory_allocated() / 1024**2)

if use_amp:
    model = amp.initialize(model, opt_level='O1')
output = model(x)

torch.cuda.synchronize()
print(torch.cuda.max_memory_allocated() / 1024**2)

That’s in addition to the potential speedup from Tensor Cores.

Thank you @ptrblck, I had thought about using a forward hook. Indeed, starting from 'O1' we see torch.cuda.HalfTensor, but…
Here are the memory results for O0, O1 and O2 on an RTX 2070 SUPER. How can we explain the higher memory usage with O1? (The same can be observed on Google Colab.)

O0
check if max_memory starts from zero:  0.0
memory after model and data are loaded:  98.30224609375
Selected optimization level O0:  Pure FP32 training.

Defaults for this optimization level are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O0
cast_model_type        : torch.float32
patch_torch_functions  : False
keep_batchnorm_fp32    : None
master_weights         : False
loss_scale             : 1.0
memory after forward:  181.4619140625
O1
check if max_memory starts from zero:  0.0
memory after model and data are loaded:  98.30224609375
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
memory after forward:  192.01171875
O2
check if max_memory starts from zero:  0.0
memory after model and data are loaded:  98.30224609375
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
memory after forward:  98.892578125

Modified code snippet:

import torch
import torch.nn as nn
import torchvision.models as models

from apex import amp

# Restrict cuDNN to deterministic algorithms, no benchmarking
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

opt_level = 'O1'
print(opt_level)

seed = 123
torch.manual_seed(seed)

torch.cuda.reset_max_memory_allocated()
print('check if max_memory starts from zero: ', torch.cuda.max_memory_allocated()/1024**2)

model = models.resnet50()
model.cuda()

#model.avgpool.register_forward_hook(lambda m, x, y: print(x[0].type(), y[0].type()))
x = torch.randn(1, 3, 224, 224).cuda()

torch.cuda.synchronize()
print('memory after model and data are loaded: ', torch.cuda.max_memory_allocated()/1024**2)

model = amp.initialize(model, opt_level=opt_level)
output = model(x)

torch.cuda.synchronize()
print('memory after forward: ', torch.cuda.max_memory_allocated()/1024**2)

Could you also check the allocated memory via torch.cuda.memory_allocated()?
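
For example, extending the final print in the snippet above (both calls are existing PyTorch APIs):

torch.cuda.synchronize()
print('current memory after forward: ', torch.cuda.memory_allocated()/1024**2)
print('max memory after forward:     ', torch.cuda.max_memory_allocated()/1024**2)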

@ptrblck, exactly the same, so the max equals the current value. This is with PyTorch ‘1.4.0+cu100’.

Thanks for the information.
Since you’ve set torch.backends.cudnn.deterministic = True, cuDNN is restricted to deterministic algorithms, and the selected algorithms can have different memory requirements for different data types; for this use case the requirement seems to be higher in O1 than in O0.
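
One way to check that hypothesis, as a sketch (it reuses model and x from your snippet and only flips the cuDNN flags; everything else stays the same):

# Let cuDNN benchmark and pick non-deterministic algorithms, then rerun the
# same forward pass and compare the peak memory with the numbers above.
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True

torch.cuda.reset_max_memory_allocated()
output = model(x)
torch.cuda.synchronize()
print('max memory with benchmark enabled: ', torch.cuda.max_memory_allocated()/1024**2)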

@soldierofhell
I am trying to use O2, but roialign gives me the following error:

File "/opt/conda/lib/python3.7/site-packages/torchvision/ops/roi_align.py", line 45, in roi_align
    sampling_ratio, aligned)
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type torch.cuda.HalfTensor does not equal torch.cuda.FloatTensor (while checking arguments for ROIAlign_forward_cuda)

Can you please guide me on how to resolve this error?

Thanks
Walid

I would recommend using the native mixed-precision utilities (torch.cuda.amp), which should support the current torchvision models with their custom extensions.
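
A minimal training-step sketch with the native API (torch.cuda.amp.autocast and GradScaler are the actual PyTorch utilities; the tiny model, optimizer, and data below are just placeholders to keep the example self-contained):

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Placeholder model, optimizer and data for illustration only
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
data = torch.randn(8, 10, device='cuda')
target = torch.randint(0, 2, (8,), device='cuda')

scaler = GradScaler()

optimizer.zero_grad()
with autocast():                      # forward pass runs in mixed precision
    output = model(data)
    loss = criterion(output, target)
scaler.scale(loss).backward()         # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)                # unscales gradients, skips the step on inf/nan
scaler.update()                       # adjust the scale factor for the next iteration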

Thanks a lot.
The native API gave me the following error:

detectron2/modeling/poolers.py", line 249, in forward
    output[inds] = pooler(x[level], pooler_fmt_boxes_level)
RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Half for the source.

Can you please help me resolve it?

Also, how can I use the O1 and O2 levels like we do in https://nvidia.github.io/apex/amp.html?

Walid

Native amp doesn’t support different opt_levels and is similar to O1.
Could you install the latest nightly binaries and check if the error is still raised, please?

Thanks, I will give it a try, but my main aim is O2. I already got O1 working with NVIDIA Apex.