1080 res training CUDA out of memory on V100

Hello,

Given my dataset and machine, I’d like to ask whether running out of memory is expected here, or whether I might have a programming issue.

RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)

I’m using an AWS p3.8xlarge (4 Tesla V100s) and trying to train a CycleGAN on 5055 images at 1024x1024 resolution.

I verified that a dataset resized to 512x512 trains with batch size 4, but at 1024x1024 even batch size 1 runs out of memory.

I think we need a p4d.24xlarge for this project, but it’s hard to get that instance due to lack of zone capacity.

Possible things to try:
- reduce the size of the dataset (though I think 5055 images is already small for training)
- look for a memory leak?

Any comments or hints are appreciated.

Below is the log for reference:

train.py --dataroot database/face2smile \
>   --model cycle_gan \
>   --log_dir logs/cycle_gan/face2smile/teacher_1080 \
>   --netG inception_9blocks \
>   --real_stat_A_path real_stat_1080/face2smile_A.npz \
>   --real_stat_B_path real_stat_1080/face2smile_B.npz \
>   --batch_size 1 \
>   --num_threads 1 \
>   --gpu_ids 0,1,2,3 \
>   --norm_affine \
>   --norm_affine_D \
>   --channels_reduction_factor 6 \
>   --kernel_sizes 1 3 5 \
>   --save_latest_freq 10000 --save_epoch_freq 5 \
>   --nepochs 1 --nepochs_decay 0 \
>   --preprocess none
----------------- Options ---------------
                active_fn: nn.ReLU                       
              active_fn_D: nn.LeakyReLU                  
             aspect_ratio: 1.0                           
               batch_size: 4                             	[default: 1]
                    beta1: 0.5                           
                 channels: None                          
channels_reduction_factor: 6                             	[default: 1]
          cityscapes_path: database/cityscapes-origin    
                crop_size: 256, 256                      
                 dataroot: database/face2smile           	[default: None]
             dataset_mode: unaligned                     
                direction: AtoB                          
          display_winsize: 256                           
                 drn_path: drn-d-105_ms_cityscapes.pth   
             dropout_rate: 0                             
               epoch_base: 1                             
          eval_batch_size: 1                             
                 gan_mode: lsgan                         
                  gpu_ids: 0,1,2,3                       	[default: 0]
                init_gain: 0.02                          
                init_type: normal                        
                 input_nc: 3                             
                  isTrain: True                          	[default: None]
                iter_base: 1                             
             kernel_sizes: [1, 3, 5]                     	[default: [3, 5, 7]]
                 lambda_A: 10.0                          
                 lambda_B: 10.0                          
          lambda_identity: 0.5                           
           load_in_memory: False                         
                load_size: 286                           
                  log_dir: logs/cycle_gan/face2smile/teacher_1080	[default: logs]
                       lr: 0.0002                        
           lr_decay_iters: 50                            
                lr_policy: linear                        
         max_dataset_size: -1                            
                    model: cycle_gan                     	[default: pix2pix]
     moving_average_decay: 0.0                           
moving_average_decay_adjust: False                         
moving_average_decay_base_batch: 32                            
               n_layers_D: 3                             
                      ndf: 64                            
                  nepochs: 1                             	[default: 100]
            nepochs_decay: 0                             	[default: 100]
                     netD: n_layers                      
                     netG: inception_9blocks             
                      ngf: 64                            
                  no_flip: False                         
                     norm: instance                      
              norm_affine: True                          	[default: False]
            norm_affine_D: True                          	[default: False]
             norm_epsilon: 1e-05                         
            norm_momentum: 0.1                           
             norm_student: instance                      
 norm_track_running_stats: False                         
              num_threads: 32                            	[default: 4]
                output_nc: 3                             
             padding_type: reflect                       
                    phase: train                         
                pool_size: 50                            
               preprocess: none                          	[default: resize_and_crop]
               print_freq: 100                           
         real_stat_A_path: real_stat_1080/face2smile_A.npz	[default: None]
         real_stat_B_path: real_stat_1080/face2smile_B.npz	[default: None]
         restore_D_A_path: None                          
         restore_D_B_path: None                          
         restore_G_A_path: None                          
         restore_G_B_path: None                          
           restore_O_path: None                          
          save_epoch_freq: 5                             	[default: 20]
         save_latest_freq: 10000                         	[default: 20000]
                     seed: 233                           
           serial_batches: False                         
               table_path: datasets/table.txt            
          tensorboard_dir: None                          
----------------- End -------------------
train.py --dataroot database/face2smile --model cycle_gan --log_dir logs/cycle_gan/face2smile/teacher_1080 --netG inception_9blocks --real_stat_A_path real_stat_1080/face2smile_A.npz --real_stat_B_path real_stat_1080/face2smile_B.npz --batch_size 4 --num_threads 32 --gpu_ids 0,1,2,3 --norm_affine --norm_affine_D --channels_reduction_factor 6 --kernel_sizes 1 3 5 --save_latest_freq 10000 --save_epoch_freq 5 --nepochs 1 --nepochs_decay 0 --preprocess none
dataset [UnalignedDataset] was created
The number of training images = 5055
data shape is: channel=3, height=1024, width=1024.
initialize network with normal
initialize network with normal
initialize network with normal
initialize network with normal
dataset [SingleDataset] was created
dataset [SingleDataset] was created
/home/ubuntu/.local/lib/python3.9/site-packages/torchvision/models/inception.py:80: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  warnings.warn('The default weight initialization of inception_v3 will be changed in future releases of '
model [CycleGANModel] was created
---------- Networks initialized -------------
DataParallel(
  (module): InceptionGenerator(
    (down_sampling): Sequential(
      (0): ReflectionPad2d((3, 3, 3, 3))
      (1): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1))
      (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (5): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (8): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (9): ReLU(inplace=True)
    )
    (features): Sequential(
      (0): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (1): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (2): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (3): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (4): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (5): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (6): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (7): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (8): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
    )
    (up_sampling): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (4): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (5): ReLU(inplace=True)
      (6): ReflectionPad2d((3, 3, 3, 3))
      (7): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1))
      (8): Tanh()
    )
  )
)
[Network G_A] Total number of parameters : 8.154 M
DataParallel(
  (module): InceptionGenerator(
    (down_sampling): Sequential(
      (0): ReflectionPad2d((3, 3, 3, 3))
      (1): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1))
      (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (5): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (8): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (9): ReLU(inplace=True)
    )
    (features): Sequential(
      (0): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (1): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (2): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (3): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (4): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (5): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (6): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (7): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (8): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
    )
    (up_sampling): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (4): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (5): ReLU(inplace=True)
      (6): ReflectionPad2d((3, 3, 3, 3))
      (7): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1))
      (8): Tanh()
    )
  )
)
[Network G_B] Total number of parameters : 8.154 M
DataParallel(
  (module): NLayerDiscriminator(
    (model): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
      (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (4): LeakyReLU(negative_slope=0.2, inplace=True)
      (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (7): LeakyReLU(negative_slope=0.2, inplace=True)
      (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (10): LeakyReLU(negative_slope=0.2, inplace=True)
      (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    )
  )
)
[Network D_A] Total number of parameters : 2.767 M
DataParallel(
  (module): NLayerDiscriminator(
    (model): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
      (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (4): LeakyReLU(negative_slope=0.2, inplace=True)
      (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (7): LeakyReLU(negative_slope=0.2, inplace=True)
      (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (10): LeakyReLU(negative_slope=0.2, inplace=True)
      (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    )
  )
)
[Network D_B] Total number of parameters : 2.767 M
-----------------------------------------------
start_epoch: 1
end_epoch: 1
total_iter: 1
current memory allocated: 265.4296875
max memory allocated: 265.4296875
cached memory: 276.0
will set input data
Traceback (most recent call last):
  File "/data/CAT/train.py", line 14, in <module>
    trainer.start()
  File "/data/CAT/trainer.py", line 159, in start
    model.optimize_parameters(total_iter)
  File "/data/CAT/models/cycle_gan_model.py", line 295, in optimize_parameters
    self.forward()
  File "/data/CAT/models/cycle_gan_model.py", line 235, in forward
    self.rec_A = self.netG_B(self.fake_B)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/CAT/models/modules/inception_architecture/inception_generator.py", line 141, in forward
    res = self.up_sampling(res)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/padding.py", line 173, in forward
    return F.pad(input, self.padding, 'reflect')
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 4014, in _pad
    return torch._C._nn.reflection_pad2d(input, pad)
RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)

Based on your description, the OOM might be expected, since going from 512x512 to 1024x1024 increases the number of pixels by 4x. Did you check the peak memory usage of the working run with the 512x512 images?
Based on this value, you could estimate the memory usage when the number of pixels is increased by 4x. Note that the model parameters will not grow, but the intermediate activations will.
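
As a rough back-of-envelope sketch of that estimate (both MiB values below are made-up placeholders, not measurements from this setup; plug in the peak you actually measure at 512x512 and your own estimate of the resolution-independent memory):

# Sketch: activation memory scales roughly linearly with the number of
# pixels, while parameter/optimizer memory stays constant.
peak_512 = 4000.0      # hypothetical measured peak at 512x512, in MiB
static_mem = 500.0     # hypothetical parameters + optimizer state, in MiB
pixel_ratio = (1024 * 1024) / (512 * 512)   # = 4
est_peak_1024 = static_mem + (peak_512 - static_mem) * pixel_ratio
print(f"estimated peak at 1024x1024: ~{est_peak_1024:.0f} MiB")

This ignores allocator fragmentation and cuDNN workspace sizes, so treat it as a lower bound rather than an exact prediction.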

No, reducing the number of images will not change the memory usage, since only a single minibatch is loaded onto the GPU at a time.

If you are concerned about a memory leak, I would expect the working 512x512 use case to also run OOM eventually. Is that the case?
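
A minimal sketch of how such a leak check could look in the 512x512 run; dataloader, model, set_input, and optimize_parameters here stand in for your own training loop, not a confirmed API of this repo:

import torch

for i, data in enumerate(dataloader):   # your existing training loop
    model.set_input(data)               # placeholder calls
    model.optimize_parameters(i)
    if i % 100 == 0:
        # Allocated memory should plateau after the first iterations;
        # steady growth across epochs would point to a leak, e.g. storing
        # loss tensors that still hold the graph in a Python list.
        mib = torch.cuda.memory_allocated() / 1024**2
        print(f"iter {i}: {mib:.1f} MiB allocated")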


Hello @ptrblck,

Thank you for such a quick response.
Yeah, since 512x512 has only a quarter of the pixels it’s not a direct comparison, but running the 512x512 model with batch size 4 and 32 threads works fine without any memory issue.

Could you let me know how to check peak memory during 512x512 training?

You could add some debug print statements, e.g. via print(torch.cuda.memory_allocated()) or print(torch.cuda.memory_summary()), to your training code and check how much memory is used by PyTorch. Alternatively, you could start the training and monitor nvidia-smi -l in another terminal to estimate the peak usage.
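
For example, a few lines like these (the exact placement in your training loop is up to you):

import torch

print(f"{torch.cuda.memory_allocated() / 1024**2:.1f} MiB currently allocated")
print(f"{torch.cuda.max_memory_allocated() / 1024**2:.1f} MiB peak allocated")
print(torch.cuda.memory_summary())   # detailed allocator breakdown

# or, in a second terminal, refresh nvidia-smi every second:
#   nvidia-smi -l 1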
