GPU Memory Issue

Hi everybody! How are you?
I’m having trouble trying to run a model on my GPU. The moment I load my model onto the GPU, its memory fills to 95%.
I tried the same model on 2 different GPUs (GTX 1050 and RTX 2070) and both show the same issue.
The model I built is a CNN similar to VGG16.
The weird part is that, even though the 2 GPUs have different capacities (2 GB and 8 GB), both seem to fill to the same percentage.
Could this be related to the CUDA driver I have installed? Or what else could it be? Does anyone have the same issue?
I will appreciate any comments! Thanks a lot!

Could you briefly show how you are loading the model? Or share some numbers?

This is the structure of my model:

class ConvNet(nn.Module):
    
    #Definition of layers and parameters
    def __init__(self):
        super(ConvNet, self).__init__()
        
        #Input Layer (CNN)
        self.inp_layer = nn.Sequential(
            nn.Conv2d(in_channels = 3,
                      out_channels = 64,
                      kernel_size = 5,
                      stride = 1,
                      padding = 2),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )
        
        #Hidden Layer 1 (CNN)
        self.hid_layer_1 = nn.Sequential(
            nn.Conv2d(in_channels = 64,
                      out_channels = 128,
                      kernel_size = 5,
                      stride = 1,
                      padding = 2),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2,
                         stride = 2)
        )
        
        #Hidden Layer 2 (CNN)
        self.hid_layer_2 = nn.Sequential(
            nn.Conv2d(in_channels = 128,
                      out_channels = 256,
                      kernel_size = 5,
                      stride = 1,
                      padding = 2),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2,
                         stride = 2)
        )
        
        #Hidden Layer 3 (CNN)
        self.hid_layer_3 = nn.Sequential(
            nn.Conv2d(in_channels = 256,
                      out_channels = 512,
                      kernel_size = 5,
                      stride = 1,
                      padding = 2),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2,
                         stride = 2)
        )
        
        #Hidden Layer 4 (CNN)
        self.hid_layer_4 = nn.Sequential(
            nn.Conv2d(in_channels = 512,
                      out_channels = 512,
                      kernel_size = 5,
                      stride = 1,
                      padding = 2),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2,
                         stride = 2)
        )
        
        #Hidden Layer 5 (CNN)
        self.hid_layer_5 = nn.Sequential(
            nn.Conv2d(in_channels = 512,
                      out_channels = 512,
                      kernel_size = 5,
                      stride = 1,
                      padding = 2),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2,
                         stride = 2)
        )
        
        #Hidden Layer 6 (FCL)
        self.hid_layer_6 = nn.Sequential(
            nn.Linear(in_features = 7*7*512,
                      out_features = 7*7*512),
            nn.ReLU()
        )
        
        #Hidden Layer 7 (FCL)
        self.hid_layer_7 = nn.Sequential(
            nn.Linear(in_features = 7*7*512,
                      out_features = 1000),
            nn.ReLU()
        )
        
        #Hidden Layer 8 (FCL)
        self.hid_layer_8 = nn.Sequential(
            nn.Linear(in_features = 1000,
                      out_features = 1000),
            nn.ReLU()
        )
        
        #Output Layer (FCL)
        self.out_layer = nn.Sequential(
            nn.Linear(in_features = 1000,
                      out_features = 120),
            nn.Softmax(dim = 1)
        )
        
    def forward(self, x):
        
        #First Convolutional (ReLU)
        out = self.inp_layer(x)
        
        #Second Convolutional (ReLU)
        out = self.hid_layer_1(out)
        
        #Third Convolutional (ReLU)
        out = self.hid_layer_2(out)
        
        #Fourth Convolutional (ReLU)
        out = self.hid_layer_3(out)
        
        #Fifth Convolutional (ReLU)
        out = self.hid_layer_4(out)
        
        #Sixth Convolutional (ReLU)
        out = self.hid_layer_5(out)
        
        #Size of flatten input
        size = out.size()[1] * out.size()[2] * out.size()[3]
        
        #Flatten
        out = out.view(-1, size)
        
        #First Fully Connected Layer (ReLU)
        out = self.hid_layer_6(out)
        
        #Second Fully Connected Layer (ReLU)
        out = self.hid_layer_7(out)
        
        #Third Fully Connected Layer (ReLU)
        out = self.hid_layer_8(out)
        
        #Fourth Fully Connected Layer (SoftMax)
        out = self.out_layer(out)
        
        #Return
        return out
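
If I count the parameters per block, most of them turn out to sit in the first fully connected layer: the 7*7*512 → 7*7*512 linear alone holds about 629 million weights, which is roughly 2.3 GiB in FP32 before gradients or optimizer state. A quick sketch of how I count them (just a sanity check, not part of the training code):

#Parameter count per block
model = ConvNet()
for name, module in model.named_children():
    print(name, ':', sum(p.numel() for p in module.parameters()))
print('Total :', sum(p.numel() for p in model.parameters()))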

To check the memory usage of my GPU I run the following code:

model = ConvNet().cuda(device = 0)
#Ratio of memory currently allocated by tensors to memory held in PyTorch's caching allocator
print('GPU storage used = ' + str(float(torch.cuda.memory_allocated() / torch.cuda.memory_cached()) * 100) + '%')

#GPU storage used = 99.97678608366593%
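
For comparison, here is a rough check against the card's total memory (a minimal sketch, assuming GPU 0 is the device in use):

import torch

total = torch.cuda.get_device_properties(0).total_memory   #total memory of GPU 0 in bytes
allocated = torch.cuda.memory_allocated(0)                 #memory currently held by tensors
print('Allocated = ' + str(allocated / total * 100) + '% of total')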

This is the error that appears when I try to train my model:

RuntimeError: CUDA out of memory. Tried to allocate 2.34 GiB (GPU 0; 7.76 GiB total capacity; 4.21 GiB already allocated; 1.61 GiB free; 978.56 MiB cached)

Lower your batch size maybe? I ran a forward pass on your model here on my 2080 (8GB) with a batch size of 16. Since you have a Turing GPU, you could also use mixed precision training to reduce your model’s footprint.
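
For reference, the quick test I ran looked roughly like this (a sketch, assuming 224x224 RGB inputs, which is what the 7*7*512 flatten in your model implies):

import torch

model = ConvNet().cuda()
dummy = torch.randn(16, 3, 224, 224, device = 'cuda')   #batch of 16 dummy images

with torch.no_grad():
    out = model(dummy)                                   #forward pass only, no gradients stored

print(str(torch.cuda.memory_allocated() / 1024 ** 2) + ' MiB allocated')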

Hi there!
It took me some time to learn about mixed precision. I ended up using the AMP module from the Apex library to do it, and now my model can be trained on the RTX 2070 GPU, but I still have some issues with the GTX 1050, which shows an error about insufficient memory.
Basically, what I did was load my model onto the GPU with .half() appended at the end to cast it to FP16:

model = ConvNet().cuda().half()
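
One thing I had to watch out for: once the model weights are FP16, the input batches also have to be cast to half, otherwise the convolutions complain about mismatched tensor types. Roughly, with images and labels standing in for whatever the DataLoader yields:

images = images.cuda().half()   #inputs must match the FP16 weights
labels = labels.cuda()          #targets stay as they are for the loss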

Then I added the following code to the part of my training process where the loss is calculated and the optimizer step is taken:

Before:

loss_size = loss(outputs, labels)
loss_size.backward()
optimizer.step()

After:

loss_size = loss(outputs, labels)
with amp_handle.scale_loss(loss_size, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
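
For completeness, amp_handle comes from the (older) Apex Amp API and is created once before the training loop. From memory it was something like the following, so treat the exact call as an assumption and check the Apex docs:

from apex import amp

#Create the Amp handle once, before the training loop (older Apex Amp API)
amp_handle = amp.init()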

By applying all these changes I was able to train my ConvNet model for 5 epochs with a training batch size of 8. This took about 50 min (10 min per epoch).

The performance of my model was bad, and now I will start checking different ways to make it better.

The doubts that I have now are:

Could this mixed precision process affect the performance of my model?
And in that case, how can it be fixed without increasing the memory usage of the GPU?
Are there any other considerations or ways to improve the performance of the training process? I chose the Apex library, but I'm sure other ways exist to do so.

The information about the mixed precision process for PyTorch using the Apex library is available in this NVIDIA post: Link.

Thank you all for the answers! I would like to keep going deeper into this subject and find new and better ways to improve the performance of the training process.

Sorry for the late response.

First, the mixed precision speedup only works on Volta (Titan V, GV100) and Turing (RTX 20xx) architectures. Your GTX 1050 will not benefit from this and will still use FP32 computation underneath.

Second, mixed precision can affect your performance. However, based on results I’ve seen, the performance difference will be minimal when trained right.

If your performance is bad then it could be due to several reasons, and I’d try to converge using FP32 first, unless you have a very strong reason to believe that your existing code must converge.
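
As for other ways to do it: newer PyTorch releases ship a native torch.cuda.amp module, so Apex is not the only option. A minimal sketch of a training step with it, assuming PyTorch 1.6+ and a model kept in FP32 (no .half()); train_loader stands in for your DataLoader:

import torch

scaler = torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images, labels = images.cuda(), labels.cuda()

    optimizer.zero_grad()

    with torch.cuda.amp.autocast():          #runs ops in FP16 where it is safe
        outputs = model(images)
        loss_size = loss(outputs, labels)

    scaler.scale(loss_size).backward()       #scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()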

Thanks for the answer!
Yes, I could use it just fine on the RTX 2070, but I didn't know about the limitation for the GTX.