Out of Memory issue with multi GPUs

antsthebul · April 7, 2023, 8:22pm

I am new to ML, deep learning, and PyTorch. I am not sure why, but changing my batch size and image size has no effect whatsoever on the allocated memory:

Tried to allocate 25.15 GiB (GPU 1; 47.54 GiB total capacity; 25.15 GiB already allocated; 21.61 GiB free; 25.16 GiB reserve

I am using 2 x A6000 GPUs. I made a gist of the code, but if preferred I can post it here: Simple Distrbuted CNN · GitHub. The goal is to create a simple CNN that can detect the illuminated light on traffic lights. I feel like the training data is not being split across the GPUs, and instead all of the data is being trained on both GPUs simultaneously. Can anyone provide assistance? I am open to criticism if you have any tips on the layout or functionality of the code.

Also, I feel like there could be a general rule of thumb (or at least a 'general' range) that an image size of (x, x) and a CNN with a certain number of conv layers (or number of parameters) will require at least X GB. What I've noticed before is that when I trained a Magic the Gathering card classifier with a CNN (distributed with the same code, and it works), I could never seem to build more than 3 layers before running out of memory, following the 32/64 conv layer input/output format and an image size of 72x72, which I know is extremely small.

I've used the code from the following links:

Learning how to use DDP: Distributed Data Parallel in PyTorch - Video Tutorials — PyTorch Tutorials 2.1.1+cu121 documentation
Optimizing memory/speed (Towards Data Science: Optimize PyTorch Performance for Speed and Memory Efficiency (2022))

eqy · April 8, 2023, 5:30am

I’m seeing a few things that seem unusual. Are you trying to train separate models or a single model? For the single model case I would expect that you would want to use DDP/Distributed Data Parallel in conjunction with the distributed sampler.
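To make the role of the distributed sampler concrete, here is a minimal pure-Python sketch of the sharding idea: each process (rank) iterates over a disjoint slice of the dataset indices instead of the full dataset. This only illustrates the concept and is not PyTorch's implementation; `shard_indices` is a hypothetical helper, and shuffling is left out for clarity.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices a single rank would iterate over (no shuffling).

    Hypothetical helper that mimics the idea of a distributed sampler.
    """
    # Every rank must see the same number of samples, so round up and pad.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by wrapping around to the start of the dataset.
    indices += indices[: total_size - dataset_len]
    # Rank r takes every num_replicas-th index, starting at offset r.
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    for rank in range(2):
        print(rank, shard_indices(10, 2, rank))
```

In a real training script, `torch.utils.data.distributed.DistributedSampler` would be passed as the `sampler` argument of the `DataLoader` in each process, with `set_epoch` called at the start of every epoch so that shuffling differs between epochs. Without a sampler like this, every process loads the whole dataset.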

For a reference implementation using the components I would recommend checking out the ImageNet example:

github.com

pytorch/examples/blob/main/imagenet/main.py

import argparse
import os
import random
import shutil
import time
import warnings
from enum import Enum

import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.parallel
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms

This file has been truncated.

Additionally the model used looks highly unusual, in particular:

gist.github.com

https://gist.github.com/Antsthebul/f1d970cd19a965ac58d39572c9f370d8#file-light_model-py-L29

gpu_multi.py

import csv
import datetime
import os
import time

import pandas as pd
import albumentations as A
from pathlib import Path

from dotenv import load_dotenv

This file has been truncated.

light_dataset.py

import cv2
from torch.utils.data.dataset import Dataset
from config import *


class LightDataset(Dataset):
  def __init__(self, image_data, labels, transform=None):
    self.image_paths = image_data.ravel() 
    self.y_labels = labels.ravel()
    self.transform = transform

This file has been truncated.

light_model.py

import torch
import torch.nn as nn 
from config import *


def conv_block(in_channel, out_channel, kernel=5):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_channel, out_channels=out_channel, kernel_size=kernel, padding="same", bias=False),
        nn.ReLU()  , # 178, 128, CONV_1_OUTPUT (32)     
        nn.MaxPool2d(kernel_size=2, stride=1), # 177, 127 , 32

This file has been truncated.

There are more than three files.

which is an unusually large linear layer for a classification model.
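As a rough illustration of why an oversized linear layer matters, the weight matrix of a dense layer needs `in_features * out_features` values, each 4 bytes in float32. The sketch below uses the 177 x 127 x 32 feature map noted in the gist's comments as the flattened input; the 1024 output units are an assumption made for illustration, since the actual layer size is not shown here.

```python
def linear_weight_gib(in_features, out_features, bytes_per_param=4):
    """Size of a dense layer's weight matrix in GiB (float32 by default)."""
    return in_features * out_features * bytes_per_param / 2**30


def gib_to_float32_count(gib):
    """Number of float32 values that fit in the given number of GiB."""
    return gib * 2**30 / 4


if __name__ == "__main__":
    # 32 channels of 177 x 127, flattened (shape taken from the gist comments).
    flat = 32 * 177 * 127
    # 1024 output units is a hypothetical choice, not the gist's value.
    print(f"weights alone: {linear_weight_gib(flat, 1024):.2f} GiB")
    # The failed 25.15 GiB allocation, expressed as a float32 element count.
    print(f"25.15 GiB holds {gib_to_float32_count(25.15):.3e} float32 values")
```

With these numbers the weights alone come to about 2.74 GiB. Gradients need the same amount again, and an optimizer such as Adam keeps two more buffers of that size, so the real footprint is a multiple of this figure.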

You may want to take a look at the implementations for some standard torchvision models (Models and pre-trained weights — Torchvision 0.15 documentation) to get a feel for typical CNN classification architectures.

antsthebul · April 8, 2023, 2:00pm

I'm trying to train a single model. I was trying to build a model resembling machine-learning-book/ch14_part2.ipynb at main · rasbt/machine-learning-book · GitHub, which uses 4 convolution layers to train a classifier. I see in the example that the linear layer is not as large as the one in the model I created. I switched to using batchnorm instead of dropout, but I'll see why the linear layer is so large and go from there. Thanks for taking a look.

antsthebul · April 8, 2023, 9:46pm

Thanks @eqy, my FC layer was waayy too large, as mentioned. I just reduced the size to some arbitrarily low value and I was able to get the model running again.

The stride parameter I had set in the MaxPool was creating an absurd number of parameters. I will look more into how to use this layer correctly.
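For anyone hitting the same thing, here is a small sketch of the arithmetic behind that, assuming the standard pooling output formula `floor((size + 2*padding - kernel) / stride) + 1`. The 178 x 128 input comes from the comments in the gist; the 3 blocks and 64 output channels are assumptions chosen only for illustration, and `pool_out` and `flattened_features` are hypothetical helpers.

```python
def pool_out(size, kernel=2, stride=2, padding=0):
    """Output length along one dimension for a max-pool (floor mode, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(h, w, channels, num_blocks, stride):
    """Flattened feature count after num_blocks of 'same'-padded conv + 2x2 max-pool.

    The 'same'-padded conv keeps h and w unchanged, so only the pool shrinks them.
    """
    for _ in range(num_blocks):
        h = pool_out(h, stride=stride)
        w = pool_out(w, stride=stride)
    return channels * h * w


if __name__ == "__main__":
    # stride=1 shrinks each dimension by only 1 per block.
    print("stride=1:", flattened_features(178, 128, 64, 3, stride=1))
    # stride=2 roughly halves each dimension per block.
    print("stride=2:", flattened_features(178, 128, 64, 3, stride=2))
```

Under these assumptions the flattened input to the first linear layer is 1,400,000 features with `stride=1` versus 22,528 with `stride=2`, roughly a 62x difference before multiplying by the layer's output width. Leaving `stride` unset in `nn.MaxPool2d` makes it default to `kernel_size`, which gives the halving behavior.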