Why does PyTorch use ten to a hundred times more GPU memory than Keras?

Why do PyTorch tensors use so much more GPU memory than Keras?
The training dataset should be no more than 300 MB, but when I load it as a CUDA tensor wrapped in a Variable with requires_grad=False, it takes up 8 GB of GPU memory. With Keras, GPU memory usage does not go up at all; Keras seems to keep the data in RAM instead of GPU memory.

This issue puzzles me a lot. A single GPU has at most 24 GB of memory. Even if I set the training window as small as possible, the training dataset uses half of those 24 GB, and once training begins, the intermediate variables use the other half…

Has anyone run into the same issue? Thank you for your help!

Here’s the data handling part:

import numpy as np
import torch
from torch.autograd import Variable

class Data_utility(object):
    def __init__(self, dt_x, dt_y, train, valid, cuda, horizon, window, normalize=2, shuffle=True, only_test=False):
        self.cuda = cuda
        self.P = window
        self.h = horizon
        self.normalize = normalize
        self.shuffle = shuffle
        self.only_test = only_test

        self.rawdat_x, self.rawdat_y = np.array(dt_x), np.array(dt_y)
        self.dat_x = np.zeros(self.rawdat_x.shape)
        self.dat_y = np.zeros(self.rawdat_y.shape)

        self.df, self.n, self.m = self.dat_x.shape
        self.ny, self.my = self.dat_y.shape
        if not (self.ny == self.n and self.my == self.m): raise ValueError('X AND Y SHAPE DO NOT MATCH.')
        self.scale_x = np.ones_like(self.dat_x)
        self.scale_y = np.ones_like(self.dat_y)

        self._normalized()
        self._split(int(train), int(train+valid), self.n)

    def _normalized(self):
        self.scale_y = np.repeat(np.nanmax(np.abs(self.rawdat_y), axis=1, keepdims=True), self.rawdat_y.shape[1], axis=1)
        self.dat_y = np.where(self.scale_y==0., self.rawdat_y, self.rawdat_y/self.scale_y)
        if (self.normalize == 0):
            self.dat_x = self.rawdat_x
        if (self.normalize == 1):
            for f in range(self.df):
                self.scale_x[f, :, :] = np.nanmax(np.abs(self.rawdat_x[f, :, :]))
            self.dat_x = np.where(self.scale_x==0., self.rawdat_x, self.rawdat_x/self.scale_x)
        if (self.normalize == 2):
            for f in range(self.df):
                std_x = np.nanstd(self.rawdat_x[f, :, :])
                mean_x = np.nanmean(self.rawdat_x[f, :, :])
                if std_x == 0.: self.dat_x[f, :, :] = (self.rawdat_x[f, :, :]-mean_x)
                else: self.dat_x[f, :, :] = (self.rawdat_x[f, :, :]-mean_x)/std_x

    def _split(self, train, valid, test):
        if self.only_test:
            test_set = range(valid, self.n)
            self.test = self._batchify(test_set)
        else:
            train_set = range(self.P+self.h-1, train)
            valid_set = range(train, valid)
            test_set = range(valid, self.n)
            self.train = self._batchify(train_set)
            self.valid = self._batchify(valid_set)
            self.test = self._batchify(test_set)

    def _batchify(self, idx_set):
        n = len(idx_set)
        if n==0:
            X = torch.zeros((1,self.df,self.P,self.m))
            Y = torch.zeros((1,self.m))
            X[0,:,:,:] = torch.from_numpy(self.dat_x[:,-self.P:,:])
            return [X, Y]
        else:
            X = torch.zeros((n,self.df,self.P,self.m))
            Y = torch.zeros((n,self.m), dtype=torch.float)
            for i in range(n):
                end = idx_set[i] - self.h + 1
                start = end - self.P
                X[i,:,:,:] = torch.from_numpy(self.dat_x[:,start:end,:])
                Y[i,:] = torch.from_numpy(self.dat_y[idx_set[i], :])
            return [X, Y]

    def get_batches(self, inputs, targets, batch_size, shuffle=True):
        length = len(inputs)
        if shuffle: index = torch.randperm(length)
        else: index = torch.LongTensor(range(length))
        start_idx = 0
        while (start_idx < length):
            end_idx = min(length, start_idx + batch_size)
            excerpt = index[start_idx:end_idx]
            X = inputs[excerpt]
            Y = targets[excerpt]
            if (self.cuda):
                X = X.cuda()
                Y = Y.cuda()
            yield Variable(X, requires_grad = False), Variable(Y, requires_grad = False)
            start_idx += batch_size

Training part:

import torch.nn as nn

def train(data, X, Y, model, criterion, batch_size, optim, max_grad_norm):
    model.train()
    total_loss = 0
    n_samples = 0
    for X, Y in data.get_batches(X, Y, batch_size, data.shuffle):
        model.zero_grad()
        output = model(X)
        loss = criterion(output, Y)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optim.step()
        total_loss += loss.item()
        n_samples += 1

    return total_loss/n_samples

Last, the main part:

    temp_dat_x = fill_nan(temp_dat_x)
    temp_dat_y = fill_nan(temp_dat_y)
    
    if self.pos==0:
        '''build data'''
        Data = Data_utility(temp_dat_x, temp_dat_y, self.args.train_len, self.args.valid_len, self.args.cuda, self.args.horizon, self.args.window, self.args.normalize, self.args.shuffle)

        model = Model(self.args, Data)
        model.apply(weights_init)
        if self.args.cuda: model.cuda()

        criterion = makeLossFunction(self.args.loss)
        evaluateL1 = nn.L1Loss(reduction='mean')
        evaluateL2 = nn.MSELoss(reduction='mean')
        if self.args.cuda:
            criterion = criterion.cuda()
            evaluateL1 = evaluateL1.cuda()
            evaluateL2 = evaluateL2.cuda()

        best_val = 1e8
        optimizer = makeOptimizer(self.args.optim, self.args.lr, model)
        scheduler = makeScheduler(optimizer, self.args.skdlr)

        try:
            count = 0
            for epoch in range(self.args.epochs):
                if count >= 10:
                    if self.model == []: self.model = model
                    print('valid loss stays at', best_val, ', early stopping at epoch', epoch)
                    break
                epoch_start_time = int(round(time.time()*1000))
                train_loss = train(Data, Data.train[0], Data.train[1], model, criterion, self.args.batch_size, optimizer, self.args.clip)
                val_loss, val_rae, val_rse, val_corr = evaluate(Data, Data.valid[0], Data.valid[1], model, criterion, evaluateL1, evaluateL2, self.args.batch_size)
                scheduler.step(val_loss)
                '''save the best model'''
                if val_loss < best_val:
                    self.model = model
                    best_val = val_loss
                    count = 0
                else: count += 1

I found that in the data handling part:

        if (self.cuda):
            X = X.cuda()
            Y = Y.cuda()
        yield Variable(X, requires_grad = False), Variable(Y, requires_grad = False)

puts the whole training dataset onto the GPU, including all batches. The typical X shape would be

torch.Size([32, 500, 60, 2130])
32: batch_size
500: feature_num (since the first step is always a CNN)
60: time_window
2130: id_num
time_window * id_num can be seen as a single image, so the setup can be simplified to something like ImageNet.

which may take up a lot of GPU memory, and even more is needed for the intermediate variables kept for back propagation while the model is training. How can I modify this part so that the training dataset stays in RAM and, during training, only a single batch is transferred from RAM to the GPU?
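A rough back-of-the-envelope estimate (just my own arithmetic) of what a single float32 batch of that shape costs:

# memory for one float32 batch of shape [32, 500, 60, 2130], 4 bytes per element
elements = 32 * 500 * 60 * 2130      # 2,044,800,000 elements
bytes_total = elements * 4           # about 8.2e9 bytes
print(bytes_total / 1024**3)         # roughly 7.6 GiB for a single batch

So even one batch of this shape is a sizable fraction of a 24 GB GPU before any intermediate activations are allocated.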

Please, could someone help me with this problem?

I think you’ve identified the problem: X = X.cuda() is what sends all your data to the GPU.

I’m not sure if you’re aware of this, but PyTorch already provides torch.utils.data.Dataset and torch.utils.data.DataLoader classes, which you have basically reimplemented.

The idea is that you load one batch at a time into RAM and then send each batch to your GPU device with batch.cuda() or batch.to("cuda").
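A minimal sketch of that pattern (names like X_train and Y_train are just placeholders for the CPU tensors your _batchify already builds):

import torch
from torch.utils.data import TensorDataset, DataLoader

# X_train, Y_train: the CPU tensors returned by _batchify, e.g. Data.train
train_ds = TensorDataset(X_train, Y_train)            # stays in host RAM
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

device = torch.device("cuda")
for batch_x, batch_y in train_loader:
    batch_x = batch_x.to(device)                      # only this batch lives on the GPU
    batch_y = batch_y.to(device)
    # forward pass, loss, backward, optimizer step go here

Nothing is moved to the GPU until .to(device) is called on a batch, so GPU memory only ever holds one batch of inputs (plus the model and its activations).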

Hi, thank you for the reminder. That’s exactly what I was wondering about. I went through the tutorial for torch.utils.data.Dataset and torch.utils.data.DataLoader, but I’m not quite sure how a single mini-batch gets supplied to the model that way.

I mean for X with torch.Size([32, 500, 60, 2130]),

for step, (batch_x, batch_y) in enumerate(dataloader):

is batch_x of shape torch.Size([1, 500, 60, 2130])? And do I have to send it to the GPU manually in the training part?

By the way, why doesn’t PyTorch do what Keras does: automatically keep the batches in RAM and send a single batch at a time to the GPU? Since GPU memory is the obvious bottleneck, nobody can fit a large dataset entirely in GPU memory. Why does PyTorch leave this to the user? Sorry if my question is too silly; I’m new to PyTorch.

It will be of shape torch.Size([batch_size, 500, 60, 2130]), where batch_size is whatever you specify in your DataLoader.

And yes, you have to send it to the GPU manually for each batch (unless you want to load the entire dataset into GPU memory up front).

I can’t really comment on the design decisions behind PyTorch, but in my opinion, while there is more boilerplate code, this also gives you greater flexibility.
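One optional tweak, in case the per-batch copies to the GPU ever become the bottleneck: the DataLoader can pin host memory so the transfers can overlap with compute (reusing train_ds and device from the sketch above):

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,
                          pin_memory=True)             # page-locked host memory
for batch_x, batch_y in train_loader:
    batch_x = batch_x.to(device, non_blocking=True)    # asynchronous copy from pinned RAM
    batch_y = batch_y.to(device, non_blocking=True)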

Thank you for your reply!
It seems that, when batch_x has shape torch.Size([batch_size, 500, 60, 2130]) with batch_size specified, torch.utils.data.DataLoader does essentially what my Data_utility does.

The question is whether PyTorch can send a single mini-batch (shape torch.Size([1, 500, 60, 2130])) to the GPU at a time, to save GPU memory. I saw a dramatic increase in GPU memory usage with PyTorch compared to Keras, even though the training data and models are comparable.

Maybe this code snippet will help you with loading the dataset and then training on only one batch of data at a time:

# load data
train_set = MovingMNIST(root='./data/mnist', train=True, download=True)

train_loader = DataLoader(
    dataset=train_set,
    batch_size=config.batch_size,
    shuffle=True)

model = …
optimizer = …
loss = …

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# each iteration yields one batch; only that batch is moved to the GPU
for step, (batch_inputs, batch_targets) in enumerate(train_loader):
    batch_inputs = batch_inputs.to(device).float()
    batch_targets = batch_targets.to(device).float()
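Note that one pass over train_loader is a single epoch; for several epochs you would usually wrap it in an outer loop, roughly like this (num_epochs is just a placeholder):

for epoch in range(num_epochs):
    for step, (batch_inputs, batch_targets) in enumerate(train_loader):
        batch_inputs = batch_inputs.to(device).float()
        batch_targets = batch_targets.to(device).float()
        # forward pass, loss, backward, optimizer.step() go here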