Softmax argument deprecation and OOM

Hi, I am trying to tweak a model (I am not a power user, sorry) that uses forward propagation, and I get a deprecation warning about the arguments of the softmax call, as well as a GPU OOM at the same point in the code. Can somebody give me a hint on how to introduce the dim=X argument in the softmax call? Maybe that is also the reason for the OOM… Thanks

UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.

Warning line: alpha_i = F.softmax(self.U(x_i).t())

######### Method code ##########

def forward(self, x, target):
    # get embeddings and apply dropout
    x = self.embed(x)
    x = self.embed_drop(x)
    x = x.transpose(1, 2)

    # apply convolution and nonlinearity (tanh)
    x = F.tanh(self.conv(x).transpose(1,2))
    y = []
    ss = []
    alphas = []
    for x_i in x:
        # apply attention
        
        # HERE COMES THE WARNING!!!!!!!!!!!!!!!

        alpha_i = F.softmax(self.U(x_i).t())

        # document representations are weighted sums using the attention. Can compute all at once as a matmul
        m_i = alpha_i.mm(x_i)

        # final layer classification
        y_i = self.final.mul(m_i).sum(dim=1).add(self.final_bias)

        # save attention
        alphas.append(alpha_i)
        y.append(y_i)

    r = torch.stack(y)
    # print(torch.cuda.is_available())
    torch.cuda.empty_cache()
    # print('current memory allocated before: {}'.format(torch.cuda.memory_allocated() / 1024 ** 2))
    # print('max memory allocated before: {}'.format(torch.cuda.max_memory_allocated() / 1024 ** 2))
    # print('cached memory before: {}'.format(torch.cuda.memory_cached() / 1024 ** 2))
    alpha = torch.stack(alphas)
    torch.cuda.empty_cache()
    # final sigmoid to get predictions
    yhat = F.sigmoid(r)
    if target is not None:
        loss = self.get_loss(yhat, target)
    else:
        loss = 0
    return yhat, loss, alpha

You can pass the dim argument as the second argument to F.softmax:

alpha_i = F.softmax(self.U(x_i).t(), dim=1)
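
dim selects the dimension along which the values are normalized to sum to 1. Assuming self.U(x_i).t() is a 2D tensor of shape (num_labels, seq_len), dim=1 normalizes the attention scores over the sequence positions for each label. A minimal standalone illustration (the shapes here are made up for the example):

import torch
import torch.nn.functional as F

scores = torch.randn(3, 5)        # assumed shape: (num_labels, seq_len)
alpha = F.softmax(scores, dim=1)  # normalize over the sequence dimension
print(alpha.sum(dim=1))           # each row now sums to 1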

What are you doing with alpha after the forward call? Currently you are storing it in a tensor, which will also store the computation graph and thus increase the memory. If you just want to use alpha for visualizations or debugging, you should detach the tensors from the computation graph using alphas.append(alpha_i.detach()).
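
A small standalone sketch of the difference (the layer sizes match the model summary printed later in this thread; the sequence length is arbitrary):

import torch
import torch.nn.functional as F

U = torch.nn.Linear(500, 4987)
x_i = torch.randn(234, 500)              # assumed shape: (seq_len, conv_channels)

alpha_i = F.softmax(U(x_i).t(), dim=1)
print(alpha_i.requires_grad)             # True: storing it keeps the autograd graph alive
print(alpha_i.detach().requires_grad)    # False: safe to keep for visualization/debugging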

Thanks for the help. About alphas, I don’t see any further usage… but maybe I am wrong. The full code is available here (the softmax call occurs at line 93):

Even after setting dim=1 I still get an OOM with big training text files on a 4 GB GPU. There are actually two “memory black holes” in the code: one in models.py at the level of the forward function, and another one that is killing me (apparently I will have to buy a bigger GPU) is the “classical” memory hole at the level of loss.backward() in this file, at line 174:

I am training the model on a text file with 1.7 million lines, averaging about 500 characters per line, and I can’t get through it even with batch_size=1 (OOM after several thousand trained lines). If the training file drops to about 200K rows, I can finish the training with batch_size=4. I have tried everything possible on earth (except accumulating losses, which I don’t know how to implement yet in this code; a generic sketch of that pattern follows the error below) to get out of the never-ending:

RuntimeError: CUDA out of memory. Tried to allocate 438.63 MiB (GPU 0; 4.00 GiB total capacity; 2.64 GiB already allocated; 389.80 MiB free; 5.39 MiB cached) but it seems impossible.
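For reference, the usual gradient-accumulation pattern looks roughly like the sketch below. This is generic code, not taken from the repository: it assumes the model(data, target) signature and the optimizer name used elsewhere in this thread, and a hypothetical loader variable. Note that it mainly simulates a larger effective batch size; it does not reduce the peak memory of a single forward/backward pass.

accum_steps = 4                                  # assumed: mini-batches to accumulate per optimizer step

optimizer.zero_grad()
for batch_idx, (data, target) in enumerate(loader):
    output, loss, _ = model(data, target)
    (loss / accum_steps).backward()              # gradients add up in .grad across iterations
    if (batch_idx + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()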

It looks like alpha is never used in the other scripts, so this shouldn’t be the problem (although it looks like a bug to me).

Are you seeing an increasing memory usage?
The size of your dataset shouldn’t change the memory requirements, so I’m wondering why the smaller dataset seems to work.

The memory increases progressively every 500-1000 batches until it reaches the maximum GPU capacity. I have tried gc.collect(), torch.cuda.empty_cache(), and deleting the loss, input and target after optimizer.step() (once the loss value has been appended to the losses[] list), but none of them helps.

I even printed out what happens after each of these memory-cleaning attempts, and it seems the memory never gets cleaned:

optimizer.zero_grad()

output, loss, _ = model(data, target)

gc.collect()

loss.backward()
optimizer.step()

losses.append(loss.item())
# print("Cleaning after loss")
# optimizer.zero_grad()

torch.cuda.empty_cache()

if not quiet and batch_idx % print_every == 0:
    # print the average loss of the last 100 batches
    print("Train epoch: {} [batch #{}, batch_size {}, seq length {}]\tLoss: {:.6f}".format(
        epoch+1, batch_idx, data.size()[0], data.size()[1], np.mean(losses[-100:])))
    gc.collect()
    print("memory before cuda.emptycache")
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_cached())
    torch.cuda.empty_cache()
    optimizer.zero_grad()
    del loss, data, target
    print("memory after cuda.emptycache")
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_cached())

Train epoch: 1 [batch #0, batch_size 2, seq length 234] Loss: 0.693175
memory before cuda.emptycache
288670720
319029248
memory after cuda.emptycache
288626176
319029248
999it [01:35, 10.82it/s]Train epoch: 1 [batch #1000, batch_size 2, seq length 491] Loss: 0.001480
memory before cuda.emptycache
298898432
329252864
memory after cuda.emptycache
298849792
329252864
1999it [03:14, 10.23it/s]Train epoch: 1 [batch #2000, batch_size 2, seq length 560] Loss: 0.001411
memory before cuda.emptycache
301651968
332005376
memory after cuda.emptycache
301602304
332005376
2444it [03:57, 9.56it/s]Traceback (most recent call last):----------here it dies

Maybe the smaller datasets produce a smaller embedding file (processed.embed), and since this is fully loaded onto the GPU it leaves less free memory when it is bigger (about 80 MB with the smaller datasets, 200 MB with the bigger ones). See the quick size calculation after the model printout below.

This is what happens when training:

loading pretrained embeddings…
adding unk embedding

Conv_Attn(
(embed_drop): Dropout(p=0.2)
(embed): Embedding(121674, 100)
(conv): Conv1d(100, 500, kernel_size=(3,), stride=(1,), padding=(1,))
(U): Linear(in_features=500, out_features=4987, bias=True)
)
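
For scale, here is a quick calculation of the parameter memory implied by that printout (float32 assumed, 4 bytes per value); gradients roughly double this during training:

embed = 121674 * 100            # Embedding(121674, 100)
conv  = 500 * 100 * 3 + 500     # Conv1d(100, 500, kernel_size=3), weights + bias
U     = 4987 * 500 + 4987       # Linear(500, 4987), weights + bias
total = embed + conv + U
print(total)                    # ~14.8M parameters
print(total * 4 / 1024**2)      # ~56 MiB of weights in float32

So the weights themselves are nowhere near 4 GB; the activations, and anything that keeps the autograd graph alive, are the more likely culprit.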

Well, the problem seems to come from the number of classes (labels) or from the training file size. I cut the collected embeddings and the vocabulary in half and still get the same OOM. The only difference from the “successful” training situation is the number of labels and the size of the training file.
The OOM is currently occurring when stacking the alphas (strangely) in the forward method.
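
A rough back-of-the-envelope for what the stacked alphas alone cost, using the printed U: Linear(500, 4987) and a sequence length of 560 from the earlier log (float32 assumed):

num_labels, seq_len, batch_size = 4987, 560, 8
bytes_per_float = 4
alpha_bytes = batch_size * num_labels * seq_len * bytes_per_float
print(alpha_bytes / 1024**2)    # ~85 MiB just for the stacked attention weights

Since each alpha_i is also saved by the autograd graph (it is needed for the backward of the matmul) in addition to the stacked copy, the real cost is several times this, and it grows linearly with both the sequence length and the number of labels, which matches the observation that more labels and longer training files trigger the OOM.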

Thanks

Hi again, I did some experiments with batch_size 4, 6 and 8, and the OOM always occurs at the same number of iterations for a given batch size: at 1288 iterations for 4, at 372 for 6, and at 102 for 8. Something is very strange; I don’t know where this fixed limit comes from. The memory that blows up is the cached GPU memory, which reaches 3 or 4 times its usual level:

93it [00:36, 3.31it/s]stack - cached memory before: 610.75
94it [00:36, 3.13it/s]stack - cached memory before: 618.0
95it [00:36, 2.99it/s]stack - cached memory before: 625.5
96it [00:37, 2.88it/s]stack - cached memory before: 616.25
97it [00:37, 2.72it/s]stack - cached memory before: 588.375
98it [00:38, 2.74it/s]stack - cached memory before: 571.625
99it [00:38, 2.81it/s]stack - cached memory before: 576.0
100it [00:38, 2.87it/s]stack - cached memory before: 583.375
101it [00:39, 2.89it/s]stack - cached memory before: 421.0
102it [00:39, 3.24it/s]stack - cached memory before: 2360.75