Hi, I am trying to tweak a model using forward propagation (I am not a power user, sorry). I get a warning about argument changes for the softmax call, and I also get an OOM on the GPU at the same place. Can somebody give me a hint on how to introduce the dim=X argument in the softmax call? Maybe this is the reason I get the OOM… Thanks
UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
def forward(self, x, target):
    # get embeddings and apply dropout
    x = self.embed(x)
    x = self.embed_drop(x)
    x = x.transpose(1, 2)
    # apply convolution and nonlinearity (tanh)
    x = F.tanh(self.conv(x).transpose(1,2))
    y = []
    ss = []
    alphas = []
    for x_i in x:
        # apply attention
        # HERE COMES THE WARNING!!!!!!!!!!!!!!!
        alpha_i = F.softmax(self.U(x_i).t())
        # document representations are weighted sums using the attention. Can compute all at once as a matmul
        m_i = alpha_i.mm(x_i)
        # final layer classification
        y_i = self.final.mul(m_i).sum(dim=1).add(self.final_bias)
        # save attention
        alphas.append(alpha_i)
        y.append(y_i)
    r = torch.stack(y)
    # print(torch.cuda.is_available())
    torch.cuda.empty_cache()
    # print('current memory allocated before: {}'.format(torch.cuda.memory_allocated() / 1024 ** 2))
    # print('max memory allocated before: {}'.format(torch.cuda.max_memory_allocated() / 1024 ** 2))
    # print('cached memory before: {}'.format(torch.cuda.memory_cached() / 1024 ** 2))
    alpha = torch.stack(alphas)
    torch.cuda.empty_cache()
    # final sigmoid to get predictions
    yhat = F.sigmoid(r)
    if target is not None:
        loss = self.get_loss(yhat, target)
    else:
        loss = 0
    return yhat, loss, alpha
You can pass the dim argument as the second argument to F.softmax:
alpha_i = F.softmax(self.U(x_i).t(), dim=1)
What are you doing with alpha after the forward call? Currently you are collecting the alpha_i tensors in a list and stacking them into a tensor, which also keeps the computation graph alive and thus increases memory usage. If you just want to use alpha for visualizations or debugging, you should detach the tensors from the computation graph using alphas.append(alpha_i.detach()).
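To make the two suggestions concrete, here is a minimal sketch of the attention step with an explicit dim and a detached copy of the weights. The layer U and all sizes are made-up stand-ins for illustration, not the actual model from the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes and layer; the real model defines its own self.U.
num_labels, hidden, seq_len = 5, 8, 12
U = nn.Linear(hidden, num_labels)

x_i = torch.randn(seq_len, hidden)       # one document: (seq_len, hidden)
scores = U(x_i).t()                      # (num_labels, seq_len)

# dim=1 normalizes over sequence positions, so each label's weights sum to 1.
alpha_i = F.softmax(scores, dim=1)
m_i = alpha_i.mm(x_i)                    # (num_labels, hidden), still tracked by autograd

# Keep only a detached copy for inspection, so the saved list
# does not hold a reference to the computation graph.
alphas = []
alphas.append(alpha_i.detach())
```

The detached copy shares storage with alpha_i but has no grad history, so it is fine for plotting or debugging but cannot be used to backpropagate.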
Thanks for the help. As for alphas, I don’t see any further usage… but maybe I am wrong. The full code is available here (the softmax call occurs at line 93):
Even after setting dim=1 I still get an OOM with big training text files on a 4 GB GPU. There are actually two “memory black holes” in the code: one in models.py in the forward function, and another one that is killing me (apparently I will have to buy a bigger GPU board), a “classical” memory hole at loss.backward() in this file, at line 174:
I am training the model on a text file with 1.7 million lines, averaging 500 characters per row, and I can’t succeed even with batch_size=1 (OOM after several thousand training lines). If the training file drops to about 200K rows, I can finish the training with batch_size=4. I have tried everything I could think of (except accumulating losses, which I don’t know how to implement in this code yet) to get out of the never-ending error below, but it seems impossible:
RuntimeError: CUDA out of memory. Tried to allocate 438.63 MiB (GPU 0; 4.00 GiB total capacity; 2.64 GiB already allocated; 389.80 MiB free; 5.39 MiB cached)
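Since accumulating losses came up, here is a rough sketch of gradient accumulation, which is what that usually means in practice. The model, data, loss and accum_steps below are hypothetical placeholders, not taken from the linked code. Note that this emulates a larger batch while keeping only one small batch in memory at a time; it will not by itself fix an OOM that already happens with batch_size=1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, optimizer, loss and data loader.
model = nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
data = [(torch.randn(1, 10), torch.randint(0, 2, (1, 3)).float()) for _ in range(8)]

accum_steps = 4      # assumed value: effective batch = accum_steps * batch_size
running_loss = 0.0
num_updates = 0

optimizer.zero_grad()
for step, (x, target) in enumerate(data, start=1):
    loss = criterion(model(x), target)
    # Scale so the summed gradients match the average over the larger batch.
    (loss / accum_steps).backward()
    # .item() returns a Python float, so no graph is kept for bookkeeping.
    running_loss += loss.item()
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        num_updates += 1
```

With 8 samples and accum_steps = 4 the optimizer steps twice, each time on gradients gathered from 4 forward/backward passes.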
It looks like alpha is never used in the other scripts, so this shouldn’t be the problem (although it looks like a bug to me).
Are you seeing an increasing memory usage?
The size of your dataset shouldn’t change the memory requirements, so I’m wondering why the smaller dataset seems to work.
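One way to check whether usage really grows is to log the allocator counters every few hundred iterations. A small sketch; gpu_memory_mib is a hypothetical helper name, and it uses torch.cuda.memory_reserved(), the newer name for what the commented-out prints in the forward method read via memory_cached().

```python
import torch

def gpu_memory_mib():
    """Return current CUDA allocator counters in MiB, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated": torch.cuda.memory_allocated() / mib,          # live tensors
        "max_allocated": torch.cuda.max_memory_allocated() / mib,  # peak so far
        "reserved": torch.cuda.memory_reserved() / mib,            # held by the caching allocator
    }

# Example: call this every N iterations inside the training loop.
stats = gpu_memory_mib()
print(stats)
```

If "allocated" keeps climbing across iterations, something is holding references to tensors; if only "reserved" is large, the caching allocator is holding blocks that PyTorch can reuse.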
The memory increases progressively, every 500-1000 iterations, until it reaches the maximum GPU capacity. None of the following helps: gc.collect(), cuda.empty_cache(), or deleting the loss, input and target after optimizer.step() once the loss has been added to the losses[] array.
Maybe the smaller datasets produce a smaller embedding file (processed.embed), and since this file is fully loaded into GPU memory, a bigger one leaves less memory free (80 MB with the smaller datasets, 200 MB with the bigger one).
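Regarding the losses[] array mentioned above: a frequent source of slowly growing memory is appending the loss tensor itself rather than its Python value, because the tensor keeps a reference to its autograd history. Whether that is what happens in the linked training script is only a guess. A minimal sketch of the difference, with a made-up model and data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss and batch.
model = nn.Linear(10, 3)
criterion = nn.BCEWithLogitsLoss()
x = torch.randn(4, 10)
target = torch.randint(0, 2, (4, 3)).float()

loss = criterion(model(x), target)

# Appending the tensor keeps its grad_fn (and anything it references) alive.
losses_as_tensors = [loss]

# Appending the Python float keeps only a number.
losses_as_floats = [loss.item()]

print(losses_as_tensors[0].grad_fn is not None)  # the tensor still points at the graph
print(type(losses_as_floats[0]))
```

If the training loop does losses.append(loss), switching to losses.append(loss.item()) is a cheap thing to rule out.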
Well, the problem seems to come from the number of classes (labels) or from the training file size. I cut the size of the collected embeddings and the vocabulary in half and still get the same OOM. The only differences from the “successful” training run are the number of labels and the size of the training file.
The OOM is currently occurring at stacking alphas (strangely) in the forward method.
Hi again, I did some experiments with batch_size 4, 6 and 8, and the OOM always occurs at the same number of iterations for each batch size: at 1288 iterations for 4, at 372 for 6, and at 102 for 8. Something is very strange; I don’t know where this fixed limit comes from. The memory that blows up is the cached GPU memory, which reaches 3 or 4 times the normal level: