Problem with memory usage and poor results using LSTMs over words, characters and labels

This discussion is a follow-up to a previous one, which actually changed subject with my intervention:

I am continuing here so as not to bother all the participants of the previous discussion.

In order to answer Simon W.'s questions (at the bottom of the previous discussion):

I see; however, I quickly changed this, and the results on the training data improved. They go above 99% accuracy already at the second iteration (taking padding into account).

char_hidden is initialised like this:

self.char_hidden = (autograd.Variable(torch.zeros(self.num_directions, self.batch_size, self.char_hidden_dim).type(dtype), requires_grad=False, volatile=vflag),
                    autograd.Variable(torch.zeros(self.num_directions, self.batch_size, self.char_hidden_dim).type(dtype), requires_grad=False, volatile=vflag))

dtype is torch.cuda.FloatTensor when CUDA is enabled (which is the case here); otherwise it is torch.FloatTensor.
vflag is a flag which is True when the model is run in testing mode.
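As a side note, in PyTorch 0.4 and later `autograd.Variable` and the `volatile` flag are deprecated: plain tensors carry `requires_grad`, and inference is wrapped in `torch.no_grad()`. A minimal sketch of an equivalent initialisation (the sizes below are hypothetical, chosen only for illustration):

```python
import torch

# Hypothetical dimensions, standing in for the model's real hyperparameters.
num_directions, batch_size, char_hidden_dim = 2, 10, 50

# Pick the GPU when available, mirroring the dtype switch described above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors default to requires_grad=False, so no Variable wrapper is needed;
# the old volatile=True case is covered by running under torch.no_grad().
char_hidden = (
    torch.zeros(num_directions, batch_size, char_hidden_dim, device=device),
    torch.zeros(num_directions, batch_size, char_hidden_dim, device=device),
)
```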

It’s actually both, in different senses:
- Memory usage on RAM is increasing; slightly, but increasing.
- Memory usage on the GPU is large and constant: it takes 5.7GB for the model and only one batch of data (10 sentences). A colleague of mine loads the whole data set on the GPU, which is also a bigger data set than mine, and his script takes only 2.5GB.
- When I run without CUDA, my script takes 1GB in total on RAM (so for the whole dataset, not just the current batch, plus the model and a few other things).
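To make such GPU numbers comparable, one option is to query PyTorch's own allocator rather than an external tool; a small sketch using `torch.cuda.memory_allocated` (which reports tensors allocated by PyTorch, not the cached or driver overhead):

```python
import torch

def gpu_memory_mb():
    """Return MB of GPU memory currently allocated by PyTorch tensors,
    or 0.0 when CUDA is not available."""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / 1024**2
    return 0.0
```

Calling this before and after building the model and moving a batch to the GPU would show how much of the 5.7GB is tensors versus other overhead.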

Thank you once again for your answers.

There is a bug in cuDNN that makes RNNs leak CPU memory. It leaks very slowly; that is probably the cause of your first observation. Your second observation, however, does not look like a memory leak. There is probably some part of the graph that could be optimized. Could you try using a list of tensors for char_rep, char_hidden and hidden_state? Storing things in slices sometimes leads to copying the entire tensor multiple times.
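To illustrate the suggested change, here is a hedged sketch contrasting the two patterns with a toy stand-in for the RNN step (the loop body is hypothetical; the point is only the storage pattern):

```python
import torch

seq_len, hidden_dim = 4, 3

# Pattern that can bloat the graph: writing each step's output into a
# slice of a preallocated tensor via in-place assignment.
out_sliced = torch.zeros(seq_len, hidden_dim)
h = torch.zeros(hidden_dim, requires_grad=True)
for t in range(seq_len):
    h = torch.tanh(h + 1.0)  # stand-in for one recurrent step
    out_sliced[t] = h        # in-place slice assignment

# Suggested alternative: collect per-step tensors in a Python list and
# stack them into one tensor only once, at the end.
steps = []
h = torch.zeros(hidden_dim, requires_grad=True)
for t in range(seq_len):
    h = torch.tanh(h + 1.0)
    steps.append(h)
out_listed = torch.stack(steps, dim=0)
```

Both variants produce the same values; the list-plus-stack version just keeps each step as its own node until the final `torch.stack`, instead of repeatedly mutating one large tensor inside the graph.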