DataParallel imbalanced memory usage

I am having the same imbalance issue, but in my case it is GPU 1, not GPU 0, that runs out of memory. Both GPUs have 32 GB of memory. With nvidia-smi I can see that GPU 0 uses only about 6 GB, whereas GPU 1 goes all the way to 32 GB.
I could have understood it if it were the other way around, with GPU 0 running out of memory, but this is weird.
I only pass my model to DataParallel, so it is using the default values.
Also, if I use only a single GPU, I don't get any out-of-memory issues, which is also strange to me.
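
For reference, per-device usage can also be checked from inside the script; a minimal sketch (note that torch.cuda.memory_allocated only counts tensors owned by this process, so it can read lower than what nvidia-smi shows):

for i in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(i) / 1024 ** 3   # tensors currently allocated by this process
    reserved_gb = torch.cuda.memory_reserved(i) / 1024 ** 3     # memory held by the caching allocator
    print(f"cuda:{i} allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")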
Any help would be appreciated.
P.S. I was getting a warning about the RNN parameters not being in contiguous memory, so I also added a flatten_parameters() call in the forward method of the LSTM.

cudaID = str(torch.cuda.current_device())
device = torch.device("cuda:" + cudaID)
print('device = ', device)  # this prints cuda:0

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    encoder = torch.nn.DataParallel(encoder)
    lstm_model = torch.nn.DataParallel(lstm_model)


encoder.to(device)
lstm_model.to(device)
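
To confirm what the default values resolve to, the DataParallel wrapper exposes the chosen devices directly; a minimal sketch, assuming the models were actually wrapped:

if isinstance(encoder, torch.nn.DataParallel):
    print('device_ids =', encoder.device_ids)        # expect [0, 1] with the defaults
    print('output_device =', encoder.output_device)  # expect 0 (where outputs are gathered)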

# forward method of the LSTM (tn = torch.nn.utils.rnn)

def forward(self, inputs, mode='train'):
    packed = tn.pack_sequence(inputs, enforce_sorted=False)
    # pass the device of `packed` so that `packed` and self.hidden end up on the same
    # device; self.hidden is created on every call and I'm using multiple GPUs
    self.hidden = self.init_hidden(len(inputs), packed.data.device)
    self.lstm.flatten_parameters()
    if mode == 'eval' or mode == 'test':
        with torch.no_grad():
            packed_out, self.hidden = self.lstm(packed, self.hidden)
    else:
        packed_out, self.hidden = self.lstm(packed, self.hidden)

 
    outputs, lens = tn.pad_packed_sequence(packed_out, batch_first=True)

    return outputs
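
For context, init_hidden creates fresh hidden/cell states on the given device on every call (as noted in the comment above); a minimal sketch of what such a helper looks like, where the zero initialization and the num_layers / hidden_size attribute names are placeholders:

def init_hidden(self, batch_size, device):
    # fresh (h0, c0) created on the same device as the packed input;
    # num_layers / hidden_size are placeholder attribute names here
    h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device)
    c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device)
    return (h0, c0)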