Sharing an LSTM through multiprocessing results in a non-contiguous chunk of memory warning

Hi :smile:

I tried training an LSTM network with multiple processes (but a single GPU) using torch.multiprocessing, but I can’t get rid of this warning:
UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

Below is a simple script that reproduces it. I tried calling flatten_parameters() right after model creation, before the model is shared between processes ((1) in the code), but it makes no difference at all. The other place I tried is (2), where each process gets its own copy, but then the memory is no longer shared.

Without flattened parameters the networks use a whole lot more GPU memory (around two times more; a rough way to check this is sketched right after the script below). Is there a way to solve this issue and still call flatten_parameters()?

I’m using Manjaro OS.

import time
import torch
from torch import nn
import torch.multiprocessing as mp


num_proc = 3
units = 1024


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.lstm = nn.LSTM(units, units, num_layers=8)

    def forward(self, x):
        return self.lstm(x)


def train(pid, model):
    time.sleep(pid + 1)
    x = torch.randn(1, 1, units).cuda()

    # (2) Calling flatten_parameters() here makes the warning
    # go away, but it also makes the memory no longer shared.
    # model.lstm.flatten_parameters()

    while True:
        model(x)

        with torch.no_grad():
            for w in model.parameters():
                w += 1.

        print(pid, list(model.parameters())[0][0, 0].item(), flush=True)
        time.sleep(num_proc)


def main():
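    # CUDA tensors can only be shared between processes with the
    # 'spawn' (or 'forkserver') start method, hence the line below.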
    mp.set_start_method('spawn')
    pool = mp.Pool(num_proc + 1)

    model = Net().cuda()
    # (1) This call makes no difference, whether it is placed
    # before or after share_memory().
    # model.lstm.flatten_parameters()
    model.share_memory()

    processes = []
    for pid in range(num_proc):
        processes.append(pool.apply_async(train, args=(pid, model)))

    try:
        for p in processes:
            p.get(timeout=1000000.)
    except KeyboardInterrupt:
        print('Terminating pool...')
        pool.terminate()


if __name__ == '__main__':
    main()
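
To put a rough number on the “around two times more” observation, something like the helper below can be dropped into train() (just a sketch: torch.cuda.max_memory_allocated() reports the caching allocator’s peak for the current process, so it won’t match nvidia-smi exactly, and report_peak is a throwaway name, not part of the repro):

import torch


def report_peak(pid):
    # Peak GPU memory allocated by this process so far, in MiB.
    # Handy for comparing runs with and without flatten_parameters().
    peak = torch.cuda.max_memory_allocated() // 2**20
    print(pid, 'peak MiB:', peak, flush=True)

Calling report_peak(pid) once per iteration of the while loop in train() should show the difference between the flattened and unflattened runs.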

I’m also seeing this with shared memory on Ubuntu 16.04: I printed out the parameters and there is indeed no memory sharing going on if you call flatten_parameters() in the subprocesses.
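
For what it’s worth, a quick way to double-check that from inside a worker (a rough sketch; check_pointers is just a throwaway helper): compare the parameters’ data pointers before and after flatten_parameters(). If they change, the weights were copied into a new buffer and no longer alias the allocation received from the parent, so updates from the other processes won’t be visible anymore.

def check_pointers(model, pid):
    # If flatten_parameters() reallocates the weights, their data
    # pointers change, i.e. they no longer alias the storage that
    # was shared by the parent process.
    before = [p.data_ptr() for p in model.parameters()]
    model.lstm.flatten_parameters()
    after = [p.data_ptr() for p in model.parameters()]
    moved = sum(b != a for b, a in zip(before, after))
    print(pid, 'parameters moved:', moved, 'of', len(before), flush=True)

Calling check_pointers(model, pid) at the top of train() in the script above (in place of the commented-out flatten_parameters() call) makes it easy to see whether the weights still point at the shared storage.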
