Memory allocation errors when initializing a large number of small feed-forward networks in shared memory, despite having enough RAM

Hello,

I am attempting to initialize and allocate space for ~10,000 small, single-hidden-layer MLPs in shared memory. Here is how the models are created:

import psutil
import torch

def model_factory(device, hidden_size=128, init_type='xavier_uniform', share_memory=True):
    model = AntNN(hidden_size=hidden_size, init_type=init_type).to(device)
    model.apply(model.init_weights)
    if share_memory:
        # move all parameters and buffers into shared memory
        model.share_memory()
    return model

mlps = []
device = torch.device('cpu')
for _ in range(num_policies):
    mlp = model_factory(device, 128, share_memory=True)
    mlps.append(mlp)
    print(f'RAM Memory % used: {psutil.virtual_memory()[2]}')
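For reference, AntNN itself is just a tiny single-hidden-layer MLP, roughly along these lines (a sketch of the real class in models/ant_model.py; the obs_dim and act_dim values here are placeholders):

import torch
import torch.nn as nn

class AntNN(nn.Module):
    # Sketch only: the real input/output dimensions differ.
    def __init__(self, hidden_size=128, init_type='xavier_uniform', obs_dim=32, act_dim=8):
        super().__init__()
        self.init_type = init_type
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, act_dim),
        )

    def init_weights(self, m):
        # Applied via model.apply(); initializes each Linear layer in place.
        if isinstance(m, nn.Linear):
            if self.init_type == 'xavier_uniform':
                nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)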

I’m keeping track of total RAM usage, and this is the last line printed before the error:

RAM Memory % used: 56.8

So I clearly have more than enough memory available. Here is the “cannot allocate memory” error:

Traceback (most recent call last):
  File "/home/user/map-elites/testing.py", line 164, in <module>
    mlp = model_factory(device, 128, share_memory=True)
  File "/home/user/map-elites/models/ant_model.py", line 11, in model_factory
    model.to(device).share_memory()
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1805, in share_memory
    return self._apply(lambda t: t.share_memory_())
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1805, in <lambda>
    return self._apply(lambda t: t.share_memory_())
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/_tensor.py", line 482, in share_memory_
    self.storage().share_memory_()
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/storage.py", line 480, in share_memory_
    self._storage.share_memory_()
  File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/storage.py", line 160, in share_memory_
    self._share_filename_()
RuntimeError: unable to mmap 65600 bytes from file </torch_6936_3098808190_62696>: Cannot allocate memory (12)

Process finished with exit code 1

With share_memory=False this works just fine, but for my application it is critical that these tensors live in shared memory, because they are modified by different processes. Is this a bug, or a fundamental limitation of how shared memory works in PyTorch? Is there any way around this problem?
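For context, the access pattern I need looks roughly like this (a simplified sketch; perturb_params_ stands in for the real per-process mutation logic, and plain nn.Linear modules stand in for the policies):

import torch
import torch.nn as nn
import torch.multiprocessing as mp

def perturb_params_(model, std=0.01):
    # Placeholder for the real mutation: add Gaussian noise to every parameter in place.
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * std)

def worker(shared_mlps, indices):
    # Each worker mutates its share of the models; because the parameter tensors
    # live in shared memory, the parent process sees the modified weights.
    for i in indices:
        perturb_params_(shared_mlps[i])

if __name__ == '__main__':
    mlps = [nn.Linear(4, 4) for _ in range(8)]  # stand-ins for the real policies
    for m in mlps:
        m.share_memory()
    before = mlps[0].weight.clone()

    procs = [mp.Process(target=worker, args=(mlps, range(k, len(mlps), 2))) for k in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print("parent sees the workers' updates:", not torch.equal(before, mlps[0].weight))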

EDIT: I noticed that PyTorch’s default sharing strategy, file_descriptor, opens a lot of file descriptors, and that I might be hitting my system’s soft (or hard) open-files limit. So I tried increasing the limit from 1024 to 1,000,000 and was able to make it to

Num mlps: 6966 / 10,000
RAM Memory % used: 62.0

before running into the mmap error again. I tried several different values for the maximum number of open file descriptors allowed by the system and couldn’t get past that number, so I no longer think the bottleneck is the file-descriptor limit. I also tried torch.multiprocessing.set_sharing_strategy('file_system'), since it seems to keep fewer file descriptors open per tensor, but this didn’t help either.
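For reference, I was checking and changing these settings from inside the script roughly like this (a sketch; raising the soft limit beyond the hard limit still requires changing the hard limit at the OS level):

import resource
import torch.multiprocessing as mp

# Inspect and raise the soft open-files limit (up to the current hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'RLIMIT_NOFILE: soft={soft}, hard={hard}')
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Switch the tensor sharing strategy before any models are moved to shared memory.
print(f'sharing strategy: {mp.get_sharing_strategy()}')
mp.set_sharing_strategy('file_system')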