Hello,
I am attempting to initialize and allocate space for ~10,000 small, single-hidden-layer MLPs in shared memory. Here is how the models are created:
def model_factory(device, hidden_size=128, init_type='xavier_uniform', share_memory=True):
    model = AntNN(hidden_size=hidden_size, init_type=init_type).to(device)
    model.apply(model.init_weights)
    if share_memory:
        model.share_memory()  # move parameter storages into shared memory
    return model

mlps = []
device = torch.device('cpu')
for _ in range(num_policies):  # num_policies is ~10,000
    mlp = model_factory(device, 128, share_memory=True)
    mlps.append(mlp)
    print(f'RAM Memory % used: {psutil.virtual_memory()[2]}')
I’m keeping track of the total RAM usage, and this is the last line that printed before the error:
RAM Memory % used: 56.8
So I clearly have more than enough memory available. Here is the “cannot allocate memory” error:
Traceback (most recent call last):
File "/home/user/map-elites/testing.py", line 164, in <module>
mlp = model_factory(device, 128, share_memory=True)
File "/home/user/map-elites/models/ant_model.py", line 11, in model_factory
model.to(device).share_memory()
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1805, in share_memory
return self._apply(lambda t: t.share_memory_())
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
module._apply(fn)
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
param_applied = fn(param)
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1805, in <lambda>
return self._apply(lambda t: t.share_memory_())
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/_tensor.py", line 482, in share_memory_
self.storage().share_memory_()
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/storage.py", line 480, in share_memory_
self._storage.share_memory_()
File "/home/user/miniconda3/envs/map-elites/lib/python3.8/site-packages/torch/storage.py", line 160, in share_memory_
self._share_filename_()
RuntimeError: unable to mmap 65600 bytes from file </torch_6936_3098808190_62696>: Cannot allocate memory (12)
Process finished with exit code 1
With share_memory=False this works just fine, but for my application it is critical that these tensors live in shared memory, because they are modified by different processes. Is this a bug, or a fundamental limitation of how shared memory works in PyTorch? Is there any way to get around this problem?
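To make the requirement concrete, here is a toy sketch of the access pattern I need (not my actual AntNN / map-elites code; the worker function and the tiny Sequential net are just placeholders): a child process updates a model in place, and the parent must see the change.

import torch
import torch.multiprocessing as mp

def worker(model):
    # In-place update from the child process; the parent observes it because
    # the parameter storages live in shared memory.
    with torch.no_grad():
        for p in model.parameters():
            p.add_(1.0)

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
    model.share_memory()  # move parameter storages into shared memory up front
    proc = mp.Process(target=worker, args=(model,))
    proc.start()
    proc.join()
    print(model[0].weight[0, :3])  # reflects the child's in-place update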
EDIT: I noticed that PyTorch’s default sharing strategy, file_descriptor, opens a lot of file descriptors, and that I might be hitting my system’s soft (or hard) open-files limit. So I tried increasing the limit from 1024 → 1,000,000 and was able to make it to
Num mlps: 6966 / 10,000
RAM Memory % used: 62.0
before running into the mmap error. I tried playing around with different values for the maximum number of open file descriptors allowed by the system and couldn’t get past that point, so I don’t think the number of allowed file descriptors is the bottleneck anymore. I also tried
torch.multiprocessing.set_sharing_strategy('file_system')
since it seems to keep fewer file descriptors open per tensor, but this didn’t help either.
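For reference, this is roughly how I bumped the limit and switched strategies. I actually raised the limit with ulimit -n in the shell before launching; the Python below is just the equivalent via the resource module, and the numbers come from my setup rather than anything PyTorch prescribes.

import resource
import torch.multiprocessing as mp

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'open-files limit: soft={soft}, hard={hard}')  # soft was 1024 initially
# Raise the soft limit up to the hard limit (raising the hard limit itself
# needs root / limits.conf).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

print(mp.get_sharing_strategy())        # 'file_descriptor' is the default on Linux
mp.set_sharing_strategy('file_system')  # the alternative strategy I tried
print(mp.get_all_sharing_strategies())  # {'file_descriptor', 'file_system'}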