Mmap memory error when using multiple CPUs on Azure

Hi all,

I am using multiple CPUs to train my model on Azure with MongoDB. It seems I need to open a connection to the data in each of the worker processes (roughly as sketched below). Then I got this error:
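For context, this is a minimal sketch of the kind of per-process setup I mean; the MongoDB URI, database/collection names, worker count, and the tiny stand-in model are placeholders, not my actual code:

import torch.nn as nn
import torch.multiprocessing as mp
from pymongo import MongoClient

class TinyModel(nn.Module):            # placeholder for my actual model
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(10000, 300)

def train_worker(rank, model):
    # each worker process opens its own MongoDB connection
    client = MongoClient("mongodb://localhost:27017")   # placeholder URI
    collection = client["mydb"]["train_data"]           # placeholder names
    for doc in collection.find():
        pass  # build a batch from doc and run a training step on model

if __name__ == "__main__":
    model = TinyModel()
    model.share_memory()               # share parameters with the workers
    workers = [mp.Process(target=train_worker, args=(rank, model))
               for rank in range(4)]   # e.g. one process per CPU
    for p in workers:
        p.start()
    for p in workers:
        p.join()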

Traceback (most recent call last):
  File "main.py", line 225, in <module>
    model.share_memory()
  File "/home/textiq/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 468, in share_memory
    return self._apply(lambda t: t.share_memory_())
  File "/home/textiq/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 118, in _apply
    module._apply(fn)
  File "/home/textiq/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 124, in _apply
    param.data = fn(param.data)
  File "/home/textiq/anaconda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 468, in <lambda>
    return self._apply(lambda t: t.share_memory_())
  File "/home/textiq/anaconda/lib/python3.6/site-packages/torch/tensor.py", line 86, in share_memory_
    self.storage().share_memory_()
  File "/home/textiq/anaconda/lib/python3.6/site-packages/torch/storage.py", line 101, in share_memory_
    self._share_fd_()
RuntimeError: $ Torch: unable to mmap memory: you tried to mmap 0GB. at /py/conda-bld/pytorch_1493681908901/work/torch/lib/TH/THAllocator.c:317

Could someone tell me what to do to solve this problem?

Thanks in advance.


I am using Ubuntu 16.04, PyTorch, Linux 4.4.0-81-generic, and Python 3.6.

This is weird. I wonder if Azure is somehow limiting the shared memory available to your process. Are you running Docker inside Azure?
Also, what's the output of:

ipcs -lm

Thanks for your reply. I just figured out what happened. I was not running Docker inside Azure. The problem was that I mistakenly initialized an nn.Embedding in the model with a size of 0 (for example, nn.Embedding(0, 300)). That zero-sized embedding produced this error when I called model.share_memory(). I have fixed it now.
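For anyone who hits the same message, here is a minimal sketch of the mistake; the vocabulary sizes are placeholders, and the failure is what I observed with the PyTorch build in the traceback above:

import torch.nn as nn

bad = nn.Embedding(0, 300)       # the bug: zero-sized vocabulary, empty weight tensor
bad.share_memory()               # for me this raised "unable to mmap memory: you tried to mmap 0GB"

good = nn.Embedding(10000, 300)  # placeholder for the real vocabulary size
good.share_memory()              # works once the size is non-zero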


Thanks for figuring this out. We'll improve the error message in this situation; you can track it at https://github.com/pytorch/pytorch/issues/1878

Really appreciate your prompt reply!