OOM when using multiprocessing (even though memory is there)

I’m getting the following error:

RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can’t allocate memory: you tried to allocate 2454416848 bytes. Error code 12 (Cannot allocate memory)

Here is the relevant part of the code that triggered it:

import time
from multiprocessing import Pool

import torch

# `lines` (the input data) and `process` (the worker function) are defined earlier in the script.
pool = Pool(70)

start = time.time()
print("length of lines: ", len(lines))
line_tensors = pool.map(process, lines)
print("lines processed")
print(time.time() - start)

line_tensors = [x for x in line_tensors if x is not None]

size = 0
for l in line_tensors:
    size += l.shape[0]
print("total size: ", size)

megatensor = torch.cat(line_tensors, dim=0).flatten()

The total size that is printed is 306802106. The error occurs on the last line:

Traceback (most recent call last):
  File "bert.py", line 73, in <module>
    megatensor = torch.cat(line_tensors, dim=0).flatten()
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 2454416848 bytes. Error code 12 (Cannot allocate memory)
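
For what it's worth, the requested size matches the printed element count exactly if the line tensors are stored as int64 (8 bytes per element), which I assume they are:

# assuming int64 tensors, i.e. 8 bytes per element
print(306802106 * 8)  # 2454416848 -- exactly the number of bytes in the error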

It looks like an out-of-memory error, but I’m pretty sure I have 2.5 GB of memory available on my machine. This similar (?) piece of code also seems to run fine:

import torch

x = []
for i in range(100):
    x.append(torch.randint(1, 10000, (100000000,)))
    print(i)

mega = torch.cat(x, dim=0).flatten()

Any thoughts on what might be wrong here?

Maybe you are running out of shared memory. Check whether your system limits it and increase the limit if needed.
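
In case it helps, tensors returned from worker processes are typically passed through shared memory (on Linux usually /dev/shm), so it is worth checking how much of it is available. A rough sketch of the check and two possible workarounds (the 8G size below is only an example value):

import shutil

import torch.multiprocessing as mp

# Check how much shared memory is available; on most Linux systems,
# tensors shared between processes are backed by files in /dev/shm.
total, used, free = shutil.disk_usage("/dev/shm")
print("/dev/shm free: %.1f GB of %.1f GB" % (free / 1e9, total / 1e9))

# If it is too small, it can usually be enlarged (as root), e.g.:
#   mount -o remount,size=8G /dev/shm
# Alternatively, switch PyTorch to the 'file_system' sharing strategy,
# which places the shared files on the regular filesystem instead:
mp.set_sharing_strategy("file_system")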