MemoryError when spawning workers under Windows

Hi,

I am getting a MemoryError when iterating over train_loader in the following code snippet:

    # Closure passed to LBFGS via optimizer.step(optimizer_step)
    def optimizer_step():
        optimizer.zero_grad()
        losses = []
        print(h.heap())  # h is a guppy hpy() instance
        for x_batch, y_batch in train_loader:
            # print(f"batch_size: {x_batch.shape}, {y_batch.shape}")
            prediction = model(x_batch)
            loss = loss_function(prediction, y_batch)
            print("Loss: ", loss.item())
            loss.backward()
            losses.append(loss.unsqueeze(0))
        losses = torch.cat(losses, dim=0)
        return torch.sum(losses)
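
For completeness, train_loader is built roughly like this (the dataset, batch size, and worker count below are simplified placeholders, not my exact values); the num_workers setting is what makes the first iteration of the loop spawn worker processes:

    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset standing in for my real training data
    train_dataset = TensorDataset(x_train, y_train)

    train_loader = DataLoader(
        train_dataset,
        batch_size=64,   # placeholder value
        shuffle=True,
        num_workers=8,   # >0 switches to the multiprocessing code path
    )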

The thing that’s odd is that I’m running 64-bit Python under Windows 10 with more than 30 GB of free RAM. When I run guppy’s hpy().heap() I see this memory usage:

Partition of a set of 642221 objects. Total size = 72264693 bytes.
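
In case it matters, the heap snapshot above comes from guppy used roughly like this (h is the same object referenced in the snippet above):

    from guppy import hpy  # guppy3 package on Python 3

    h = hpy()
    print(h.heap())  # prints live objects partitioned by type, with their total size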

And here is the full traceback:

Traceback (most recent call last):
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\com.mycompany.aggregator\pytorch-ours.py", line 181, in <module>
    optimizer.step(optimizer_step)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\optim\optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\optim\lbfgs.py", line 311, in step
    orig_loss = closure()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\com.mycompany.aggregator\pytorch-ours.py", line 157, in optimizer_step
    for x_batch, y_batch in train_loader:
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\utils\data\dataloader.py", line 354, in __iter__
    self._iterator = self._get_iterator()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
    w.start()
  File "C:\Python39\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python39\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python39\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Python39\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python39\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
MemoryError

How do I go about debugging what is going on? I tried reducing num_workers to 1 and the problem still occurs. Setting num_workers to 0 avoids the problem, but this operation is computation-intensive, so I want to spread it across multiple worker processes.
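
To take LBFGS out of the picture, I was thinking of isolating just the worker-spawn step along these lines (same placeholder dataset as above); does that sound like a reasonable way to narrow it down?

    from torch.utils.data import DataLoader

    if __name__ == "__main__":  # needed on Windows, where workers start via spawn
        loader = DataLoader(train_dataset, batch_size=64, num_workers=1)
        # Fetching a single batch forces one worker to be spawned, which is
        # where the dataset gets pickled and sent to the child process.
        x_batch, y_batch = next(iter(loader))
        print(x_batch.shape, y_batch.shape)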

I guess Windows is running into issues with IPC and/or shared memory (I’m not deeply familiar with Windows’ limitations around IPC). Were multiple workers working before in this setup, or were you always hitting this issue?

> Were multiple workers working before in this setup, or were you always hitting this issue?

It’s hard to tell. I’ve used multiple workers with code samples I found online. They also ran out of memory for no good reason, but only if I used a large number of workers. Using 1-2 workers seemed to work fine; using 8 workers (matching the number of CPU cores in my machine) complained about a lack of memory even though the system had plenty left.

When I try using workers in my own (custom) project, I can’t even use a single worker.

I suspect there is a bug with workers, but my specific project triggers that bug earlier than other projects.

If your script is trying to serialize something like 100 GB at once (more than memory plus swap), you won’t see anything unusual from the outside. Set a breakpoint in ForkingPickler.dumps and see whether manually serializing your big object succeeds.
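
If you don’t want to attach a debugger, a quick way to get roughly the same information is to measure the pickle yourself; plain pickle is close enough here, since ForkingPickler is a thin subclass of it. train_dataset stands for whatever object you pass into the DataLoader:

    import pickle

    # Roughly what multiprocessing has to serialize for each spawned worker
    # on Windows. If this raises or the size is huge, you've found the culprit.
    blob = pickle.dumps(train_dataset, protocol=pickle.HIGHEST_PROTOCOL)
    print(f"pickled dataset: {len(blob) / (1024 * 1024):.1f} MiB")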

I eventually figured out that one of my model’s fields was transitively referencing a large number of samples, all of which were being serialized and deserialized across the worker processes.
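
For anyone who hits the same thing, the general shape of the fix (heavily simplified, not my actual code) was to make sure the objects that get pickled to the workers only hold cheap references, and to load the heavy samples lazily inside the worker:

    from torch.utils.data import Dataset

    class SampleDataset(Dataset):
        def __init__(self, sample_paths):
            # Keep only paths (cheap to pickle) instead of the decoded
            # samples, so spawning a worker copies a small list of strings.
            self.sample_paths = sample_paths

        def __len__(self):
            return len(self.sample_paths)

        def __getitem__(self, index):
            # load_sample() is a stand-in for whatever reads/decodes one sample.
            return load_sample(self.sample_paths[index])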