DataLoader with workers in Jupyter doesn't work

CodingInProgress · January 28, 2024, 7:36pm

I am using Jupyter in VSCode on Windows, and a DataLoader with num_workers > 0 doesn't work for me there. Here is a simple reproduction:

import torch

class SomeIterableDataset(torch.utils.data.IterableDataset):
  def __init__(self):
    super().__init__()
  
  def generate(self):
    while True:
      result = torch.rand((16, 8, 3))
      yield result

  def __iter__(self):
    return iter(self.generate())


def main():
  #num_workers = 0 # Works in Jupyter
  num_workers = 1 # Doesn't work in Jupyter (works when executed as .py)

  dataset = SomeIterableDataset()
  dataset_loader = torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=num_workers)

  for i, x in enumerate(dataset_loader):
    print(f'{i}: {x.size()}')
    if i > 42:
      break

if __name__ == '__main__':
  main()

This leads to:

RuntimeError: DataLoader worker (pid(s) 9156) exited unexpectedly

Does someone know if there is a workaround for this issue?

CodingInProgress · January 31, 2024, 11:39am

Would it make sense to submit a bug report for this?

ptrblck · January 31, 2024, 3:08pm

This might be related to this issue.

I think this issue is beyond the scope of PyTorch.
It should be reported against ipython/ipython, the official repository for IPython itself.
Python multiprocessing behaves differently under the spawn and fork start methods, and in interactive versus script mode.
So far, no Python multiprocessing program is supported in interactive mode with spawn.
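
To make the spawn point a bit more concrete: a spawned worker starts as a fresh interpreter and receives its objects via pickle, and pickle stores a class only as a module name plus a qualified name. A class defined in a notebook cell lives in the notebook's `__main__`, which the new process cannot import. The sketch below uses only the standard library and shows that a class imported from a real `.py` file is recorded under an importable module name. `my_datasets` and `RandomStream` are hypothetical names made up for this illustration, and the temporary directory just stands in for a file saved next to the notebook.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module name and class, standing in for the dataset above.
MODULE_NAME = "my_datasets"
MODULE_SOURCE = '''
import random

class RandomStream:
    """Endless stream of 3-element samples (stand-in for an IterableDataset)."""

    def __iter__(self):
        while True:
            yield [random.random() for _ in range(3)]
'''


def import_dataset_module():
    """Write the class to a .py file on sys.path and import it from there."""
    module_dir = Path(tempfile.mkdtemp())
    (module_dir / f"{MODULE_NAME}.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, str(module_dir))
    importlib.invalidate_caches()
    return importlib.import_module(MODULE_NAME)


def recorded_location(obj):
    """Return the (module, qualified name) pair pickle uses to find obj's class."""
    cls = type(obj)
    return cls.__module__, cls.__qualname__


datasets = import_dataset_module()
stream = datasets.RandomStream()

# The payload names the importable module, so a freshly started process
# can import that module and rebuild the object.
payload = pickle.dumps(stream)
print(recorded_location(stream))
print(recorded_location(pickle.loads(payload)))
```

Following the same idea, a commonly suggested workaround is to save the dataset class in a `.py` file next to the notebook, import it from there, and only then pass it to the DataLoader with num_workers > 0. That is an assumption based on how pickling works under spawn, not something confirmed for the exact setup in this thread.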

CodingInProgress · January 31, 2024, 5:41pm

Thanks for the answer. I missed that closed issue.
That was my assumption, though I was still surprised that I never ran into this issue in TensorFlow. They likely just implemented it differently.