PyTorch DataLoader re-imports its own caller

I am using torch 1.9.0 in Python 3.7.6 on Windows 10.

I am finding that each DataLoader worker, when starting up, imports the script from which the DataLoader is being called. I have tried to make a minimal example, shown below.

Is this the expected behavior? It surprises me, since it seems to assume that the calling script can always be re-imported without side effects. Or am I using DataLoader incorrectly?


import inspect
import torchvision
import torch.utils.data as tu_data

def get_dl():
    batch_size = 1
    num_workers = 4
    
    print('batch_size:', batch_size)
    print('num_workers:', num_workers)
    
    transform = torchvision.transforms.ToTensor()

    mnist_data = torchvision.datasets.MNIST('./data/',
                                            transform=transform,
                                            download=True)
    data_loader = tu_data.DataLoader(mnist_data,
                                     batch_size=batch_size,
                                     shuffle=False,
                                     num_workers=num_workers,
                                     pin_memory=True)

    count = 0
    print('Entering data_loader loop....')
    for datum in data_loader:
        print('count:',count)
        count += 1
        if count > 5:
            break

if __name__ == '__main__':
    get_dl()
else:
    print('MYSTERY IMPORT!', flush=True)
    print('__name__ is',__name__)
    print('Importer is:', inspect.currentframe().f_back.f_code.co_name) # Print out how we got here. 

The output looks like this (I am not worried about the warning):

C:\Users\peria\Anaconda3\envs\segmenter\lib\site-packages\torchvision\datasets\mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ..\torch\csrc\utils\tensor_numpy.cpp:180.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Entering data_loader loop....
MYSTERY IMPORT!
__name__ is __mp_main__
Importer is: _run_code
MYSTERY IMPORT!
__name__ is __mp_main__
Importer is: _run_code
MYSTERY IMPORT!
__name__ is __mp_main__
Importer is: _run_code
MYSTERY IMPORT!
__name__ is __mp_main__
Importer is: _run_code
count: 0
count: 1
count: 2
count: 3
count: 4
count: 5

Yes, I think this is expected, since Windows uses spawn instead of fork, as described in the Windows FAQ:

The implementation of multiprocessing is different on Windows, which uses spawn instead of fork. So we have to wrap the code with an if-clause to protect the code from executing multiple times. Refactor your code into the following structure.

import torch

def main():
    for i, data in enumerate(dataloader):
        pass  # do something here

if __name__ == '__main__':
    main()
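
For what it is worth, the same re-import can be reproduced with plain multiprocessing, with no PyTorch involved at all. Here is a minimal sketch (the work() function is just a stand-in) showing that every spawned child starts a fresh interpreter, re-imports the module as __mp_main__, and therefore re-runs any module-level code:

import multiprocessing as mp

# Runs in the parent process and again in every spawned child,
# because spawn starts a fresh interpreter that re-imports this module.
print('module-level code running, __name__ is', __name__)

def work(i):
    return i * i

if __name__ == '__main__':
    mp.set_start_method('spawn')  # the only start method available on Windows
    with mp.Pool(2) as pool:
        print(pool.map(work, range(4)))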

Thank you for answering a FAQ; I hadn’t yet figured out how to search for this particular thing in the FAQs.

My example happens to have the recommended structure, just by luck. Now I see that putting everything in main() (or get_dl(), as I did) completely prevents the “side effects” I was worried about: as long as there are no executable statements outside main(), the re-import is harmless. I also gather that each worker has to import the calling script so it can find the modules and objects it needs; since it imports the whole script, it may pull in some things it does not need, but that does no harm.
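
A related trick for watching the workers start up, without relying on module-level prints: DataLoader accepts a worker_init_fn that runs inside each worker process right after the re-import, and torch.utils.data.get_worker_info() reports per-worker details there. A minimal sketch (report_worker is just an illustrative name, reusing the same MNIST dataset as above):

import torch.utils.data as tu_data
import torchvision

def report_worker(worker_id):
    # Called inside each worker process, after it has re-imported this module.
    info = tu_data.get_worker_info()
    print('worker', worker_id, 'of', info.num_workers, 'started, seed =', info.seed)

if __name__ == '__main__':
    mnist_data = torchvision.datasets.MNIST('./data/',
                                            transform=torchvision.transforms.ToTensor(),
                                            download=True)
    data_loader = tu_data.DataLoader(mnist_data,
                                     batch_size=1,
                                     num_workers=4,
                                     worker_init_fn=report_worker)
    next(iter(data_loader))  # creating the iterator starts the workers and fetches one batch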

p.s. I am always glad to see your responses to a question I have searched for; I always learn something from them.