DistNetworkError when using the multiprocessing_context parameter of the PyTorch DataLoader

For specific reasons I need to use the spawn start method to create the worker processes of a PyTorch DataLoader. Here is a minimal demo:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
import lightning

fabric = lightning.Fabric(devices=[0, 2], num_nodes=1, strategy='ddp')

class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()  # nn.Module must be initialized before submodules are assigned
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        return self.linear(x)

if __name__ == '__main__':
    x = torch.randn(100, 10)
    y = torch.rand(100, 2)
    dataset = TensorDataset(x, y)
    # Crashes when multiprocessing_context='spawn' is passed
    train_loader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=10, shuffle=True,
                   num_workers=1, multiprocessing_context='spawn'))
    model = LinearModel()
    crit = nn.MSELoss()
    model, optimizer = fabric.setup(model, optim.Adam(model.parameters(), lr=0.01))
    for epoch in range(0, 10):
        print(f'Epoch {epoch}')
        for xs, ys in train_loader:
            output = model(xs)
            loss = crit(output, ys)

It crashes with this error:

# https://pastebin.com/BqA9mjiE
Epoch 0
Epoch 0
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. 
The server socket has failed to bind to [::]:55733 (errno: 98 - Address already in use). 
The server socket has failed to bind to (errno: 98 - Address already in use).

Port 55733 was already bound by the training processes, so the second attempt to bind it fails.

What I want to know is: why is the port bound a second time when multiprocessing_context is 'spawn'?

I am using PyTorch 2.2.2 and Fabric 2.4.0.

Thanks in advance for any reply.

I’m not familiar with Lightning Fabric (it may be worth asking the Lightning folks in their forum), but you would see “Address already in use” during the bootstrap phase if two or more processes try to create a server store during initialization. Maybe Lightning has already created a store to rendezvous on, and the DataLoader then conflicts with it. Does it still error if the DataLoader is created with num_workers=0?
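To illustrate the failure mode itself, here is a minimal stdlib-only sketch (it does not use torch.distributed, and the helper name second_bind_errno is made up for this example). It binds a listening socket to a free port and then tries to bind a second socket to the same address, which is roughly what happens when two processes both try to host a server store on one port:

```python
import errno
import socket


def second_bind_errno():
    """Bind two TCP sockets to the same address; return the errno of the second bind."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so the sketch does not
        # depend on a specific port such as 55733 being available.
        first.bind(("127.0.0.1", 0))
        first.listen()
        port = first.getsockname()[1]
        try:
            # Same address and port as the listening socket above.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


err = second_bind_errno()
print("second bind failed with EADDRINUSE:", err == errno.EADDRINUSE)
```

On Linux, errno.EADDRINUSE is 98, the same code that appears in the traceback above.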

Is the “spawn” argument actually related? Does it work if you remove it?
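One way “spawn” could matter, offered as an assumption rather than a confirmed diagnosis: unlike fork, the spawn start method starts a fresh interpreter that re-imports the main module, so module-level statements (such as the fabric = lightning.Fabric(...) line in the demo) run again inside each DataLoader worker. Whether Fabric actually binds a port at that point depends on how the script is launched and has not been verified here. The sketch below only demonstrates the re-import using the standard library; the file name spawn_demo.py and the function names are hypothetical.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical child script: one module-level statement plus a spawned worker.
SCRIPT = textwrap.dedent(
    '''
    import multiprocessing as mp

    # Module-level code, analogous to creating an object at import time.
    print(f"top-level code ran in {__name__}", flush=True)

    def work():
        pass

    if __name__ == "__main__":
        worker = mp.get_context("spawn").Process(target=work)
        worker.start()
        worker.join()
    '''
)


def top_level_runs():
    """Run the script in a separate interpreter and collect its top-level prints."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "spawn_demo.py")
        with open(path, "w") as handle:
            handle.write(SCRIPT)
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, check=True
        )
    return [line for line in result.stdout.splitlines() if line.startswith("top-level")]


runs = top_level_runs()
for line in runs:
    print(line)
```

The module-level print runs once in the parent (as __main__) and once more in the spawned worker (as __mp_main__). If that turns out to be the cause, creating the Fabric object inside the if __name__ == '__main__': block would be one thing to try.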