For certain reasons I want to use the spawn method to create the workers of PyTorch's DataLoader. Here is a demo:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
import lightning
fabric = lightning.Fabric(devices=[0, 2], num_nodes=1, strategy='ddp')
fabric.launch()
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        return self.linear(x)
if __name__ == '__main__':
    x = torch.randn(100, 10)
    y = torch.rand(100, 2)
    dataset = TensorDataset(x, y)
    # crashed because of multiprocessing_context='spawn'
    train_loader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=10, shuffle=True,
                                                       num_workers=1, multiprocessing_context='spawn'))
    model = LinearModel()
    crit = nn.MSELoss()
    model, optimizer = fabric.setup(model, optim.Adam(model.parameters(), lr=0.01))

    for epoch in range(0, 10):
        print(f'Epoch {epoch}')
        for xs, ys in train_loader:
            optimizer.zero_grad()  # reset gradients before each step
            output = model(xs)
            loss = crit(output, ys)
            fabric.backward(loss)
            optimizer.step()
But it crashed with this error:
# https://pastebin.com/BqA9mjiE
Epoch 0
Epoch 0
……
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
The server socket has failed to bind to [::]:55733 (errno: 98 - Address already in use).
The server socket has failed to bind to 0.0.0.0:55733 (errno: 98 - Address already in use).
Port 55733 is already being listened on by the training processes, so the bind fails and the run crashes. But I want to know: why does the port get bound again when multiprocessing_context is 'spawn'?
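My current guess (not verified) is that with spawn, each DataLoader worker process re-imports the main module, and in my demo the Fabric construction and fabric.launch() sit at module level, so they run again inside the worker and try to set up DDP on the same master port. Below is a minimal sketch, without Fabric, that only demonstrates the re-import side effect of spawn workers; the dataset, loader, and print message are my own, purely for illustration:

import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Module-level side effect: runs once in the parent process and, with a
# 'spawn' worker, once more when the worker re-imports this module.
print(f'module-level code executed in pid {os.getpid()}')

if __name__ == '__main__':
    dataset = TensorDataset(torch.randn(8, 3), torch.randn(8, 1))
    loader = DataLoader(dataset, batch_size=4, num_workers=1,
                        multiprocessing_context='spawn')
    for batch in loader:
        pass
    # Expected: the "module-level code executed" line is printed twice,
    # once by the parent and once by the spawned worker.

If that re-import is indeed the cause, I assume moving the Fabric(...) construction and fabric.launch() inside the if __name__ == '__main__': guard would avoid the second bind, but I have not confirmed this.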
My PyTorch version is 2.2.2 and my Lightning Fabric version is 2.4.0.
Looking forward to your reply.