Hello! I am trying to set up a training script using DistributedDataParallel (DDP) where the model switches between training and evaluation modes. However, when I try to switch into evaluation mode with model = model.eval(), model becomes a NoneType. I also tried model = model.train(False), but the result was the same.
My issue is reproducible by modifying the DDP example as follows:
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.drop1 = nn.Dropout(p=0.6)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.drop1(self.net1(x))))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # Training mode
    print("Training")
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Evaluation mode
    print("Evaluating")
    ddp_model = ddp_model.eval()
    outputs = ddp_model(torch.randn(20, 10))

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 1)
What is the proper way of switching between modes in DDP? (Or is it not intended to be switched?)
Strangely enough, I am using version 1.5.1, and the line returning self is present in the train() function. I even tried to reinstall 1.5.1 after cleaning the conda cache. Then I created a new conda environment and installed PyTorch with Python 3.8 (I was originally using 3.7). However, the problem was still there.
The only thing I did not try was installing the nightly builds, as I could not download them within 7 minutes and lost patience.
However, if the intended way of switching is no different from the non-DistributedDataParallel case, then I am glad. I was just starting out with DistributedDataParallel and was not sure whether it's possible to switch modes, or whether one has to set the mode before applying the wrapper, or some other magic.
DDP's train() and eval() should work as expected. Just remember to wrap the forward pass in torch.no_grad() when running in eval mode.
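To illustrate the pattern, here is a minimal sketch on a plain module with dropout (DDP behaves the same way, since train() and eval() propagate to the wrapped module); the module layout is made up for the example:

import torch
import torch.nn as nn

# Toy module with dropout, so train/eval modes actually differ
model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.6), nn.Linear(10, 5))

# Training step: module in train mode, autograd on
model.train()
out = model(torch.randn(20, 10))

# Evaluation: switch mode AND disable autograd
model.eval()
with torch.no_grad():
    eval_out = model(torch.randn(20, 10))

assert not model.training          # dropout is now a no-op
assert not eval_out.requires_grad  # no graph was built under no_grad

Note that eval() only changes layer behavior (dropout, batch norm); it is torch.no_grad() that skips building the autograd graph and saves memory during inference.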
In the meantime I also realized that, since my only intention is to switch my model to evaluation mode, I can accomplish it with model.eval() alone; there is no real need for model = model.eval().
I leave this here for future reference, to aid people like me.
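For the record, both spellings are equivalent because train() and eval() return the module itself (a fluent interface), so the assignment is harmless but redundant. A quick check:

import torch.nn as nn

m = nn.Linear(4, 2)

# eval() flips the training flag and returns the same object
same = m.eval()
assert same is m
assert m.training is False

# train() likewise returns self, so calls can be chained or assigned
assert m.train() is m
assert m.training is True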