How to switch model from training to evaluation?

Hello! I am trying to set up a training script using DistributedDataParallel (DDP) where the model switches between training and evaluation modes. However, when I switch to evaluation mode with model = model.eval(), model becomes None (NoneType). I also tried model = model.train(False), but the result was the same.

My issue can be reproduced by modifying the DDP example as follows:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.drop1 = nn.Dropout(p=0.6)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.drop1(self.net1(x))))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # Training mode
    print("Training")
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Evaluation mode
    print("Evaluating")
    ddp_model = ddp_model.eval()
    outputs = ddp_model(torch.randn(20, 10))

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 1)

What is the proper way of switching between modes with DDP? (Or is it not intended to be switched?)

Thank you in advance!

I am pinging this as somebody might have more insight into it than me :slight_smile:

Hello, this issue hasn’t been fixed in 1.5.0, but has been fixed in 1.5.1:

v1.5.0:

    def train(self, mode=True):
        super(DistributedDataParallel, self).train(mode)
        for module in self._module_copies[1:]:
            module.train(mode)

does not return self.

v1.5.1

    def train(self, mode=True):
        self.training = mode
        for module in self.children():
            module.train(mode)
        return self

returns self.
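To illustrate why a missing return leads to a NoneType: nn.Module.eval() simply returns the result of self.train(False), so any subclass that overrides train() without returning self makes eval() return None. Below is a minimal sketch of that effect. NoReturnWrapper is a hypothetical stand-in written for this illustration, not the actual DDP class, and it runs on CPU without a process group.

```python
import torch.nn as nn


class NoReturnWrapper(nn.Module):
    """Hypothetical wrapper whose train() override forgets to return self."""

    def __init__(self, module):
        super().__init__()
        self.module = module

    def train(self, mode=True):
        # The mode is still set recursively on this module and its children,
        # but the return value of the parent call is discarded.
        super().train(mode)


wrapped = NoReturnWrapper(nn.Linear(4, 2))

# eval() forwards to train(False), so it returns whatever train() returns.
result = wrapped.eval()

print(result)                   # the value that `model = model.eval()` would bind
print(wrapped.training)         # mode flag on the wrapper
print(wrapped.module.training)  # mode flag on the wrapped module
```

In this sketch the mode switch itself still takes effect in place; only the reassignment pattern (model = model.eval()) loses the reference to the model.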

Thank you for your answer.

Strangely enough, I am using version 1.5.1, and the line returning self is present in the train() function. I even tried reinstalling 1.5.1 after cleaning the conda cache. Then I created a new conda environment and installed PyTorch with Python 3.8 (I was originally using 3.7). However, the problem was still there.

The only thing I did not try was installing the nightly build, as I could not download it within 7 minutes and lost patience. :sweat_smile:

However, if the intended way of switching is no different from the non-DistributedDataParallel case, then I am glad. I was just starting out with DistributedDataParallel and was not sure whether it's possible to switch modes, or whether one has to define the mode before using the wrapper, or some other magic.

It looks like that return is still missing, at least in master. I am not sure whether an earlier change was applied and then reverted. I am adding it in pytorch/pytorch Pull Request #42131, "Let DDP.train() return self to stay consistent with nn.Module".

I was just starting out with DistributedDataParallel and was not sure whether it’s possible to switch modes, or one has to define the mode before using the wrapper or some other magic.

DDP’s train() and eval() should work as expected. Just remember to wrap the forward pass in torch.no_grad() when running in eval mode.
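A minimal sketch of that pattern follows. It assumes a plain CPU model built from the same layers as ToyModel rather than a DDP-wrapped one, so it can run without a process group; the calls on a DDP wrapper should look the same. Note that eval() is called in place, without reassigning the result.

```python
import torch
import torch.nn as nn

# Same layer layout as ToyModel above, built with nn.Sequential for brevity.
model = nn.Sequential(
    nn.Linear(10, 10),
    nn.Dropout(p=0.6),
    nn.ReLU(),
    nn.Linear(10, 5),
)

# Switch to evaluation mode in place; the return value is not needed.
model.eval()

# no_grad() disables gradient tracking, which saves memory during evaluation.
with torch.no_grad():
    x = torch.randn(20, 10)
    out1 = model(x)
    out2 = model(x)

# Dropout is inactive in eval mode, so two passes over the same input match.
print(torch.equal(out1, out2))
print(out1.requires_grad)

# Switch back before the next training step.
model.train()
```

eval() and no_grad() do different jobs: the former changes layer behavior (dropout, batch norm), the latter only stops autograd from recording operations, so evaluation usually wants both.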

I just tried it out with 1.6.0, but it seems your commit did not make it into that release (or the issue is elsewhere :slight_smile: )

On the other hand, thank you very much for mentioning torch.no_grad()! It was a feature I was not aware of yet, and helped me out tremendously.

Lastly, thank you @iffiX and @mrshenli for taking your time to answer.
Both of you were a big help!

It will be included in v1.7. The branch cut date for v1.6 was a few weeks ago.

In the meantime I also realized that, since my only intention is to switch my model to evaluation mode, I can accomplish that with a plain model.eval() call; there is no real need for model = model.eval().

I am leaving this here for future reference, to help people like me. :slightly_smiling_face:

In addition to what @mrshenli said, use this:

with torch.no_grad():
    self.model.module.eval()
    out = self.model(input)

instead of this:

with torch.no_grad():
    self.model.eval()
    out = self.model(input)