Faceswap Training

Hi there!

I wanted to train a model for faceswapping, which basically uses a shared encoder feeding 2 decoders (one per face). I didn’t put them all in one model, but kept it modular, as shown in the code below. My question is: if I don’t include the encoder’s weights in an optimizer, do they ever get updated during training?

    import copy

    import torch
    import torch.nn as nn

    # Split the pretrained model into a shared encoder
    # and one deep-copied decoder per identity
    encoder = model.module.encoder
    decoder_potter = copy.deepcopy(model.module.decoder)
    decoder_chua = copy.deepcopy(model.module.decoder)

    # Loss and one optimizer per decoder
    criterion = nn.MSELoss()
    params_chua = list(decoder_chua.parameters())
    params_potter = list(decoder_potter.parameters())

    optimizer_chua = torch.optim.SGD(
        filter(lambda p: p.requires_grad, params_chua),
        lr=args.lr,
        weight_decay=args.weight_decay)
    optimizer_potter = torch.optim.SGD(
        filter(lambda p: p.requires_grad, params_potter),
        lr=args.lr,
        weight_decay=args.weight_decay)

Basically, I only have these two optimizers during training. My training loop can be seen below:

    from itertools import cycle

    from tqdm import tqdm

    total_chua_loss = 0.0
    total_potter_loss = 0.0

    for i, (img_chua, img_potter) in tqdm(enumerate(zip(train_loader_chua, cycle(train_loader_potter)))):
        img_chua, img_potter = img_chua[0].cuda(), img_potter[0].cuda()

        # Reconstruct Chua through the shared encoder and his own decoder
        img_chua_recon = decoder_chua(encoder(img_chua))
        chua_loss = criterion(img_chua_recon, img_chua)
        total_chua_loss += chua_loss.item()  # .item() keeps the graph from being retained
        optimizer_chua.zero_grad()
        chua_loss.backward()
        optimizer_chua.step()

        # Same procedure for Potter with his own decoder
        img_potter_recon = decoder_potter(encoder(img_potter))
        potter_loss = criterion(img_potter_recon, img_potter)
        total_potter_loss += potter_loss.item()
        optimizer_potter.zero_grad()
        potter_loss.backward()
        optimizer_potter.step()

Thank you very much.

No, only parameters that were passed to an optimizer (and that have valid .grad attributes) will be updated by the corresponding optimizer.step() call (unless you manipulate the parameters manually, of course, but that’s not the question here).
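
As a quick sanity check, here’s a minimal sketch (with hypothetical tiny Linear modules standing in for your encoder and decoder) showing that the encoder’s gradients are computed but its weights never change:

    import torch
    import torch.nn as nn

    # Hypothetical tiny stand-ins for the real encoder/decoder
    encoder = nn.Linear(4, 4)
    decoder = nn.Linear(4, 4)

    # Only the decoder's parameters are registered in the optimizer
    optimizer = torch.optim.SGD(decoder.parameters(), lr=0.1)

    x = torch.randn(2, 4)
    loss = nn.MSELoss()(decoder(encoder(x)), x)
    loss.backward()

    w_before = encoder.weight.detach().clone()
    optimizer.step()

    print(encoder.weight.grad is not None)                  # True: gradients were computed
    print(torch.equal(encoder.weight.detach(), w_before))   # True: weights are unchanged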

Note that the encoder’s parameters will still get valid gradients, since you are not detaching its output.
If you never intend to train the encoder, you could wrap its forward pass in a with torch.no_grad() guard to disable Autograd and to avoid storing the intermediate forward activations that would otherwise be needed to compute the gradients.
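
Applied to your loop, the chua reconstruction step could then look roughly like this (reusing your variable names):

    # No graph is built for the encoder, so no intermediate activations
    # are stored and its parameters never receive gradients
    with torch.no_grad():
        features_chua = encoder(img_chua)

    # Gradients only flow through the decoder
    img_chua_recon = decoder_chua(features_chua)
    chua_loss = criterion(img_chua_recon, img_chua)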

Also note that some layers will update their internal buffers in each forward pass if model.train() is used, e.g. batchnorm layers will update their running stats using the current batch statistics.
To disable this behavior you should call eval() on the affected module before executing the forward pass.
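
In your setup that could mean keeping the decoders in training mode while only the encoder runs in eval mode:

    # Decoders keep training normally
    decoder_chua.train()
    decoder_potter.train()

    # The encoder's batchnorm layers stop updating their running stats
    # (and dropout, if any, is disabled)
    encoder.eval()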

I understand. Thank you very much for the reply!