Understanding the ResnetGenerator in the CycleGAN Model

Imahn · April 4, 2021, 7:42pm

In their well-known paper on CycleGAN (https://arxiv.org/pdf/1703.10593v7.pdf), the authors introduce and implement the CycleGAN model.

A CycleGAN has two discriminators and two generators. The authors also provide the code for their networks. I am particularly interested in the generator(s), which are implemented here:

https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/master/models/networks.py#L119-L159


def define_G(input_nc, output_nc, ngf, netG, norm='batch', use_dropout=False, init_type='normal', init_gain=0.02, gpu_ids=[]):
    """Create a generator

    Parameters:
        input_nc (int) -- the number of channels in input images
        output_nc (int) -- the number of channels in output images
        ngf (int) -- the number of filters in the last conv layer
        netG (str) -- the architecture's name: resnet_9blocks | resnet_6blocks | unet_256 | unet_128
        norm (str) -- the name of normalization layers used in the network: batch | instance | none
        use_dropout (bool) -- if use dropout layers.
        init_type (str)    -- the name of our initialization method.
        init_gain (float)  -- scaling factor for normal, xavier and orthogonal.
        gpu_ids (int list) -- which GPUs the network runs on: e.g., 0,1,2

    Returns a generator

    Our current implementation provides two types of generators:
        U-Net: [unet_128] (for 128x128 input images) and [unet_256] (for 256x256 input images)
        The original U-Net paper: https://arxiv.org/abs/1505.04597

(The excerpt is truncated here; the full function is in the file referenced above.)

So they use a ResnetGenerator, but I’m afraid I do not really understand it yet (cf. lines 119-159 and 315-373).

For the generator, why do we have both downsampling (Conv2d) and upsampling (ConvTranspose2d) layers? The generators I have seen so far use only ConvTranspose2d layers, and their input is noise sampled from a uniform or Gaussian distribution…

That’s why I am confused…

ptrblck · April 6, 2021, 7:38am

Based on Section 7.1 of the paper, the authors reuse the image transformation network from "Perceptual Losses for Real-Time Style Transfer and Super-Resolution", which uses this bottleneck architecture. I can’t find more details about this choice, so I would assume that this model architecture simply worked well for their implementation.
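
One way to read the bottleneck: a CycleGAN generator maps an image to an image rather than a noise vector to an image, so it first has to shrink the input (Conv2d with stride 2) before it can grow it back (ConvTranspose2d). The sketch below only traces spatial sizes using the standard output-size formulas for the two layer types. The kernel, stride, and padding values are my assumptions about what the ResnetGenerator uses (7x7 convs behind a reflection padding of 3, and 3x3 stride-2 layers with padding 1 and output_padding 1), so please check them against networks.py. The function names are made up for illustration.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Spatial output size of a ConvTranspose2d layer (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_generator_sizes(size=256, n_downsampling=2):
    """Return (stage, spatial size) pairs for an encoder / residual / decoder generator.

    All layer hyperparameters here are assumptions, not read from the repository.
    """
    trace = [("input image", size)]
    # Stem: reflection padding of 3, then an unpadded 7x7 conv, keeps the size.
    size = conv2d_out(size + 2 * 3, kernel=7)
    trace.append(("7x7 conv stem", size))
    # Encoder: each stride-2 conv halves the resolution.
    for i in range(n_downsampling):
        size = conv2d_out(size, kernel=3, stride=2, padding=1)
        trace.append((f"downsample {i + 1}", size))
    # Residual blocks work at the reduced resolution and do not change it.
    trace.append(("residual blocks", size))
    # Decoder: each stride-2 transposed conv doubles the resolution.
    for i in range(n_downsampling):
        size = conv_transpose2d_out(
            size, kernel=3, stride=2, padding=1, output_padding=1
        )
        trace.append((f"upsample {i + 1}", size))
    # Output layer: same padding scheme as the stem, so the size is preserved.
    size = conv2d_out(size + 2 * 3, kernel=7)
    trace.append(("7x7 conv output", size))
    return trace


if __name__ == "__main__":
    for stage, size in trace_generator_sizes():
        print(f"{stage:>16}: {size}x{size}")
```

Under these assumptions the output has the same resolution as the input, which is what an image-to-image translation needs. A DCGAN-style generator has no encoder half because it starts from a low-dimensional noise vector instead of an image.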