Training deep copied modules yields unexpected results

I am currently searching for a good learning rate for training a GAN. To do this, I created a generator and a discriminator model and copied both models three times, so that I can evaluate four different learning rates.

For copying I tried both

copy.deepcopy(module)

and

module_copy = copy.deepcopy(module)
module_copy.load_state_dict(module.state_dict())

However, both approaches yield strange results during training: they strongly suggest that the second GAN does not start training from scratch but continues where the training of the first model ended, that the third GAN continues where the second ended, and so on.

I checked that the models do not share parameters; after training, the parameters of the different models have different values. I have no clue what the problem is.
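
For reference, the kind of check I mean looks roughly like this (a simplified sketch with a small stand-in model instead of the real generator; right after copying the parameter values must match, but the tensors must live in separate storage):

import copy

import torch
import torch.nn as nn

# Small stand-in for the real generator (the actual MFCCGenerator is more complex).
base_gen = nn.Sequential(nn.Linear(174, 50), nn.LayerNorm(50), nn.LeakyReLU())

# One independent copy per learning rate.
copies = [copy.deepcopy(base_gen) for _ in range(4)]

for i in range(len(copies)):
    for j in range(i + 1, len(copies)):
        for p_a, p_b in zip(copies[i].parameters(), copies[j].parameters()):
            assert torch.equal(p_a, p_b)             # same values right after copying
            assert p_a.data_ptr() != p_b.data_ptr()  # but separate underlying storage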

These are the modules of the generator; the discriminator is very similar:

0: MFCCGenerator(
  (model_before): Sequential(
    (0): CombinedLinear(
      (layers): Sequential(
        (0): Linear(in_features=174, out_features=50, bias=True)
        (1): LayerNorm(torch.Size([50]), eps=1e-05, elementwise_affine=True)
        (2): LeakyReLU(negative_slope=0.01)
      )
    )
    (1): AlwaysDropout(p=0.5)
    (2): CombinedLinear(
      (layers): Sequential(
        (0): Linear(in_features=50, out_features=15, bias=True)
        (1): LayerNorm(torch.Size([15]), eps=1e-05, elementwise_affine=True)
        (2): LeakyReLU(negative_slope=0.01)
      )
    )
  )
  (model_after): Sequential(
    (0): CombinedLinear(
      (layers): Sequential(
        (0): Linear(in_features=23, out_features=20, bias=True)
        (1): LayerNorm(torch.Size([20]), eps=1e-05, elementwise_affine=True)
        (2): LeakyReLU(negative_slope=0.01)
      )
    )
    (1): AlwaysDropout(p=0.5)
    (2): CombinedLinear(
      (layers): Sequential(
        (0): Linear(in_features=20, out_features=16, bias=True)
        (1): LayerNorm(torch.Size([16]), eps=1e-05, elementwise_affine=True)
        (2): LeakyReLU(negative_slope=0.01)
      )
    )
    (3): Linear(in_features=16, out_features=13, bias=True)
  )
)

1: Sequential(
  (0): CombinedLinear(
    (layers): Sequential(
      (0): Linear(in_features=174, out_features=50, bias=True)
      (1): LayerNorm(torch.Size([50]), eps=1e-05, elementwise_affine=True)
      (2): LeakyReLU(negative_slope=0.01)
    )
  )
  (1): AlwaysDropout(p=0.5)
  (2): CombinedLinear(
    (layers): Sequential(
      (0): Linear(in_features=50, out_features=15, bias=True)
      (1): LayerNorm(torch.Size([15]), eps=1e-05, elementwise_affine=True)
      (2): LeakyReLU(negative_slope=0.01)
    )
  )
)

2: CombinedLinear(
  (layers): Sequential(
    (0): Linear(in_features=174, out_features=50, bias=True)
    (1): LayerNorm(torch.Size([50]), eps=1e-05, elementwise_affine=True)
    (2): LeakyReLU(negative_slope=0.01)
  )
)

3: Sequential(
  (0): Linear(in_features=174, out_features=50, bias=True)
  (1): LayerNorm(torch.Size([50]), eps=1e-05, elementwise_affine=True)
  (2): LeakyReLU(negative_slope=0.01)
)

4: Linear(in_features=174, out_features=50, bias=True)

5: LayerNorm(torch.Size([50]), eps=1e-05, elementwise_affine=True)

6: LeakyReLU(negative_slope=0.01)

7: AlwaysDropout(p=0.5)

8: CombinedLinear(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=15, bias=True)
    (1): LayerNorm(torch.Size([15]), eps=1e-05, elementwise_affine=True)
    (2): LeakyReLU(negative_slope=0.01)
  )
)

9: Sequential(
  (0): Linear(in_features=50, out_features=15, bias=True)
  (1): LayerNorm(torch.Size([15]), eps=1e-05, elementwise_affine=True)
  (2): LeakyReLU(negative_slope=0.01)
)

10: Linear(in_features=50, out_features=15, bias=True)

11: LayerNorm(torch.Size([15]), eps=1e-05, elementwise_affine=True)

12: LeakyReLU(negative_slope=0.01)

13: Sequential(
  (0): CombinedLinear(
    (layers): Sequential(
      (0): Linear(in_features=23, out_features=20, bias=True)
      (1): LayerNorm(torch.Size([20]), eps=1e-05, elementwise_affine=True)
      (2): LeakyReLU(negative_slope=0.01)
    )
  )
  (1): AlwaysDropout(p=0.5)
  (2): CombinedLinear(
    (layers): Sequential(
      (0): Linear(in_features=20, out_features=16, bias=True)
      (1): LayerNorm(torch.Size([16]), eps=1e-05, elementwise_affine=True)
      (2): LeakyReLU(negative_slope=0.01)
    )
  )
  (3): Linear(in_features=16, out_features=13, bias=True)
)

14: CombinedLinear(
  (layers): Sequential(
    (0): Linear(in_features=23, out_features=20, bias=True)
    (1): LayerNorm(torch.Size([20]), eps=1e-05, elementwise_affine=True)
    (2): LeakyReLU(negative_slope=0.01)
  )
)

15: Sequential(
  (0): Linear(in_features=23, out_features=20, bias=True)
  (1): LayerNorm(torch.Size([20]), eps=1e-05, elementwise_affine=True)
  (2): LeakyReLU(negative_slope=0.01)
)

16: Linear(in_features=23, out_features=20, bias=True)

17: LayerNorm(torch.Size([20]), eps=1e-05, elementwise_affine=True)

18: LeakyReLU(negative_slope=0.01)

19: AlwaysDropout(p=0.5)

20: CombinedLinear(
  (layers): Sequential(
    (0): Linear(in_features=20, out_features=16, bias=True)
    (1): LayerNorm(torch.Size([16]), eps=1e-05, elementwise_affine=True)
    (2): LeakyReLU(negative_slope=0.01)
  )
)

21: Sequential(
  (0): Linear(in_features=20, out_features=16, bias=True)
  (1): LayerNorm(torch.Size([16]), eps=1e-05, elementwise_affine=True)
  (2): LeakyReLU(negative_slope=0.01)
)

22: Linear(in_features=20, out_features=16, bias=True)

23: LayerNorm(torch.Size([16]), eps=1e-05, elementwise_affine=True)

24: LeakyReLU(negative_slope=0.01)

25: Linear(in_features=16, out_features=13, bias=True)

Thanks in advance for any ideas you might have!

How do the results indicate that the training is being continued?
If you checked that the models have different parameters, could you just pass a tensor of all ones through the trained model and through a randomly initialized model and compare the outputs (something like the snippet below)?
I assume both the generator and discriminator are copied?
Do you also create new optimizers for each GAN?
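
Something along these lines (untested sketch; trained_generator and fresh_generator are just placeholder Linear layers here, and the input shape and device have to match your actual models):

import torch
import torch.nn as nn

# Placeholders: in the real check these would be your trained generator and a
# freshly initialized one.
trained_generator = nn.Linear(174, 13)
fresh_generator = nn.Linear(174, 13)

probe = torch.ones(1, 174)  # move to the models' device if necessary
with torch.no_grad():
    print(trained_generator(probe))
    print(fresh_generator(probe))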

I am looking at estimates of Wasserstein distances between different classes of training data and generated data. For copies of the same generator, those estimates should be almost identical before training. However, the distances at the beginning of the second generator's training look nearly identical to the distance estimates after training the first generator on 5000 batches of data.

At the moment I do not have a freshly initialized model in memory; I could run that experiment later. What I can do is compare the outputs of the four trained models that are still in memory. These are the outputs (each averaged over 1000 forward passes, because dropout is active in both evaluation and training mode; a sketch of the averaging follows the numbers):

tensor([ 1.4081,  0.6046, -0.5007,  0.1663, -0.5247, -0.4271, -0.0108,
        -0.2019,  0.0682, -0.9449,  0.0687, -0.0462, -0.0376], device='cuda:0')
tensor([ 1.2530,  0.3588, -0.2415,  0.0276,  0.1169, -0.5837, -0.2653,
        -0.7568,  0.1672, -0.4251, -0.1818, -0.0518, -0.5182], device='cuda:0')
tensor([ 1.3222, -0.5203, -0.0297,  0.7092,  0.2500, -0.6458, -0.5713,
         0.1600, -0.4600, -0.9698, -0.4975, -0.4153, -0.2564], device='cuda:0')
tensor([ 1.5271,  0.0421, -0.4180,  0.3413, -0.4186, -0.0325, -0.3692,
         0.4058,  0.1961, -0.9888, -0.1224,  0.5514,  0.0256], device='cuda:0')
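
The averaging is done roughly like this (a simplified sketch with a stand-in model; the real generator and its inputs are more complex, but the idea of averaging many dropout-affected forward passes is the same):

import torch
import torch.nn as nn

# Stand-in with dropout that stays active because the module is never put into
# eval mode, similar in spirit to the AlwaysDropout layers in the real generator.
gen = nn.Sequential(nn.Linear(174, 50), nn.LeakyReLU(), nn.Dropout(p=0.5),
                    nn.Linear(50, 13))

fixed_input = torch.ones(1, 174)
with torch.no_grad():
    # Average many stochastic forward passes so the dropout noise averages out.
    mean_output = torch.stack([gen(fixed_input) for _ in range(1000)]).mean(dim=0)
print(mean_output)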

Yes, both the generator and the discriminator are copied.

Yes, I also create a new optimizer for each GAN; a sketch of that setup is below.

I also make sure that the gradients with respect to the inputs are not kept across training batches.
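
The optimizer setup is along these lines (sketch; the stand-in model, the optimizer class, and the learning rates here are only placeholders):

import copy

import torch
import torch.nn as nn

# Stand-in generator; the real models are the ones printed above.
base_gen = nn.Sequential(nn.Linear(174, 50), nn.LeakyReLU(), nn.Linear(50, 13))

learning_rates = [1e-4, 3e-4, 1e-3, 3e-3]  # placeholder values
generators = [copy.deepcopy(base_gen) for _ in learning_rates]

# Each copy gets its own optimizer, so no optimizer state (e.g. Adam moments)
# is shared between the four training runs.
optimizers = [torch.optim.Adam(g.parameters(), lr=lr)
              for g, lr in zip(generators, learning_rates)]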

I ran the experiment again and checked the outputs of each generator before, between, and after the individual training runs. These are the results:

Before training:

tensor([-0.0532,  0.4151,  0.3573, -0.2433,  0.1307, -0.0836,  0.2847,
        -0.1178,  0.1347,  0.1278, -0.1246, -0.0588,  0.3419], device='cuda:0')
tensor([-0.0565,  0.4017,  0.3659, -0.2477,  0.1339, -0.0836,  0.2784,
        -0.1184,  0.1386,  0.1262, -0.1121, -0.0488,  0.3315], device='cuda:0')
tensor([-0.0594,  0.3994,  0.3677, -0.2517,  0.1434, -0.0990,  0.2837,
        -0.1150,  0.1401,  0.1330, -0.1092, -0.0620,  0.3439], device='cuda:0')
tensor([-0.0522,  0.4125,  0.3546, -0.2467,  0.1256, -0.0851,  0.2884,
        -0.1105,  0.1383,  0.1332, -0.1343, -0.0499,  0.3382], device='cuda:0')

After training the first generator:

tensor([ 1.2326,  0.1330, -0.2058, -0.2644, -0.1080,  0.1575,  0.2618,
         0.0960,  0.3211,  0.4803, -0.0785,  0.2011, -0.3083], device='cuda:0')
tensor([-0.0563,  0.4148,  0.3497, -0.2479,  0.1295, -0.0838,  0.2905,
        -0.0990,  0.1363,  0.1258, -0.1307, -0.0527,  0.3313], device='cuda:0')
tensor([-0.0562,  0.3983,  0.3653, -0.2451,  0.1365, -0.0950,  0.2755,
        -0.1169,  0.1376,  0.1209, -0.1104, -0.0573,  0.3361], device='cuda:0')
tensor([-0.0639,  0.3964,  0.3724, -0.2654,  0.1279, -0.1069,  0.2877,
        -0.1119,  0.1249,  0.1338, -0.1262, -0.0422,  0.3374], device='cuda:0')

After training the second generator:

tensor([ 1.2509,  0.1938, -0.1883, -0.2593, -0.0736,  0.0829,  0.1862,
         0.0384,  0.3301,  0.5389, -0.0262,  0.2488, -0.3501], device='cuda:0')
tensor([ 1.4068, -0.4611, -0.4623,  0.2457,  0.0790, -0.7879, -0.1040,
        -0.0594,  0.0446,  0.0345, -0.2796, -0.0319, -0.4630], device='cuda:0')
tensor([-0.0517,  0.4005,  0.3747, -0.2566,  0.1396, -0.0952,  0.2775,
        -0.1243,  0.1364,  0.1313, -0.0991, -0.0516,  0.3421], device='cuda:0')
tensor([-0.0709,  0.3985,  0.3652, -0.2412,  0.1333, -0.0953,  0.2782,
        -0.1052,  0.1356,  0.1420, -0.1292, -0.0532,  0.3499], device='cuda:0')

After training the third generator:

tensor([ 1.2610,  0.2675, -0.2377, -0.3138, -0.1379,  0.1104,  0.2437,
         0.0204,  0.3022,  0.5029,  0.0027,  0.2320, -0.3861], device='cuda:0')
tensor([ 1.4057, -0.4849, -0.5250,  0.3343,  0.1568, -0.8417, -0.0839,
        -0.0798,  0.0577,  0.0663, -0.3039, -0.0474, -0.4607], device='cuda:0')
tensor([ 1.4200,  0.1538,  0.4493,  1.2378,  0.8067, -1.0643, -0.2824,
         0.6857, -0.3737, -0.0873, -0.2991,  0.3936, -0.4183], device='cuda:0')
tensor([-0.0647,  0.4080,  0.3546, -0.2470,  0.1262, -0.0913,  0.2868,
        -0.0972,  0.1294,  0.1270, -0.1409, -0.0572,  0.3390], device='cuda:0')

After training all four generators:

tensor([ 1.2381,  0.1970, -0.2328, -0.3427, -0.0975,  0.1496,  0.2497,
         0.0587,  0.2780,  0.5047,  0.0048,  0.2039, -0.3336], device='cuda:0')
tensor([ 1.3951, -0.4659, -0.5315,  0.2548,  0.0990, -0.7395, -0.0883,
        -0.0192,  0.1091,  0.0959, -0.2537,  0.0104, -0.4816], device='cuda:0')
tensor([ 1.4325,  0.1591,  0.4397,  1.3557,  0.8864, -1.1142, -0.3061,
         0.6200, -0.3217, -0.0575, -0.3192,  0.4867, -0.3381], device='cuda:0')
tensor([ 1.0634, -0.4836,  0.4042, -0.5154,  0.0880, -0.2594,  0.0965,
         0.2983,  0.6161, -0.2963,  0.7156,  0.2674, -0.5459], device='cuda:0')