Hierarchical model goes to NaN if rebuilt during training session, works if rebuilt at beginning of training session

I've been running into a strange issue that I'm not too sure about. The model I've been working on uses progressive resizing and a hierarchical architecture to help with training, and it has started failing seemingly out of nowhere (this previously worked without issue, although that was back on what I believe was torch 1.8). When I build the new stage from a checkpoint of the previous stage at the beginning of a training session, everything works fine. When my code switches to the next resolution itself, mid-session, the loss goes to NaN immediately. I have variables that specify which stage to start training on and which stage to load the checkpoint from, so in the working case I would have current_resolution = 1, load_stage = 0. I'm using mixed precision training, but that has worked in the past, so I don't believe it's the issue.

For initial loading (when starting the app fresh at a specified stage), the code looks like the following:

    current_resolution = 0
    load_stage = 0

    # fields used below: [1] = learning rate, [4] = whether this stage wraps a new
    # Net around the previous one, [5] = stage index passed to Net
    generation_config = [
        ("G://cs16_sr44100_hl16384_nf128_of0", 1e-3, 32, 16, True, 0, 8, 4, None), # 0
        ("G://cs32_sr44100_hl8192_nf256_of0", 1e-3, 32, 32, True, 1, 8, 3, None), # 1
        ("G://cs64_sr44100_hl4096_nf512_of0", 1e-3, 32, 64, True, 2, 8, 2, None), # 2
        ("G://cs128_sr44100_hl2048_nf1024_of0", 1e-3, 32, 128, True, 3, 8, 1, None), # 3
        ("cs256_sr44100_hl1024_nf2048_of0", 1e-3, 10, 256, True, 4, 8, 0, "G://cs256_sr44100_hl1024_nf2048_of0_VOCALS") # 4
    ]

    # build the hierarchy one stage at a time, loading the checkpoint at load_stage
    # (prev_requires_grad is defined elsewhere; see the edit at the bottom)
    model = None
    for stage in range(current_resolution + 1):
        if generation_config[stage][4] or model is None:
            model = Net() if model is None else Net(model, stage=generation_config[stage][5], prev_requires_grad=prev_requires_grad)

        if load_stage == stage and args.model is not None:
            model.load_state_dict(torch.load(args.model))

    model.to(device)

    # only optimize parameters that still require gradients
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=generation_config[current_resolution][1]
    )
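
For a concrete example of what that loop does: with current_resolution = 1 and load_stage = 0, it reduces to roughly the following (same logic, just unrolled by hand):

    # stage 0: build the base network and load its checkpoint
    model = Net()
    model.load_state_dict(torch.load(args.model))

    # stage 1: wrap the stage-0 model as the previous stage of the new resolution
    model = Net(model, stage=1, prev_requires_grad=prev_requires_grad)

    model.to(device)
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=generation_config[1][1]
    )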

During the training loop, once the model has gone through all of the epochs for its current stage, it rebuilds itself like so:

    # wrap the model that just finished training as the previous stage of the new
    # resolution, then create a fresh optimizer over the trainable parameters
    model = Net(model, stage=generation_config[current_resolution][5], prev_requires_grad=prev_requires_grad).to(device)
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=generation_config[current_resolution][1])
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=4, verbose=True)
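
For completeness, the mixed precision part of the loop is roughly the usual torch.cuda.amp pattern, like the sketch below (names such as dataloader and criterion are placeholders, and I'm honestly not sure whether the GradScaler should also be recreated when the optimizer is rebuilt mid-session):

    scaler = torch.cuda.amp.GradScaler()

    for X, y in dataloader:
        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            pred = model(X)
            loss = criterion(pred, y)

        # scale the loss, step through the scaler, then update its scale factor
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()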

For some reason, the first path works while the second does not, even though both run the same rebuild code. If I set current_resolution to 1 and load_stage to 0, everything works fine: the model continues learning and rapidly overtakes the previous stage. However, if I set current_resolution to 0 and load_stage to 0, run one epoch on the first stage, and then let the model rebuild itself and instantiate a new optimizer, everything shoots to NaN. Does anyone know why this would be happening? It's all the same code, though I wouldn't be surprised if I'm missing something obvious. Hopefully this is enough code to showcase the issue; I can always upload the project to GitHub if not. It doesn't appear to be a problem with the rebuilding code itself, since I am able to build a hierarchical model from the checkpoint without issue. Not exactly a huge issue, but it definitely makes it harder to leave this training while I sleep, which is unfortunate.

Any tips on where to look for issues here would be greatly appreciated! This has worked in the past, so I'm at a bit of a loss, although that was on a previous version of PyTorch. I somehow doubt this is on the PyTorch side, but I did recently upgrade to 1.10.

Edit: After doing a bit of poking around, it seems the hierarchical model works perfectly fine as long as prev_requires_grad is set to True (this flag is passed to each module of the previous stage using requires_grad_()). If it's False and the previous stage is locked, I am unable to rebuild the model during training and have to start a new training session in order for it to not go to NaN immediately.
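
To clarify what prev_requires_grad does: it just gets forwarded to requires_grad_() on the previous stage's modules inside the Net constructor, conceptually something like this (a heavily simplified sketch, not the real class):

    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self, prev_model=None, stage=0, prev_requires_grad=False):
            super().__init__()
            self.prev = prev_model

            if self.prev is not None:
                # lock (or unlock) every module of the previous stage
                for module in self.prev.modules():
                    module.requires_grad_(prev_requires_grad)

            # ...new modules for the current stage are created here...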

So, to recap: loss goes to NaN if training starts on stage 0 and rebuilds for stage 1, but only if prev_requires_grad is set to False and the rebuild happens within the same training session. Loss does not go to NaN with prev_requires_grad set to False if a new training session is started and the previous stage's checkpoint is loaded before building the next stage. Loss also does not go to NaN if prev_requires_grad is set to True; in that case the model continues learning after rebuilding itself without issue (again, only if prev_requires_grad is True; if it's False, a new training session is required for stability).

Still confused as to what happened, but I'm guessing it could be something relating to autograd that I'm unaware of. I coded a simple workaround over lunch: instead of reusing the same model, I set it to None and rebuild it the way I do at the beginning, loading from the checkpoint, and now it all works nicely.
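
For reference, the workaround looks roughly like this at the end of a stage (simplified; checkpoint_path stands in for wherever the finished stage was saved, and load_stage here is the stage that just finished):

    # save the finished stage, drop the live model, and rebuild it exactly the way
    # the startup code does, loading the checkpoint along the way
    torch.save(model.state_dict(), checkpoint_path)
    model = None

    for stage in range(current_resolution + 1):
        if generation_config[stage][4] or model is None:
            model = Net() if model is None else Net(model, stage=generation_config[stage][5], prev_requires_grad=prev_requires_grad)

        if stage == load_stage:
            model.load_state_dict(torch.load(checkpoint_path))

    model.to(device)
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=generation_config[current_resolution][1]
    )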