# Modifying the Weights of a Pretrained Model in `forward()` Makes Training Progressively Slower

I have two sets of tensors, A and B, that I want to use to replace all the weights of a pre-trained model. In other words, I want to use the model's forward computation, but not its own weights.

Concretely, I want to set each weight of the model to W = A + B, where A is a fixed tensor (not trainable) and B is a trainable parameter. So, in the end, my aim is to train B through the structure of the pre-trained model.
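
For a single layer, the idea would be something like this rough sketch (just to illustrate what I mean by W = A + B, not my actual code):

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4, bias=False)
A = torch.rand_like(lin.weight)                 # fixed part, not trainable
B = nn.Parameter(torch.zeros_like(lin.weight))  # trainable part

x = torch.randn(2, 4)
W = A + B                   # the effective weight I want the layer to use
out = x @ W.t()             # same computation as lin(x), but with W instead of lin.weight
out.sum().backward()
print(B.grad is not None)   # True: gradients reach B
print(A.requires_grad)      # False: A stays fixed
```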

This is my attempt:

```python
import copy

import torch.nn as nn


class Net(nn.Module):
    def __init__(self, pre_model, B):
        super(Net, self).__init__()
        self.B = B
        self.model = copy.deepcopy(pre_model)
        for params in self.model.parameters():
            params.requires_grad = False  # freeze the copied weights so they can be overwritten in forward()

    def forward(self, x, A):
        for i, params in enumerate(self.model.parameters()):
            params.copy_(A[i].detach().clone())
            params.add_(self.B[i])  # W = A + B; B stays in the autograd graph

        x = self.model(x)
        return x
```

I checked during training, and B does get updated. But the problem is that training keeps getting slower with every iteration:

```
Epoch 1:
24%|██▍ | 47/196 [00:05<00:23, 6.44it/s]
57%|█████▋ | 111/196 [00:18<00:19, 4.28it/s]
96%|█████████▋| 189/196 [00:41<00:02, 2.90it/s]
Epoch 2:
6%|▌ | 11/196 [00:04<01:14, 2.50it/s]
```

I think I have detached all the tensors correctly, but I am not sure why this happens. I hope somebody here can help me figure out what exactly is going on.
Thanks

Based on your code snippet, you are detaching `A`, which is the fixed tensor, while you are adding `B` to `params`, potentially including its entire computation graph. Could you double check this, please?
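
As a rough toy example of what I mean (assuming your `forward()` copies `A` into the weights and then adds `self.B` in place):

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4, bias=False)
lin.weight.requires_grad_(False)               # frozen, so the in-place copy_ is allowed
A = torch.rand_like(lin.weight)                # fixed part
B = nn.Parameter(torch.rand_like(lin.weight))  # trainable part

x = torch.randn(2, 4)
for step in range(3):
    lin.weight.copy_(A.detach())   # overwrite with the fixed part
    lin.weight.add_(B)             # B enters the weight's autograd graph here
    out = lin(x).sum()
    # the weight is no longer a plain frozen leaf: it now carries a grad_fn
    # tied to B's computation graph
    print(step, lin.weight.grad_fn)
```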

Basically, this is what I did to get A and B (it's a little bit different from my real code, but I've tested the snippet below and it still shows the slowdown):

```python
b = []
A = []
for params in list(pre_model.parameters()):
    A.append(torch.rand_like(params))
    b_temp = nn.Parameter(torch.rand_like(params))
    b.append(b_temp.detach().clone())
B = nn.ParameterList(b)

modelwithAB = Net(pre_model, B)
# ...
# in the training iteration
out = modelwithAB(image, A)
```

Hi @ptrblck, in another discussion I read that you suggest using a `no_grad()` context when modifying model parameters. But in my case I didn't use it, because I want B to be updated by the optimizer through the `add_` operation.

I tried using `no_grad()`, and it made the training time stable (no more slowdown), but, as I expected, B then no longer updated. I am not sure which approach is closer to the solution.
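
I also wonder whether building W = A + B outside the model and running the forward functionally would avoid this dilemma, e.g. with `torch.func.functional_call` in recent PyTorch versions. Just a rough sketch of the idea (I haven't verified that it fixes the slowdown):

```python
from torch.func import functional_call

# assumes pre_model, A (list of fixed tensors), B (list of trainable tensors)
# and image already exist as in the snippets above
param_names = [name for name, _ in pre_model.named_parameters()]

# build the effective weights W = A + B for this forward pass only
overrides = {name: A[i].detach() + B[i] for i, name in enumerate(param_names)}

# run pre_model's forward with the overridden weights; gradients flow into B,
# and pre_model's own parameters are never modified in place
out = functional_call(pre_model, overrides, (image,))
```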

I don’t know how exactly your model is used and where the computation graph might be stored, as I cannot reproduce any increase in memory usage using:

```python
import torch
import torch.nn as nn
from torchvision import models

device = 'cuda'
pre_model = models.resnet18().to(device)
b = []
A = []
for params in list(pre_model.parameters()):
    A.append(torch.rand_like(params))
    b_temp = nn.Parameter(torch.rand_like(params))
    b.append(b_temp.detach().clone())
B = nn.ParameterList(b)

modelwithAB = Net(pre_model, B)
# optimizer was not shown in the original snippet; assuming e.g.
optimizer = torch.optim.SGD(modelwithAB.parameters(), lr=1e-3)

image = torch.randn(2, 3, 224, 224).to(device)
print(torch.cuda.memory_allocated()/1024**2)

for _ in range(10):
    out = modelwithAB(image, A)
    out.mean().backward()
    optimizer.step()
    print(torch.cuda.memory_allocated()/1024**2)
```

The `print` statements show approximately constant memory usage, which looks correct.

Thank you, @ptrblck, for trying to reproduce this error.

I have modified your code by adding `tqdm` and logging the time spent in `backward()` and `step()`. Even though the memory usage stays the same, you can see that the iterations keep slowing down:

```python
import time

from tqdm import tqdm

for i in tqdm(range(300)):  # longer run, added tqdm
    out = modelwithAB(image, A)
    start = time.time()  # for logging the backpropagation's duration
    out.mean().backward()
    optimizer.step()
    if i % 40 == 0:
        print("-", torch.cuda.memory_allocated()/1024**2, "-", time.time()-start)
```

```
  1%|          | 3/300 [00:00<00:10, 28.02it/s] - 1457.7119140625 - 0.02620530128479004
 14%|█▍        | 42/300 [00:02<00:18, 14.10it/s] - 1365.5771484375 - 0.06569838523864746
 27%|██▋       | 82/300 [00:06<00:29,  7.47it/s] - 1365.5693359375 - 0.12723588943481445
 41%|████      | 122/300 [00:13<00:33,  5.37it/s] - 1365.5693359375 - 0.17061519622802734
 54%|█████▎    | 161/300 [00:21<00:33,  4.10it/s] - 1365.5693359375 - 0.23227190971374512
 67%|██████▋   | 201/300 [00:32<00:30,  3.30it/s] - 1365.5693359375 - 0.29410719871520996
 80%|████████  | 241/300 [00:46<00:20,  2.82it/s] - 1365.5693359375 - 0.3430461883544922
 94%|█████████▎| 281/300 [01:02<00:07,  2.41it/s] - 1365.5693359375 - 0.40257716178894043
100%|██████████| 300/300 [01:10<00:00,  4.26it/s]
```

I have tried removing `tqdm`, but training still keeps slowing down.

Hi @ptrblck, I am sorry to keep mentioning you, but could you help me check the code above? I am still stuck on this problem and don't really know how to track/debug it further.
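
The only check I could think of so far is to count the nodes of the autograd graph after each forward pass, to see whether the graph really keeps growing across iterations. This is just a rough helper I sketched (I'm not sure it is the right way to debug this):

```python
def count_graph_nodes(tensor):
    """Walk the autograd graph starting from `tensor` and count its nodes."""
    seen, stack = set(), [tensor.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        stack.extend(next_fn for next_fn, _ in fn.next_functions)
    return len(seen)

loss = modelwithAB(image, A).mean()
# if this number keeps increasing across iterations, the old graphs are being
# carried over into the new one
print(count_graph_nodes(loss))
```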