Training two models sequentially with different batch sizes

Hello, I have two models: Model A, which is very big, so I can only fit a batch size of 4 with it, and Model B, which is very small but requires an input of 64x512. Since Model A can only support a batch size of 4, it can only produce a tensor of dimension 2x512 per iteration. I have the following code:

optimizer = torch.optim.Adadelta(list(ModelB.parameters()) + list(ModelA.parameters()), lr=0.01, eps=1e-8)
input_for_B = []
for x in range(32):
    output_A = ModelA(input_for_A)
    input_for_B = [output_A]
input_for_B = torch.cat(input_for_B, dim=0)  # reshape to 64x512 tensor
output_B = ModelB(input_for_B)
loss = criterion(output_B, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

I am trying to propagate the same loss through both ModelA and ModelB. Is this implementation correct? I am not sure whether PyTorch will keep the gradients for every mini-batch that was forward passed through ModelA, or only for the last one.

Hi,
Is this code complete?
Because according to this, you are just repeating (looping over) the following two statements -

output_A = ModelA(input_for_A)
input_for_B = [output_A]

I am not sure what’s happening here, but according to this code there will be just one update to the parameters, since you haven’t put the code that updates the params inside any loop.

I’m sorry, it was supposed to be like this:

optimizer = torch.optim.Adadelta(list(ModelB.parameters()) + list(ModelA.parameters()), lr=0.01, eps=1e-8)
input_for_B = []
for x in range(32):
    output_A = ModelA(input_for_A)
    input_for_B.append(output_A)
input_for_B = torch.cat(input_for_B, dim=0)  # reshape to 64x512 tensor
output_B = ModelB(input_for_B)
loss = criterion(output_B, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Also, while doing some research on the topic, I have learned that the computational graph created by PyTorch is what takes up most of the memory. As far as I know, there is no way to keep the computational graph in system RAM instead of GPU VRAM.
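One thing I did find that seems aimed at exactly this problem is gradient checkpointing: instead of keeping all of ModelA's intermediate activations in VRAM for the backward pass, they are recomputed during backward, so only the inputs and outputs of the checkpointed call stay in memory. A rough sketch of how it might slot into my loop (not tested on my actual models; ModelA, ModelB, input_for_A, criterion and targets are from my code above, and use_reentrant=False needs a reasonably recent PyTorch):

import torch
from torch.utils.checkpoint import checkpoint

input_for_B = []
for x in range(32):
    # Recompute ModelA's activations during backward instead of storing them,
    # trading extra compute for a much smaller graph held in GPU memory
    output_A = checkpoint(ModelA, input_for_A, use_reentrant=False)
    input_for_B.append(output_A)
input_for_B = torch.cat(input_for_B, dim=0)  # 64x512
output_B = ModelB(input_for_B)
loss = criterion(output_B, targets)
loss.backward()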

From the looks of it, it should be able to update the params of both ModelA and ModelB once, and yes, the update will be based on all 32 mini-batches, which essentially means all the tensors that are getting appended to the list input_for_B will have their computation graphs attached to them.

As an example to answer this -

See the following code where I am iterating over each mini-batch that was forward passed to ModelA -

import torch

ModelA = torch.nn.Linear(in_features=2, out_features=1)
ModelB = torch.nn.Linear(in_features=1, out_features=1)

optimizer = torch.optim.Adam(list(ModelB.parameters()) + list(ModelA.parameters()), lr=0.5)
input_for_B = []
for i in range(2):
  input_for_A = torch.tensor([i+1, (i+1)*0.1])
  output_A = ModelA(input_for_A)
  input_for_B.append(output_A)

# iterate over each mini-batch in input_for_B

for b in input_for_B:
  output_B = ModelB(b)
  loss = torch.pow((output_B-100.0), 2)
  optimizer.zero_grad()
  loss.backward()

  print('model A')
  for params in (ModelA.parameters()):
    print(f"parameter's grad = {params.grad} and {params}")
  
  print('model B')
  for params in (ModelB.parameters()):
    print(f"parameter's grad = {params.grad} and {params}")
    
  print('\n')
  optimizer.step()
  print('parameters after updation')
  print('model A')
  for params in (ModelA.parameters()):
    print(params)
  print('model B')
  for params in (ModelB.parameters()):
    print(params)

Output -

model A
parameter's grad = tensor([[-25.7507,  -2.5751]]) and Parameter containing:
tensor([[ 0.1257, -0.3381]], requires_grad=True)
parameter's grad = tensor([-25.7507]) and Parameter containing:
tensor([-0.2395], requires_grad=True)
model B
parameter's grad = tensor([[29.7235]]) and Parameter containing:
tensor([[0.1278]], requires_grad=True)
parameter's grad = tensor([-201.4254]) and Parameter containing:
tensor([-0.6939], requires_grad=True)


parameters after updation
model A
Parameter containing:
tensor([[0.6257, 0.1619]], requires_grad=True)
Parameter containing:
tensor([0.2605], requires_grad=True)
model B
Parameter containing:
tensor([[-0.3722]], requires_grad=True)
Parameter containing:
tensor([-0.1939], requires_grad=True)
model A
parameter's grad = tensor([[149.1208,  14.9121]]) and Parameter containing:
tensor([[0.6257, 0.1619]], requires_grad=True)
parameter's grad = tensor([74.5604]) and Parameter containing:
tensor([0.2605], requires_grad=True)
model B
parameter's grad = tensor([[11.1456]]) and Parameter containing:
tensor([[-0.3722]], requires_grad=True)
parameter's grad = tensor([-200.3463]) and Parameter containing:
tensor([-0.1939], requires_grad=True)


parameters after updation
model A
Parameter containing:
tensor([[ 0.3161, -0.1478]], requires_grad=True)
Parameter containing:
tensor([0.0181], requires_grad=True)
model B
Parameter containing:
tensor([[-0.8165]], requires_grad=True)
Parameter containing:
tensor([0.3061], requires_grad=True)
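And just to tie this back to your exact pattern (append all 32 outputs, concatenate once, call backward once): a quick check along these lines should show that ModelA gets gradients through the concatenated tensor, i.e. through every appended chunk and not just the last one. The shapes and the dummy loss here are made up, it is only a sketch:

import torch

ModelA = torch.nn.Linear(in_features=512, out_features=512)
ModelB = torch.nn.Linear(in_features=512, out_features=1)

input_for_B = []
for x in range(32):
  input_for_A = torch.randn(2, 512)  # stand-in for your real ModelA input
  output_A = ModelA(input_for_A)     # 2x512 chunk, computation graph attached
  input_for_B.append(output_A)

input_for_B = torch.cat(input_for_B, dim=0)  # 64x512
output_B = ModelB(input_for_B)
loss = output_B.pow(2).mean()  # dummy loss in place of criterion(output_B, targets)
loss.backward()

# cat's backward routes a slice of the gradient to each appended chunk,
# and autograd sums all of those contributions into ModelA's parameter grads
for p in ModelA.parameters():
  print(p.grad.shape, p.grad.abs().sum())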