I have a question about PyTorch multiprocessing - it seems to expect that all model parameters are updated every epoch. When I let my processes update all parameters every epoch, things work fine. But I need to selectively update only parts of my model, for two reasons:
I am training an RL solution where I need to update my value model every epoch but my policy model only every 3 epochs or so.
I have millions of user embeddings to train, so updating all parameters every epoch is very time consuming - I only want to update the embeddings I am currently training on.
When I selectively update only certain tensors, I get the following error:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 28690396 vs 10575872
Is there a way to selectively update certain tensors at certain epochs? Is this even possible with torch multiprocessing, or should I look at other options? Thanks in advance for any help!
Multiprocessing does not affect your training, it's just responsible for starting up multiple processes. Your error message mentions Gloo, so you are using some distributed API - is it just send/recv?
Selectively updating seems doable. To freeze parameter weights you just have to set param.requires_grad = False. Accumulating gradients across epochs would also work, by calling torch.autograd.backward(..., retain_graph=True).
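Something like this toggle is what I had in mind - a minimal single-process sketch (toy models, made-up names, the distributed part left out):

```python
import torch

num_epochs = 9  # placeholder

# toy stand-ins for the real value / policy networks
value_model = torch.nn.Linear(10, 1)
policy_model = torch.nn.Linear(10, 4)
optimizer = torch.optim.SGD(
    list(value_model.parameters()) + list(policy_model.parameters()), lr=1e-3
)

for epoch in range(num_epochs):
    update_policy = epoch > 0 and epoch % 3 == 0

    # freeze / unfreeze the policy weights for this epoch
    for p in policy_model.parameters():
        p.requires_grad_(update_policy)

    x = torch.randn(32, 10)
    loss = value_model(x).mean()
    if update_policy:
        loss = loss + policy_model(x).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # parameters whose grad is None are simply skipped
```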
Do you have an example script to repro? I’m not sure what distributed APIs you are using or what your model looks like.
Yeah, I am using PyTorch Distributed and started with Gloo (I will need to use GPUs and was going to switch to NCCL once I get this code working). I can't paste my whole code, but I can share the problematic snippets. I am training both a value and a policy model on multiple GPUs - I update the value model every iteration but the policy model only every 3 iterations. It hangs on the first iteration; I am not sure why, but it looks like it is waiting for the policy gradients to be sent:
for epoch in range(num_epochs):
    ...
    value_loss.backward()
    if epoch > 0 and epoch % 3 == 0:
        policy_loss.backward()
The second issue is that I have 65M user embeddings to train and they do not fit on the GPU. So I only move the embeddings I am training on onto the GPU for that batch. This doesn't work either, because the model waits for the gradients of all the other embeddings to be sent to the main GPU.
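To make the second part concrete, this is roughly how I handle the embeddings per batch in a single process (heavily simplified, names and sizes are illustrative, and the distributed wrapping is left out):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

NUM_USERS = 1_000_000  # 65M in my real setup, shrunk here so the snippet is cheap to run

# the full user table stays on the CPU because it does not fit on the GPU;
# sparse=True so backward only produces gradients for the rows actually looked up
user_embeddings = nn.Embedding(NUM_USERS, 64, sparse=True)
head = nn.Linear(64, 1).to(device)  # stand-in for the rest of the model

emb_optimizer = torch.optim.SparseAdam(user_embeddings.parameters(), lr=1e-3)
head_optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def train_batch(user_ids, targets):
    # look the batch rows up on the CPU, then move only those activations to the GPU
    batch_emb = user_embeddings(user_ids)
    preds = head(batch_emb.to(device)).squeeze(-1)
    loss = nn.functional.mse_loss(preds, targets.to(device))

    emb_optimizer.zero_grad()
    head_optimizer.zero_grad()
    loss.backward()  # only this batch's embedding rows receive gradients
    emb_optimizer.step()
    head_optimizer.step()
    return loss.item()

train_batch(torch.randint(0, NUM_USERS, (256,)), torch.rand(256))
```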