I am trying to train on multiple GPUs across multiple nodes using DistributedDataParallel (DDP). I have two models, A and B, where model B takes the output of model A as its input. Model A is already trained (on a single GPU) and I don't want to train it further. I want to train model B on the outputs of model A with a larger batch size, so I believe I need to set up both models with DDP. However, DDP raises an error (https://github.com/pytorch/pytorch/issues/25550) when a module has no parameters that require gradients, which is the case for model A once it is frozen. How should I implement this properly? Thanks in advance.
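
For reference, here is a minimal sketch of one arrangement I am considering: keep the frozen model A as a plain module on every process and wrap only model B in DDP, so that DDP never sees a module without trainable parameters. The `nn.Linear` layers, tensor shapes, `build_models`, and `train_step` are hypothetical stand-ins for my real models and training loop, and the `gloo` backend with single-process defaults is assumed only so the sketch runs on CPU without a launcher. I have not confirmed that this is the recommended approach.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def build_models():
    # Hypothetical stand-ins for the real models A and B.
    model_a = nn.Linear(8, 4)
    model_b = nn.Linear(4, 2)
    # Freeze A: it is already trained and must not be updated.
    for p in model_a.parameters():
        p.requires_grad_(False)
    model_a.eval()
    return model_a, model_b


def train_step(model_a, model_b, optimizer, x, y):
    # A only does inference, so no autograd graph is built for it.
    with torch.no_grad():
        features = model_a(x)
    loss = nn.functional.mse_loss(model_b(features), y)
    optimizer.zero_grad()
    loss.backward()  # when model_b is DDP-wrapped, gradients are all-reduced here
    optimizer.step()
    return loss.item()


def main():
    # Assumed defaults so the sketch also runs as a single process;
    # a launcher such as torchrun would normally set these variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        model_a, model_b = build_models()
        ddp_b = DDP(model_b)  # only B is wrapped; A stays a plain module
        optimizer = torch.optim.SGD(ddp_b.parameters(), lr=0.1)
        x, y = torch.randn(16, 8), torch.randn(16, 2)  # dummy batch
        print("loss:", train_step(model_a, ddp_b, optimizer, x, y))
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In this sketch each process holds its own copy of A with identical pretrained weights, and each process would feed a different shard of the data, so the effective batch size for B scales with the number of processes.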