Parameters which did not receive grad for rank

I get the following error when running in distributed mode (it works fine in serial).

 Parameters which did not receive grad for rank 1: model.head.fc_classification_layer.bias, model.head.fc_classification_layer.weight

The error is with the “fc_classification_layer” in this code, which is strange: it’s just a linear layer, and the regression layer (also linear) doesn’t give the same issue.

# Imports assumed for this snippet (CfgNode / ShapeSpec suggest detectron2);
# build_obj_pose_flatten is project-specific and defined elsewhere.
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from detectron2.config import CfgNode
from detectron2.layers import ShapeSpec

class ObjPoseRegressionClassificationHead(nn.Module):
    def __init__(self, cfg: CfgNode, input_shape: ShapeSpec):
        super().__init__()
        self.dropout_rate = cfg.MODEL.OBJECT_POSE.HEAD.DROPOUT_RATE
        self.out_shape = [cfg.MODEL.OBJECT_POSE.HEAD.NUM_OUT_VERTICES, 2]
        # Flatten feature map into a vector
        self.flatten_layer = build_obj_pose_flatten(cfg, input_shape)
        # Dropout
        if self.dropout_rate > 0:
            self.dropout = nn.Dropout(self.dropout_rate)
        else:
            self.dropout = None

        # fully connected regression layer
        self.fc_regression_layer = nn.Linear(
            input_shape.channels, np.prod(self.out_shape)
        )
        # out dimensions
        self._out_feature_channels = np.prod(self.out_shape)
        self._out_feature_strides = input_shape.width

        # fully connected classification layer
        self.fc_classification_layer = nn.Linear(
            input_shape.channels, cfg.MODEL.OBJECT_POSE.HEAD.NUM_CLASSES,
        )

    def forward(self, x):
        # (N, C, 7, 7)
        x = self.flatten_layer(x)
        # (N, C)
        if self.dropout is not None:
            x = self.dropout(x)
        # (N, C)
        x_regression = self.fc_regression_layer(x)
        # (N, 18)
        x_regression = x_regression.view(
            x.shape[0], self.out_shape[0], self.out_shape[1]
        )
        # (N, 9, 2)

        # NUM_CLASSES
        x_classification = self.fc_classification_layer(x)
        x_classification = F.log_softmax(x_classification, dim=1)

        return x_regression, x_classification
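
A quick single-process sanity check shows which parameters never receive gradients, which is essentially the condition DDP is complaining about. Here is a minimal sketch with a toy two-head model standing in for mine (the names are illustrative, not the real code):

import torch
import torch.nn as nn

class TwoHeadToy(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.fc_regression_layer = nn.Linear(8, 2)
        self.fc_classification_layer = nn.Linear(8, 3)

    def forward(self, x):
        x = self.backbone(x)
        return self.fc_regression_layer(x), self.fc_classification_layer(x)

model = TwoHeadToy()
x_regression, x_classification = model(torch.randn(4, 8))
# Only the regression output feeds the loss here, so the classification
# layer never receives a gradient on backward().
loss = x_regression.sum()
loss.backward()
for name, p in model.named_parameters():
    if p.grad is None:
        print("no grad for:", name)  # flags fc_classification_layer.*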

Turns out I was returning a scalar from the loss by calling .item() on the loss tensor. Since .item() returns a plain Python number that is detached from the autograd graph, no gradients ever flowed back to the classification head. Got rid of .item() and it all works. Should’ve paid more attention to that copy-paste.
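
For anyone else who hits this, the difference is easy to see side by side (a minimal sketch; the loss helper names are hypothetical, not my actual code):

import torch
import torch.nn.functional as F

def classification_loss_broken(log_probs, targets):
    # .item() returns a plain Python float detached from the autograd graph,
    # so backward() never reaches fc_classification_layer and DDP reports it
    # as a parameter that did not receive grad.
    return F.nll_loss(log_probs, targets).item()

def classification_loss_fixed(log_probs, targets):
    # Return the tensor itself; nll_loss pairs with the head's log_softmax
    # output and lets gradients flow back through the linear layer.
    return F.nll_loss(log_probs, targets)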