maskrcnn_resnet50_fpn wrapped in DataParallel still uses only a single GPU

I wrap maskrcnn_resnet50_fpn in DataParallel as follows:

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# create mask rcnn model
num_classes = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_ft = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# replace the box and mask predictors so the heads output num_classes classes
in_features = model_ft.roi_heads.box_predictor.cls_score.in_features
model_ft.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
in_features_mask = model_ft.roi_heads.mask_predictor.conv5_mask.in_channels
hidden_layer = 256
model_ft.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, hidden_layer, num_classes)

# wrap in DataParallel when more than one GPU is available, then move to the device
NUM_GPUS = torch.cuda.device_count()
if NUM_GPUS > 1:
    model_ft = torch.nn.DataParallel(model_ft)
model_ft.to(device)
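
For reference, the standard call pattern for a torchvision detection model during training looks roughly like the sketch below (the optimizer and data_loader here are placeholders, not my exact loop); with DataParallel, any split across GPUs can only happen inside the model_ft(images, targets) forward call.

# minimal training-step sketch (placeholder names: optimizer, data_loader)
def train_one_epoch(model_ft, optimizer, data_loader, device):
    model_ft.train()
    for images, targets in data_loader:
        # torchvision detection models take a list of image tensors
        # and a list of target dicts ("boxes", "labels", "masks")
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # in train mode the model returns a dict of losses
        loss_dict = model_ft(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()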

During training I see that only one GPU is busy while the other GPUs stay idle and empty.
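
As a sanity check, something like the following confirms that PyTorch sees all GPUs and that the DataParallel wrapper was actually applied (a minimal sketch, run after the setup code above):

# sanity check: confirm GPU visibility and that the wrapper is in place
import torch
print(torch.cuda.device_count())   # should match the number of installed GPUs
print(type(model_ft))              # should report DataParallel, not MaskRCNN
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.memory_allocated(i))   # memory allocated on each GPU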