Batch size in DataParallel

The DataParallel docs say:

The batch size should be larger than the number of GPUs used.

When DataParallel is used together with a DataLoader, should we always set drop_last=True so that the last batch is never smaller than the number of GPUs? In some cases a batch size smaller than the number of GPUs works fine, but in others it does not. Below is an example.

import torch
import torch.nn as nn
from torch.nn import DataParallel


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        # 1x1 convolution: 1024 input channels -> 19 output channels
        self.conv = nn.Conv2d(in_channels=1024, out_channels=19, kernel_size=1)

    def forward(self, *input, **kwargs):
        # Only the first positional argument is used; kwargs are ignored
        return self.conv(input[0])


my_model = SimpleModel().cuda()
my_model = DataParallel(my_model, device_ids=[0, 1])  # use 2 GPUs
input_val = torch.ones(1, 1024, 16, 16).cuda()  # batch size is 1, smaller than the 2 GPUs
output = my_model(input_val)  # this works fine
output = my_model(input_val, arbitrary_arg='test')  # this raises an error in forward() on the 2nd GPU

My questions: where in the DataParallel implementation can I find a clue about this behavior?
How should we avoid it, or should we just set drop_last=True?
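
As far as I can tell, the place to look is the scatter logic in torch/nn/parallel/scatter_gather.py (scatter_kwargs). My reading is that tensors are split along the batch dimension, non-tensor objects are copied once per device, and the shorter of the two resulting lists is padded with empty entries. Below is a toy pure-Python model of that reading, not the library code: a Python list stands in for a batched tensor, and the function names and the "extra" keyword are only illustrative.

```python
import math


def scatter(obj, num_devices):
    """Toy scatter: a list plays the role of a tensor with a batch dimension."""
    if isinstance(obj, list):
        # "Tensor": split into at most num_devices chunks along the batch.
        chunk = max(1, math.ceil(len(obj) / num_devices))
        return [obj[i:i + chunk] for i in range(0, len(obj), chunk)]
    if isinstance(obj, tuple) and obj:
        return [tuple(p) for p in zip(*(scatter(o, num_devices) for o in obj))]
    if isinstance(obj, dict) and obj:
        parts = zip(*(scatter(v, num_devices) for v in obj.values()))
        return [dict(zip(obj.keys(), p)) for p in parts]
    # Anything else (str, int, ...) is copied once per device.
    return [obj] * num_devices


def scatter_kwargs(inputs, kwargs, num_devices):
    """Scatter positional and keyword arguments, then pad the shorter list."""
    inputs = scatter(inputs, num_devices) if inputs else []
    kwargs = scatter(kwargs, num_devices) if kwargs else []
    while len(inputs) < len(kwargs):
        inputs.append(())   # replica gets no positional arguments
    while len(kwargs) < len(inputs):
        kwargs.append({})   # replica gets no keyword arguments
    return inputs, kwargs


batch_of_one = ["sample0"]  # batch size 1, two devices

# Tensor only: one chunk, so only one replica is called.
print(scatter_kwargs((batch_of_one,), {}, 2))

# Tensor plus a string kwarg: the string is copied to both devices, so the
# second replica is called with an empty positional tuple.
print(scatter_kwargs((batch_of_one,), {"extra": "test"}, 2))
```

Under this model, the second replica receives no positional arguments, so input[0] in forward() fails there. That would explain why the call with only the tensor works while the call with an extra non-tensor keyword argument does not.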

drop_last=True drops the final incomplete batch, so every batch has the same size. I suppose there is no harm in using it.
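
To make the effect concrete, here is a small sketch of the batch sizes a loader would yield with and without drop_last. batch_sizes is a hypothetical helper written for illustration, not part of PyTorch; it only mirrors the documented behavior of DataLoader(dataset, batch_size=..., drop_last=...) without shuffling.

```python
def batch_sizes(num_samples, batch_size, drop_last):
    """Sizes of the batches produced for one pass over the dataset."""
    full, rest = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_last:
        # The leftover samples form a final, smaller batch.
        sizes.append(rest)
    return sizes


# 9 samples, batch size 4, 2 GPUs.
print(batch_sizes(9, 4, drop_last=False))  # last batch holds a single sample
print(batch_sizes(9, 4, drop_last=True))   # every batch is full
```

With drop_last=False the final batch here holds one sample, which is fewer than the 2 GPUs. With drop_last=True that batch is skipped, at the cost of not seeing those leftover samples in that epoch.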