Net in DataParallel: quantization-aware training model has wrong accuracy after convert

Training and testing the net on a single GPU works fine. When I train the model on multiple GPUs, the training process itself also looks fine, but after I convert the model to a quantized one, its accuracy on CIFAR100 is about 1%, which is basically random.
My convert code is as follows:

    if gpus == 1:
        quantized_model = torch.quantization.convert(net.eval().to(device), inplace=False)
    else:
        print("export model from torch DataParallel")
        single_net = deepcopy(net.module)
        #single_net = net
        quantized_model = torch.quantization.convert(single_net.eval().to(device), inplace=False)
        #quantized_model = quantized_model.module
    quantized_model.eval()

If the GPU count is more than one, I wrap the model with net = torch.nn.DataParallel(net), so when I need to convert it I first do single_net = deepcopy(net.module). But the accuracy of quantized_model is 1%, while the training accuracy is about 77%. Meanwhile, it gives this warning:

/home/autolab/anaconda3/lib/python3.7/site-packages/torch/quantization/observer.py:208: UserWarning: Must run observer before calling calculate_qparams. Returning default scale and zero point.
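
For context, the model is set up and wrapped roughly like this before training (a simplified sketch: net, gpus and device are the same variables as in my script, and the exact qconfig / prepare_qat calls may differ):

    # simplified sketch of my setup; exact qconfig / prepare_qat details may differ
    import torch

    net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    torch.quantization.prepare_qat(net.train(), inplace=True)

    if gpus > 1:
        net = torch.nn.DataParallel(net)   # multi-GPU training wrapper
    net = net.to(device)
    # ... QAT training loop runs here, then the convert code above ...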

Does anyone have any ideas?

There are currently some issues with nn.DataParallel and Quantization Aware Training. There is a WIP PR to fix it - https://github.com/pytorch/pytorch/pull/37032
You can follow the toy example here to make sure you’re following the steps for QAT correctly: https://gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40
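
Roughly, the eager-mode QAT flow looks like the sketch below (illustrative only; the toy model and hyperparameters here are made up, see the gist for the complete recipe):

    # minimal illustrative sketch of eager-mode quantization-aware training
    import torch
    import torch.nn as nn

    class ToyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.quantization.QuantStub()      # fp32 -> quantized boundary
            self.conv = nn.Conv2d(3, 8, 3)
            self.relu = nn.ReLU()
            self.dequant = torch.quantization.DeQuantStub()  # quantized -> fp32 boundary

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.conv(x))
            return self.dequant(x)

    net = ToyNet().train()
    net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    torch.quantization.prepare_qat(net, inplace=True)        # insert fake-quant + observers

    # (optional) multi-GPU: net = torch.nn.DataParallel(net)

    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
    for _ in range(3):                                       # stand-in for the real training loop
        out = net(torch.randn(4, 3, 32, 32))
        loss = out.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # convert on CPU, after unwrapping DataParallel if it was used
    eval_net = net.module if isinstance(net, torch.nn.DataParallel) else net
    quantized = torch.quantization.convert(eval_net.eval().cpu(), inplace=False)
    print(quantized(torch.randn(1, 3, 32, 32)).shape)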

Thanks, I think I have solved this problem as follows:

    test(net.module, epoch, device, args.gpus)      # evaluate the unwrapped module first
    qat_model = deepcopy(net)
    qat_model.eval().to('cpu')
    torch.quantization.convert(qat_model, inplace=True)
    test(qat_model.module, epoch, 'cpu', args.gpus)  # test the converted model on CPU
    scheduler.step()

I need to run the test on net.module first; after that, the converted qat_model on CPU gives the correct accuracy. My guess is that running the eval pass through net.module updates the observers on the master copy of the model (under DataParallel the replica statistics don't seem to be copied back, which matches the warning above), so the conversion then picks up proper scale and zero-point values.
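
For reference, a quick sanity check on the converted model could look like this (dummy CIFAR-sized input, illustrative only):

    # quick CPU sanity check with a dummy CIFAR-sized batch (illustrative only)
    with torch.no_grad():
        dummy = torch.randn(2, 3, 32, 32)
        out = qat_model.module(dummy)       # converted model, runs on CPU
        print(out.shape, out.argmax(dim=1))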