Post-Training Quantization using test data?

I am setting up a static PTQ workload and trying to understand the best practice for calibrating the quantizers.

I am following the new FX Graph Mode tutorial here. The Calibration section mentions that, in order to initialize the quantizers, some samples “representative of the workload” are used; the text says “for example a sample of the training data set”. On the other hand, the code snippet loads data from the validation set:

```python
calibrate(prepared_model, data_loader_test)
```
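For reference, the calibration step in that tutorial is essentially just running forward passes over whatever loader you hand it, so the observers inserted by `prepare_fx` can record activation statistics; roughly:

```python
import torch

def calibrate(model, data_loader):
    # Forward passes only: the observers inserted by prepare_fx record
    # activation ranges; no weights are updated during calibration.
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)

calibrate(prepared_model, data_loader_test)  # i.e. the test/validation loader
```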

I would have thought that using validation/test data for calibration was a no-no. However, I found this practice of calibrating on validation data reported elsewhere in the PyTorch documentation as well, for example here:

“”"Calibrate

  • This example uses random data for convenience. Use representative (validation) data instead.
    “”"

What is the rationale for setting up the quantizers using information from the validation set? Even under the assumption that only a tiny fraction (a few examples) of the validation set is used for calibration, wouldn’t this contaminate the evaluation? And what about the case where a larger calibration set is used (some PTQ papers report using thousands of examples)?

Hi @afaso, you are right, and we need to update the tutorials to reflect this. If you want a true eval score of a quantized model, you’d have to use a slice of the training dataset (or some other held-out data) for PTQ calibration.
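A minimal sketch of that workflow, assuming the `train_dataset`, `prepared_model`, `calibrate`, `evaluate`, and `data_loader_test` objects from the tutorial setup (the names and the calibration-set size here are illustrative, not prescriptive):

```python
import torch
from torch.ao.quantization.quantize_fx import convert_fx
from torch.utils.data import DataLoader, Subset

# Carve a small calibration split off the *training* set, so the
# validation/test data stays unseen until the final evaluation.
num_calib = 1024  # how many calibration examples to use is a tuning choice
g = torch.Generator().manual_seed(0)  # reproducible subset selection
calib_indices = torch.randperm(len(train_dataset), generator=g)[:num_calib]
calib_loader = DataLoader(
    Subset(train_dataset, calib_indices.tolist()),
    batch_size=32,
    shuffle=False,
)

# Run calibration on the held-out training slice only.
calibrate(prepared_model, calib_loader)

# Convert and evaluate on the untouched test set afterwards.
quantized_model = convert_fx(prepared_model)
evaluate(quantized_model, data_loader_test)
```

The key point is simply that the loader passed to `calibrate` is built from training (or otherwise held-out) data, and `data_loader_test` is only ever used for the final evaluation.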