I read the code and reached the following conclusions. Below, -> denotes a function call:
IntxWeightOnlyConfig(version=2) -> _intx_weight_only_transform -> _intx_weight_only_quantize_tensor ->
IntxUnpackedToInt8Tensor.from_hp
IntxWeightOnlyConfig(version=1) -> _intx_weight_only_transform -> _intx_weight_only_quantize_tensor ->
AffineQuantizedTensor.from_hp_to_intx(
...,
mapping_type = MappingType.SYMMETRIC/...,
target_dtype=torch.intX,
quant_min=_DTYPE_TO_QVALUE_BOUNDS[target_dtype][0],
quant_max=_DTYPE_TO_QVALUE_BOUNDS[target_dtype][1],
)
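To make the symmetric mapping concrete, here is a minimal pure-Python sketch of what a from_hp-style symmetric quantization roughly computes when quant_min/quant_max come from the signed dtype bounds. Names and details are illustrative, not torchao's actual internals:

```python
# Hypothetical sketch of MappingType.SYMMETRIC quantization to a signed
# X-bit range, with bounds analogous to _DTYPE_TO_QVALUE_BOUNDS[torch.intX].
# Illustrative only; not torchao's real implementation.

def symmetric_quantize(weights, num_bits):
    """Quantize a list of floats to signed num_bits integers, symmetrically."""
    quant_min = -(2 ** (num_bits - 1))      # e.g. -8 for int4
    quant_max = 2 ** (num_bits - 1) - 1     # e.g.  7 for int4
    # Symmetric mapping: zero_point is implicitly 0; the scale is chosen so
    # the largest-magnitude weight maps to quant_max.
    scale = max(abs(w) for w in weights) / quant_max
    q = [min(max(round(w / scale), quant_min), quant_max) for w in weights]
    return q, scale

q, scale = symmetric_quantize([0.25, -1.0, 0.75], num_bits=4)
# dequantized value is q[i] * scale
```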
UIntXWeightOnlyConfig(dtype=torch.uintX) -> _uintx_weight_only_transform ->
AffineQuantizedTensor.from_hp_to_intx(
...,
mapping_type = MappingType.ASYMMETRIC,
target_dtype=torch.uintX,
quant_min = None,
quant_max = None,
)
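By contrast, the asymmetric mapping uses both a scale and a zero_point, and when quant_min/quant_max are None they default to the full unsigned range. A hypothetical sketch of the idea (again, not torchao's real code):

```python
# Hypothetical sketch of MappingType.ASYMMETRIC quantization to an unsigned
# X-bit range; quant_min/quant_max of None default to [0, 2**X - 1].
# Illustrative only; not torchao's real implementation.

def asymmetric_quantize(weights, num_bits, quant_min=None, quant_max=None):
    if quant_min is None:
        quant_min = 0
    if quant_max is None:
        quant_max = 2 ** num_bits - 1   # e.g. 15 for uint4
    lo, hi = min(weights), max(weights)
    # The scale stretches [lo, hi] over the integer range; the zero_point
    # shifts it so that lo lands on quant_min.
    scale = (hi - lo) / (quant_max - quant_min)
    zero_point = round(quant_min - lo / scale)
    q = [min(max(round(w / scale) + zero_point, quant_min), quant_max)
         for w in weights]
    return q, scale, zero_point

q, scale, zp = asymmetric_quantize([0.0, 1.0, 3.0], num_bits=4)
```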
Int8WeightOnlyConfig -> _int8_weight_only_transform -> _int8_weight_only_quantize_tensor ->
AffineQuantizedTensor.from_hp_to_intx(
...,
mapping_type = MappingType.SYMMETRIC,
target_dtype = torch.int8,
quant_min = None,
quant_max = None,
)
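Since quant_min/quant_max of None resolve to the full int8 range here, a quantize-dequantize round trip illustrates the resulting precision. This is a hypothetical sketch of the effect, not torchao code:

```python
# Illustrative round trip for symmetric int8 quantization with default
# bounds (None resolves to the full int8 range [-128, 127]).
# Sketch only; not torchao's real implementation.

def int8_round_trip(weights):
    quant_min, quant_max = -128, 127   # default int8 bounds
    scale = max(abs(w) for w in weights) / quant_max
    q = [min(max(round(w / scale), quant_min), quant_max) for w in weights]
    deq = [qi * scale for qi in q]     # dequantize back to float
    return q, deq, scale

q, deq, scale = int8_round_trip([0.1, -0.4, 0.25])
# reconstruction error of unclamped values is bounded by scale / 2
```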
Int4WeightOnlyConfig(version=2) -> _int4_weight_only_transform -> _int4_weight_only_quantize_tensor ->
Int4Tensor.from_hp/Int4PlainInt32Tensor.from_hp/...
Int4WeightOnlyConfig(version=1) -> _int4_weight_only_transform -> _int4_weight_only_quantize_tensor ->
AffineQuantizedTensor.from_hp_to_intx(
...,
mapping_type = MappingType.ASYMMETRIC,
target_dtype = torch.int32,
quant_min = 0,
quant_max = 15,
)
When version=1, Int4WeightOnlyConfig/Int8WeightOnlyConfig/UIntXWeightOnlyConfig/IntxWeightOnlyConfig all reduce to similar calls to AffineQuantizedTensor.from_hp_to_intx().
NOTE: According to "Migrating from AffineQuantizedTensor + Layouts to new structure of tensor subclasses" (pytorch/ao Issue #2752 on GitHub), version=1 will be deprecated.
version=2 will remove the Layout class, which is a wrapper around preprocess/postprocess/…
i.e., version=1 uses AffineQuantizedTensor.from_hp_to_intx/from_hp_to_floatx/… together with different AffineQuantizedTensor.TensorImpl.layout implementations to preprocess/postprocess the Tensor, while version=2 uses different Tensor subclasses (IntxUnpackedToInt8Tensor.from_hp/Int4Tensor.from_hp/…) without any Layout or TensorImpl.
The version=2 code for the many Tensor subclasses is in torchao/quantization/quantize_/workflows/{int4,intx,float8}/_tensor.py.
The version=1 code for AffineQuantizedTensor and the many layouts is in torchao/dtypes/.py and torchao/dtypes/{uintx,floatx}/layout.py, and will be rewritten into torchao/quantization/quantize_/workflows/{uintx,floatx}/_tensor.py.
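The design difference between the two versions can be sketched with toy classes. Only the shape of the design is taken from the issue; all names below are made up for illustration:

```python
# Toy sketch contrasting the two designs; names are hypothetical.

# version=1 style: one generic quantized-tensor class whose packing behavior
# is delegated to a Layout object via preprocess/postprocess hooks.
class DemoLayout:
    def preprocess(self, data):
        return list(reversed(data))   # stand-in for a real packing transform
    def postprocess(self, data):
        return list(reversed(data))

class DemoAffineQuantizedTensor:
    def __init__(self, data, layout):
        self.layout = layout
        self.packed = layout.preprocess(data)
    def unpack(self):
        return self.layout.postprocess(self.packed)

# version=2 style: one dedicated tensor subclass per format; the packing
# logic lives directly in the class, with no Layout/TensorImpl indirection.
class DemoInt4Tensor:
    def __init__(self, data):
        self.packed = list(reversed(data))   # same transform, inlined
    def unpack(self):
        return list(reversed(self.packed))

v1 = DemoAffineQuantizedTensor([1, 2, 3], DemoLayout())
v2 = DemoInt4Tensor([1, 2, 3])
assert v1.unpack() == v2.unpack() == [1, 2, 3]
```

The trade-off: version=1 shares one tensor class across formats at the cost of indirection, while version=2 pays some per-format duplication for much simpler, flatter classes.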
Now, for version=2, using IntxWeightOnlyConfig is enough; for x=4, Int4WeightOnlyConfig can apply storage-specific optimizations (packing 2 int4 values into one int8, 8 int4 values into one int32, or …).
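The kind of storage optimization meant here can be sketched in pure Python: two unsigned 4-bit values share one byte. This is only an illustration of the idea, not torchao's actual packing layout:

```python
# Illustrative int4 packing: two uint4 values (0..15) per byte, halving
# storage versus one byte per value. Not torchao's actual layout.

def pack_int4_pairs(values):
    """Pack a list of uint4 values into bytes, two per byte (hi nibble first)."""
    assert all(0 <= v <= 15 for v in values) and len(values) % 2 == 0
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))

def unpack_int4_pairs(packed):
    """Recover the original uint4 values from the packed bytes."""
    out = []
    for b in packed:
        out.extend(((b >> 4) & 0xF, b & 0xF))
    return out

packed = pack_int4_pairs([1, 15, 0, 7])
# len(packed) == 2: four int4 values stored in two bytes
```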
Int8WeightOnlyConfig will be rewritten for version=2 and will likely apply a similar optimization (packing 4 int8 values into one int32, I guess?).
UIntXWeightOnlyConfig will be moved to prototype, which perhaps means it will eventually disappear.