Difference between IntxWeightOnlyConfig / UIntXWeightOnlyConfig / Int8WeightOnlyConfig / Int4WeightOnlyConfig

In torchao, there are many configs with similar names, but there is no detailed document explaining their differences. Welcome to the torchao Documentation — torchao 0.13 documentation

According to my understanding, UIntXWeightOnlyConfig, Int8WeightOnlyConfig, and Int4WeightOnlyConfig are all IntxWeightOnlyConfig with some specific settings. Is that right? Roughly:

UIntXWeightOnlyConfig(dtype = torch.uintX) == IntxWeightOnlyConfig(
  weight_dtype = torch.uintX, # torch.uint1 - torch.uint8
  mapping_type = MappingType.ASYMMETRIC
)
Int8WeightOnlyConfig(group_size = 32) == IntxWeightOnlyConfig(
  weight_dtype = torch.int8,
  granularity = PerGroup(group_size = 32)
)
Int4WeightOnlyConfig(group_size = 32) == IntxWeightOnlyConfig(
  weight_dtype = torch.int4,
  granularity = PerGroup(group_size = 32)
)
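As a concrete illustration of what granularity = PerGroup(group_size = 32) means, here is a minimal pure-Python sketch (my own toy code, not torchao internals) that computes one symmetric scale per group of 32 weights:

```python
# Toy sketch of per-group symmetric scales (NOT torchao internals):
# granularity=PerGroup(32) means each run of 32 weights along the
# quantized axis gets its own scale, computed from that group's max |w|.

def per_group_scales(weights, group_size=32, quant_max=127):
    """One symmetric scale per group of `group_size` weights."""
    assert len(weights) % group_size == 0
    scales = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scales.append(max_abs / quant_max if max_abs else 1.0)
    return scales

row = [0.01 * i for i in range(64)]   # 64 weights -> 2 groups of 32
scales = per_group_scales(row, group_size=32)
print(len(scales))                    # one scale per group -> 2
```

A smaller group_size tracks the local weight range more tightly (better accuracy) at the cost of storing more scales.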

I read the code and came to the following conclusions (here -> means a function call):

IntxWeightOnlyConfig(version=2) -> _intx_weight_only_transform -> _intx_weight_only_quantize_tensor ->
IntxUnpackedToInt8Tensor.from_hp

IntxWeightOnlyConfig(version=1) -> _intx_weight_only_transform -> _intx_weight_only_quantize_tensor ->
AffineQuantizedTensor.from_hp_to_intx(
  ...,
  mapping_type = MappingType.SYMMETRIC/...,
  target_dtype=torch.intX,
  quant_min=_DTYPE_TO_QVALUE_BOUNDS[target_dtype][0],
  quant_max=_DTYPE_TO_QVALUE_BOUNDS[target_dtype][1],
)
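If I read it correctly, _DTYPE_TO_QVALUE_BOUNDS just stores the representable range of each target dtype; for a signed b-bit integer that should be [-(2^(b-1)), 2^(b-1) - 1]. A quick re-derivation (this helper is mine, not torchao's table):

```python
# Hypothetical re-derivation of the qvalue bounds torchao looks up in
# _DTYPE_TO_QVALUE_BOUNDS (this helper is mine, not torchao's):

def signed_qvalue_bounds(bits):
    """Representable range of a signed `bits`-bit integer (two's complement)."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

print(signed_qvalue_bounds(4))   # (-8, 7)     -> torch.int4
print(signed_qvalue_bounds(8))   # (-128, 127) -> torch.int8
```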

UIntXWeightOnlyConfig(dtype=torch.uintX) -> _uintx_weight_only_transform ->
AffineQuantizedTensor.from_hp_to_intx(
  ...,
  mapping_type = MappingType.ASYMMETRIC,
  target_dtype=torch.uintX,
  quant_min = None,
  quant_max = None,
)

Int8WeightOnlyConfig -> _int8_weight_only_transform -> _int8_weight_only_quantize_tensor ->
AffineQuantizedTensor.from_hp_to_intx(
  ...,
  mapping_type = MappingType.SYMMETRIC,
  target_dtype = torch.int8,
  quant_min = None,
  quant_max = None,
)
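With quant_min/quant_max left as None, this presumably falls back to the full int8 range, and SYMMETRIC means zero_point = 0 with a scale derived from the max absolute weight. A round-trip sketch in plain Python (illustrative only, not torchao code):

```python
# Toy symmetric int8 quantization round trip (NOT torchao's implementation).

def symmetric_quantize(weights, quant_max=127):
    """Symmetric affine quantization: q = round(w / scale), zero_point = 0."""
    scale = max(abs(w) for w in weights) / quant_max
    qs = [max(-quant_max - 1, min(quant_max, round(w / scale)))
          for w in weights]
    return qs, scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

ws = [-1.0, -0.5, 0.0, 0.5, 1.0]
qs, scale = symmetric_quantize(ws)
print(qs)   # the extremes map to -127 and 127, zero stays at 0
```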

Int4WeightOnlyConfig(version=2) -> _int4_weight_only_transform -> _int4_weight_only_quantize_tensor ->
Int4Tensor.from_hp/Int4PlainInt32Tensor.from_hp/...

Int4WeightOnlyConfig(version=1) -> _int4_weight_only_transform -> _int4_weight_only_quantize_tensor ->
AffineQuantizedTensor.from_hp_to_intx(
  ...,
  mapping_type = MappingType.ASYMMETRIC,
  target_dtype = torch.int32,
  quant_min = 0,
  quant_max = 15,
)
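quant_min=0 / quant_max=15 is exactly the unsigned 4-bit range, and ASYMMETRIC means a zero_point shifts the values into that range before they are stored in the wider torch.int32. A toy version of that mapping (my own code, not torchao's):

```python
# Toy asymmetric quantization to the unsigned 4-bit range [0, 15]
# (NOT torchao's implementation).

def asymmetric_quantize(weights, quant_min=0, quant_max=15):
    """Asymmetric affine quantization: q = clamp(round(w/scale) + zero_point)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (quant_max - quant_min)
    zero_point = quant_min - round(lo / scale)   # integer offset for w == 0
    qs = [max(quant_min, min(quant_max, round(w / scale) + zero_point))
          for w in weights]
    return qs, scale, zero_point

ws = [-0.3, 0.0, 0.45]
qs, scale, zp = asymmetric_quantize(ws)
print(qs, zp)   # min maps to 0, max to 15, zero lands on the zero_point
```

Unlike the symmetric case, this spends no codes on values outside [min, max], which helps when the weight distribution is skewed.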

When version=1, Int4WeightOnlyConfig/Int8WeightOnlyConfig/UIntXWeightOnlyConfig/IntxWeightOnlyConfig
are all similar calls to AffineQuantizedTensor.from_hp_to_intx().
NOTE: According to Migrating from AffineQuantizedTensor + Layouts to new structure of tensor subclasses · Issue #2752 · pytorch/ao · GitHub, version=1 will be deprecated.

version=2 removes the Layout class, which is a wrapper of preprocess/postprocess/…
I.e., version=1 uses AffineQuantizedTensor.from_hp_to_intx/from_hp_to_floatx/…
plus different AffineQuantizedTensor.TensorImpl.layout implementations to preprocess/postprocess the tensor; version=2 uses different tensor subclasses
(IntxUnpackedToInt8Tensor.from_hp/Int4Tensor.from_hp/…) without any Layout or TensorImpl.
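To make the two designs concrete, here is a schematic in plain Python (the class names mimic torchao's, but the logic is entirely mine):

```python
# Toy illustration of the two designs (names mimic torchao, logic is mine).

# --- version=1 style: one generic tensor + a pluggable Layout object ---
class PlainLayoutToy:
    def pre_process(self, data):    # padding / reordering would happen here
        return list(data)
    def post_process(self, data):
        return list(data)

class AffineQuantizedTensorToy:
    def __init__(self, data, layout):
        # behavior varies by which Layout is plugged in
        self.data = layout.post_process(layout.pre_process(data))
        self.layout = layout

# --- version=2 style: one dedicated subclass per format, no Layout ---
class Int4TensorToy:
    @classmethod
    def from_hp(cls, data):
        t = cls.__new__(cls)
        t.data = list(data)         # format-specific packing would go here
        return t

v1 = AffineQuantizedTensorToy([1, 2, 3], PlainLayoutToy())
v2 = Int4TensorToy.from_hp([1, 2, 3])
print(v1.data == v2.data)           # same payload, different plumbing
```

The version=2 design trades one class with many pluggable layouts for many small classes, each owning its own packing logic end to end.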

The version=2 code for the various tensor subclasses is in torchao/quantization/quantize_/workflows/{int4,intx,float8}/_tensor.py.
The version=1 code for AffineQuantizedTensor and the various layouts is in torchao/dtypes/*.py and torchao/dtypes/{uintx,floatx}/layout.py,
and will be rewritten to torchao/quantization/quantize_/workflows/{uintx,floatx}/_tensor.py.

Now for version=2, using IntxWeightOnlyConfig is enough; for x=4, Int4WeightOnlyConfig can apply some specific storage optimizations (pack two int4 values into one int8, or pack eight int4 values into one int32, or …).
Int8WeightOnlyConfig will be rewritten in version=2 and apply a similar optimization (pack four int8 values into one int32, I guess?).
UIntXWeightOnlyConfig will be moved to prototype, which perhaps means it will disappear.
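The storage trick for x=4 is plain nibble packing; a toy version below (not torchao's actual packing code, which also handles tiling and hardware-specific orders):

```python
# Toy nibble packing: two unsigned 4-bit values per byte
# (torchao's real packing also handles tiling and device-specific layouts).

def pack_int4_pairs(nibbles):
    """Pack pairs of values in [0, 15] into single bytes."""
    assert len(nibbles) % 2 == 0
    return [(hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2])]

def unpack_int4_pairs(packed):
    out = []
    for byte in packed:
        out.extend([(byte >> 4) & 0xF, byte & 0xF])
    return out

qs = [0, 15, 7, 8]
packed = pack_int4_pairs(qs)            # 4 nibbles -> 2 bytes
assert unpack_int4_pairs(packed) == qs  # lossless round trip
print(packed)                           # [15, 120]
```

This halves the memory for the quantized values; packing eight int4 values into one int32 is the same idea with a wider container.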

yeah, right now it’s not very clear, but we are going to have clearer docs starting from: GitHub - pytorch/ao: PyTorch native quantization and sparsity for training and inference