No int8 4x speedup is observed

How can I verify that my model was properly quantized with int8 instructions? Is there something stored or logged that I can check? I collected a profile and ran the model in OpenCL mode, but did not see a significant performance difference. I use a Pascal 1050 Ti, a compute capability 6.1 device, which should support the efficient int8 vector dot-product instruction (dp4a?). I clearly see the GPU boost over the CPU backend, but almost no speedup from quantization. What is the trick?

One reason could be that different models benefit differently from quantization. If your model's weights aren't huge, then perhaps you were already close to compute bound in the float version.

Another reason could be that we haven't spent much time optimizing our OpenCL kernels; they may not be using the best instructions possible. This is something we would love help improving. You can find the kernels in the .cl files located here.

If your model's weights aren't huge, then perhaps you were already close to compute bound in the float version.

That sounds strange. I have a model with completely trivial weights like {-1, 0, 1} and observe no performance gain after I apply the --load-profile option.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 3 input image channels, 3x3 square convolution kernels
        self.conv1 = nn.Conv2d(3, 8, 3)
        self.conv2 = nn.Conv2d(8, 16, 3)
        self.conv3 = nn.Conv2d(16, 32, 3)
        self.conv4 = nn.Conv2d(32, 64, 3)
        self.conv5 = nn.Conv2d(64, 128, 3)
        self.conv6 = nn.Conv2d(128, 1000, 3)

        # Replace all weights by dummy values
        # (torch.randint(-1, 1, ...) yields -1 or 0)
        entities = [self.conv1, self.conv2, self.conv3, self.conv4, self.conv5, self.conv6]

        # no_grad() avoids a RuntimeError when writing to leaf parameters
        with torch.no_grad():
            for e in entities:
                e.weight[:] = torch.randint(-1, 1, e.weight.size())
                e.bias[:]   = torch.randint(-1, 1, e.bias.size())

    def forward(self, x):
        # Max pooling over a (2, 2) window after each convolution
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv4(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv5(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv6(x)), (2, 2))
        x = x.view(16, 1000)  # batch of 16, 1x1 spatial output with 1000 channels
        return x
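
For reference, a minimal sketch of how such a model could be exported to ONNX for Glow to consume; the batch size, the 224x224 input resolution, and the output file name are assumptions (224x224 is what makes the final feature map collapse to 1x1, matching the view(16, 1000) above).

# Hedged sketch: export the toy Net above to ONNX so Glow can load it.
# The (16, 3, 224, 224) input shape is an assumption; with 224x224 inputs
# the six conv/pool stages reduce the spatial size to 1x1.
model = Net()
model.eval()
dummy_input = torch.randn(16, 3, 224, 224)
torch.onnx.export(model, dummy_input, "net.onnx")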

And here is the collected profile:

---
- nodeOutputName:  'learned_101:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'learned_81:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'learned_41:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'learned_21:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'relu5:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A291:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'learned_9:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'zero5:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'relu4:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A261:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'learned_5:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'zero4:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'learned_11:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'A252:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'data:0'
  scale:           0.00392157
  offset:          -128
- nodeOutputName:  'learned_61:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'learned_3:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'learned_1:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'zero2:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A241:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'learned_7:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'relu2:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A201:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A222:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'A181:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A192:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'A211:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'relu:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A141:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A151:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'save_output:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A301:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A132:0'
  scale:           0.0705882
  offset:          127
- nodeOutputName:  'zero:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A162:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'A271:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A282:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'learned_01:0'
  scale:           0.00392157
  offset:          127
- nodeOutputName:  'A131:0'
  scale:           0.00392157
  offset:          -128
- nodeOutputName:  'output1:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'zero1:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'relu3:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A231:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'relu1:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'A171:0'
  scale:           0.1
  offset:          0
- nodeOutputName:  'zero3:0'
  scale:           0.1
  offset:          0
...
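
For what it's worth, the scale/offset pairs in the profile can be decoded back into float ranges. A minimal sketch, assuming an affine mapping of the form float_value = scale * (int8_value - offset); the helper below is hypothetical:

# Hedged sketch, assuming the affine mapping float = scale * (int8 - offset).
def dequantized_range(scale, offset, qmin=-128, qmax=127):
    """Float range representable by an int8 tensor with this scale/offset."""
    return (scale * (qmin - offset), scale * (qmax - offset))

print(dequantized_range(0.00392157, 127))  # weight entries: roughly (-1.0, 0.0)
print(dequantized_range(0.1, 0))           # activation entries: (-12.8, 12.7)

Note that 0.00392157 is approximately 1/255, so the weight tensors are being mapped onto a range of roughly [-1, 0], which is consistent with the dummy weights above.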

I find it hard to believe that Glow, as an AI compiler, has no debugging tools at all, such as logs or intermediate IR representations that can be inspected and analyzed.

That sounds strange. I have a model with completely trivial weights like {-1, 0, 1} and observe no performance gain after I apply the --load-profile option.

I did not mean the literal values of the weights; I meant the byte size of the weights themselves, e.g. how many bytes the weights of each conv layer take up. One benefit of quantization is that it shrinks the number of bytes the weights take up by 4x.
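
To make that concrete, here is a rough sketch (using the toy Net above) of how one might compare per-layer weight sizes in float32 versus int8; the 4x comes from 4-byte floats becoming 1-byte ints:

# Hedged sketch: per-layer weight byte sizes for the toy Net above.
net = Net()
for name, module in net.named_children():
    n = module.weight.numel() + module.bias.numel()
    print(f"{name}: float32 = {4 * n} bytes, int8 = {n} bytes")
# Almost all of the weight bytes live in conv6 (128 -> 1000 channels,
# 3x3 kernels): about 4.6 MB in float32 vs about 1.15 MB in int8.
# The earlier layers are tiny, so this model spends little time reading weights.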

I find it hard to believe that Glow, as an AI compiler, has no debugging tools at all, such as logs or intermediate IR representations that can be inspected and analyzed.

We have a graph-based high-level IR, and you can dump a dot file of its DAG representation via the command-line option -dump-graph-DAG="file.dot". We can also dump our serialized low-level Instruction IR to stdout via the command-line option -dump-ir.