Bad inference performance on some CPUs

I measured some CPU prediction performance and I got a huge difference in prediction times that I don’t really understand.

I am using this Residual Network with 12 hidden layers for prediction:
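
As a minimal sketch of what that looks like (the convolutional blocks, channel counts, input shape, and class count here are placeholders, not my exact architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One 3x3 convolution with an identity skip connection."""
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(x + self.conv(x))

class ResNet(nn.Module):
    """Residual network with 12 hidden layers (placeholder sizes)."""
    def __init__(self, in_channels=3, channels=64, num_classes=10):
        super(ResNet, self).__init__()
        self.stem = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(12)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.blocks(torch.relu(self.stem(x)))
        x = self.pool(x).view(x.size(0), -1)   # global average pooling
        return self.head(x)
```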


With PyTorch 1.0 (precompiled, no builds from source), a single prediction takes on average:

- 0.022 s on an Intel Core i7-4770K @ 4.2 GHz (Windows 10, no VM), or 0.1 s in an Ubuntu 18.04 VM on the same machine
- 0.038 s on an Intel Xeon X5680 @ 3.33 GHz (Windows Server 2016 Datacenter, no VM)
- 6.85 s on an AMD Opteron 6136 (Ubuntu 18.04 in a VM)

I also got similarly slow times on a Xeon X5355 (Ubuntu 18.04 in a VM).
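
For context, these per-prediction times are averages over repeated forward passes, measured roughly like this (a sketch reusing the `ResNet` placeholder above; the input shape and repeat count are assumptions):

```python
import time
import torch

model = ResNet()                # the placeholder sketch above, standing in for my model
model.eval()
x = torch.randn(1, 3, 64, 64)   # assumed input shape

with torch.no_grad():
    model(x)                    # warm-up pass, excluded from the measurement
    runs = 100
    start = time.time()
    for _ in range(runs):
        model(x)
print("avg prediction time: %.4f s" % ((time.time() - start) / runs))
```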

Now I am trying to figure out the reason for such bad performance in comparison to the Intel i7.
Is it because SSE4.1 or SSE4.2 is not supported on those CPUs?
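
(To check which SIMD instruction sets a CPU advertises on Linux, one can parse /proc/cpuinfo; a minimal sketch:)

```python
# Print which SIMD flags the CPU reports (Linux only).
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()

for isa in ("sse4_1", "sse4_2", "avx", "avx2", "fma"):
    print("%-7s %s" % (isa, "yes" if isa in flags else "no"))
```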

PyTorch uses MKL-DNN (https://github.com/intel/mkl-dnn) for CPU convolutions. It’s optimized for Haswell and newer architectures (circa 2013+). I’ve never tried it on much older processors.
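
You can check whether your build actually picks up MKL (and, on newer releases, MKL-DNN) with something like this sketch; the `hasattr` guards are there because these attributes vary across PyTorch versions:

```python
import torch

# Check whether this PyTorch build can use Intel MKL on this machine.
print("MKL available:", torch.backends.mkl.is_available())

# Newer releases also expose the MKL-DNN backend and the full build config.
if hasattr(torch.backends, "mkldnn"):
    print("MKL-DNN available:", torch.backends.mkldnn.is_available())
if hasattr(torch, "__config__"):
    print(torch.__config__.show())
```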

Two suggestions:

  1. Run your program under `perf top` or `perf record` to see where the time is spent.
  2. Try adjusting `OMP_NUM_THREADS`: set it to 1 or to the number of unused cores (and values in between). Sometimes oversubscription (too many threads) can be a problem. See the sketch after this list.
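
For example, the thread count can be pinned from within Python before the model runs (a sketch; setting the environment variable in the shell before launching the script works just as well):

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"   # must be set before torch initializes OpenMP

import torch
torch.set_num_threads(1)              # also caps PyTorch's intra-op thread pool
```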

Hey @colesbury, thanks for your quick tips. They were very helpful. It makes sense to me that some of those CPUs are simply too old for the optimizations, and OMP_NUM_THREADS=1 allowed me to improve my program's performance significantly. Cheers!