Loss function compute speedup during per-sample inference?

What are some best practices for computing loss functions during inference, e.g. with respect to speed-up on CPU?

Example: we have a trained autoencoder, and at inference time we use its encoder, decoder, and (MSE) loss to compute the reconstruction error for each sample. Executing the code below takes ~5 minutes for 100K samples (500 features), which seems slow, and I’d appreciate any tips on speeding it up.

Questions: are there more optimal operations we could use (e.g. reducing autograd usage / tensor casting), or is it recommended to go straight to parallel computation using Python’s multiprocessing module?
Quantization could also help, but I’d like to see if there are prior considerations first.

import torch
from torch import nn
from torch import autograd

loss_function = nn.MSELoss(reduction='mean')

# encoder and decoder loaded from trained model
encoder_eval, decoder_eval = encoder.cpu().eval(), decoder.cpu().eval()

# X is of type torch.Tensor, shape = (100000, 500)
eval_data_as_autograd = autograd.Variable(X)
reconstruction = decoder_eval(encoder_eval(eval_data_as_autograd))
reconstruction = reconstruction.detach().numpy()
reconstruction_tensor = torch.Tensor(reconstruction)

# compute per sample reconstruction errors
reconstruction_errors = [loss_function(r, x).item() for r, x in zip(reconstruction_tensor, X)]

# len(reconstruction_errors) = 100000 and its type is List[float]

I don’t know exactly what your code is doing, but here are a few things to note:

  • use a GPU (if possible) for a speedup
  • remove the usage of autograd.Variable as it’s deprecated since PyTorch 0.4.0
  • avoid casting to numpy and back to a tensor as you could just detach it (this might not have a huge impact since you are already using the CPU, but seems unnecessary)
  • try to avoid the Python loop in the loss calculation and check whether the loss function can accept the entire batch (see the sketch below)
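
As an illustration of the last two points, here is a minimal sketch of the batched version (assuming the encoder_eval, decoder_eval, and X from your post); the forward pass runs under torch.no_grad() so no autograd graph is built, and the per-sample MSE is computed directly by reducing over the feature dimension:

import torch

# no gradients are needed at inference time, so skip building the graph
with torch.no_grad():
    reconstruction = decoder_eval(encoder_eval(X))
    # element-wise squared error, then mean over the 500 features per sample
    per_sample_mse = ((reconstruction - X) ** 2).mean(dim=1)

reconstruction_errors = per_sample_mse.tolist()  # List[float] of length 100000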

Thank you. Here is one attempt, except that it does not give the per-sample losses that the .item() loop in the original post produces:

# X is of type torch.Tensor, shape = (100000, 500)
reconstruction = decoder_eval(encoder_eval(X))
loss_values = loss_function(reconstruction, X)

# with reduction='mean', loss_values is a single scalar averaged over all 100000 * 500 elements,
# so loss_values.item()  != [loss_function(r, x).item() for r, x in zip(reconstruction_tensor, X)]

Below are two working solutions that are faster (~5x) than the original post. The basic problem is that nn.MSELoss with reduction='mean' collapses the whole batch into a single scalar, and the documentation of the “reduction” parameter was ambiguous (to me) with respect to the original problem statement: per-sample reconstruction error at inference time (not training time).
The first solution uses torch tensors and the second numpy arrays.

# mean over the 500 features gives one reconstruction error per sample, shape (100000,)
mse_tensor = torch.mean(torch.square(X - reconstruction), dim=1)

The above solution is faster than the original post, but appears to be about 12% slower on CPU than the following numpy version:

import numpy as np

X_np = X.numpy()
Y_np = reconstruction.detach().numpy()
# mean over the 500 features gives one reconstruction error per sample
mse_np = np.mean(np.square(X_np - Y_np), axis=1)
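
For reference, the same per-sample errors can also be obtained from nn.MSELoss itself by switching to reduction='none' and reducing over the feature dimension manually; a minimal sketch, assuming the reconstruction tensor from above:

per_element_loss = nn.MSELoss(reduction='none')

# shape (100000, 500): one squared error per element, no reduction applied
elementwise = per_element_loss(reconstruction, X)

# mean over the 500 features gives one reconstruction error per sample
mse_per_sample = elementwise.mean(dim=1)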