torch.nn.functional.linear is slow when PyTorch is built from source

I am building torch from source and seeing slow performance when using torch.nn.functional.linear.

OS: Ubuntu 20.04
commit hash: 4d6314a16e78027832186f5442df888dbabbc159 (9/21/21)

    export USE_CUDA="0"
    python3.9 setup.py bdist_wheel

    import datetime
    import torch
    import torch.nn.functional as F

    t1 = torch.ones((32, 1024, 512))
    t2 = torch.ones((50257, 512))
    e = datetime.datetime.now()
    print(f"Before F.linear: hour: {e.hour}, minute: {e.minute}, second: {e.second}", flush=True)
    logits_parallel = F.linear(t1, t2)
    e = datetime.datetime.now()
    print(f"After F.linear: hour: {e.hour}, minute: {e.minute}, second: {e.second}", flush=True)
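As an aside, subtracting hour/minute/second fields by hand is error-prone (e.g. when a run crosses an hour boundary). A minimal sketch of the same measurement using only the standard library's time.perf_counter, which returns elapsed seconds directly (the workload function here is a stand-in for the F.linear call):

```python
import time

# Stand-in for the F.linear call being timed; any workload works here.
def workload():
    total = 0
    for i in range(100_000):
        total += i * i
    return total

start = time.perf_counter()
workload()
elapsed = time.perf_counter() - start  # elapsed wall-clock seconds as a float
print(f"workload took {elapsed:.3f} s")
```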

The torch built from source takes 15 min 15 s to run this script, while the latest torch from PyPI takes 17 s. Any ideas on what might be the issue?

  • Running htop while running the above script shows that only 1 core is getting exercised
  • I am not building with MKL

Building with MKL brings F.linear(t1, t2) back to the expected speed.
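For anyone hitting the same problem, a quick way to check whether a given build has MKL and how many threads the CPU backend will use (assuming torch imports; the exact output depends on your build):

```python
import torch

# True when the build links against MKL for CPU BLAS ops such as F.linear.
print("MKL available:", torch.backends.mkl.is_available())

# Number of threads used for intra-op parallelism; 1 here would explain
# the single busy core seen in htop.
print("intra-op threads:", torch.get_num_threads())

# Full build configuration, including the BLAS backend and OpenMP settings.
print(torch.__config__.show())
```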

Your loss function is programmatically correct except for the following:

    # the number of tokens is the sum of elements in mask
    num_tokens = int(torch.sum(mask).data[0])

torch.sum returns a 0-dimensional tensor, hence the warning that it can’t be indexed. To fix this, use int(torch.sum(mask).item()) as suggested; int(torch.sum(mask)) will work too.
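A small illustration of the 0-dimensional result (mask here is just an example tensor, not the one from the original code):

```python
import torch

mask = torch.tensor([[1.0, 1.0, 0.0],
                     [1.0, 0.0, 0.0]])

s = torch.sum(mask)
print(s.dim())  # 0 -- a scalar tensor, so s[0] would raise an error

# Both of these extract the Python number correctly:
num_tokens = int(s.item())
also_works = int(s)
print(num_tokens, also_works)  # 3 3
```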

Now, are you trying to emulate the CE loss with your custom loss? If yes, then you are missing the log_softmax.

To fix that, add outputs = torch.nn.functional.log_softmax(outputs, dim=1) before statement 4. Note that in the tutorial you have attached, log_softmax is already applied in the forward call; you can do that too.
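To see why the log_softmax matters: cross-entropy is the negative log-likelihood applied to log-probabilities, so an NLL-style loss only matches F.cross_entropy once the raw logits are passed through log_softmax first. A sketch with random example data (not from the original post):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
outputs = torch.randn(4, 5)           # raw logits: batch of 4, 5 classes
targets = torch.tensor([0, 2, 1, 4])

# NLL on log-probabilities reproduces the built-in cross-entropy.
log_probs = F.log_softmax(outputs, dim=1)
custom = F.nll_loss(log_probs, targets)
builtin = F.cross_entropy(outputs, targets)

print(torch.allclose(custom, builtin))  # True: they match once log_softmax is applied
```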

Hi @Cummins597 - I didn’t specify a loss function in my example. Did you mean to reply to this post?