Specified kernel cache directory could not be created with custom loss function

Andrew_Hollis · November 19, 2022, 7:36am

Hi all! I am very new to PyTorch, and I am trying to implement a custom loss function, but when I evaluate the function on CUDA I get this warning:

UserWarning: Specified kernel cache directory could not be created! This disables kernel caching. Specified directory is /people/holl433/.cache/torch/kernels. This warning will appear only once per process. (Triggered internally at /opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/jit_utils.cpp:860.)
ll_loss = torch.digamma(a_ans) - torch.digamma(a_zero)

Here is a script to reproduce the error:

import torch
from torch import nn
from torch import cuda

class BMLoss(nn.Module):
    def __init__(self, coeff, prior=1.):
        super(BMLoss, self).__init__()
        self.prior = prior
        self.coeff = coeff

    def forward(self, logits, targets):
        '''
        Compute loss: kl - evi

        '''
        alphas = torch.exp(logits)
        betas = torch.ones_like(logits) * self.prior

        # compute log-likelihood loss: psi(alpha_target) - psi(alpha_zero)
        a_ans = torch.gather(alphas, -1, targets.unsqueeze(-1)).squeeze(-1)
        a_zero = torch.sum(alphas, -1)
        ll_loss = torch.digamma(a_ans) - torch.digamma(a_zero)

        # compute kl loss: loss1 + loss2
        #       loss1 = log_gamma(alpha_zero) - \sum_k log_gamma(alpha_k)
        #       loss2 = sum_k (alpha_k - beta_k) (digamma(alpha_k) - digamma(alpha_zero) )
        loss1 = torch.lgamma(a_zero) - torch.sum(torch.lgamma(alphas), -1)

        loss2 = torch.sum(
            (alphas - betas) * (torch.digamma(alphas) - torch.digamma(a_zero.unsqueeze(-1))),
            -1)
        kl_loss = loss1 + loss2

        loss = (self.coeff * kl_loss - ll_loss).mean()

        return loss

device = torch.device("cuda:0") if cuda.is_available() else torch.device("cpu")
bm_loss = BMLoss(coeff=0.01)
bm_loss = bm_loss.to(device)
inputs = torch.randn(4, 6).to(device, dtype=torch.float)
targets = torch.tensor([0, 1, 0, 3]).to(device, dtype=torch.long)
loss = bm_loss(inputs, targets)

I don’t get the warning message if I run the loss function on a CPU; it only appears when I run it with CUDA. Did I miss something in my implementation of the loss function that is causing the warning to pop up? Do I need to do something to make the loss function CUDA compatible? I am using version 1.11.0 of PyTorch, version 11.3.1 of cudatoolkit, and version 11.4 of CUDA.

ptrblck · November 19, 2022, 6:43pm

Could you check if your Python process has write permissions to /people/holl433/.cache/torch/kernels? If not, allow the process to write cache files to this folder or change the path via PYTORCH_KERNEL_CACHE_PATH to a directory which your process has access to.
In any case, the warning is triggered internally, and PyTorch will simply JIT-compile the kernel again on the next execution instead of loading the cached kernel.
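
One way to run the permission check from Python is sketched below. The helper name can_create_and_write is made up for illustration and is not part of PyTorch; it only tries to create the directory (as the warning says PyTorch attempted to) and then asks the OS whether the current process may write into it. Note that it creates the directory as a side effect when that is possible.

```python
import os
from pathlib import Path


def can_create_and_write(path: str) -> bool:
    """Return True if `path` exists or can be created, and is writable.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    target = Path(path).expanduser()
    try:
        # Create the directory (and any missing parents) if it is absent.
        target.mkdir(parents=True, exist_ok=True)
    except OSError:
        # Covers permission errors, read-only file systems, bad parents, etc.
        return False
    # Writing files into a directory needs both write and search permission.
    return os.access(target, os.W_OK | os.X_OK)
```

Calling can_create_and_write("/people/holl433/.cache/torch/kernels") from the same account and machine that runs the training script should return False if the directory really is the problem.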

Andrew_Hollis · November 19, 2022, 7:47pm

Thanks for the speedy reply! Do I set the value of PYTORCH_KERNEL_CACHE_PATH in my script, or do I need to access and change it some other way? Also, am I correct in thinking that whether or not the cached kernel is used does not change the outcome of the learning process and only affects the speed?

ptrblck · November 20, 2022, 3:28am

Sorry for not being clear enough. PYTORCH_KERNEL_CACHE_PATH is an environment variable, and you can set it on Linux via:

export PYTORCH_KERNEL_CACHE_PATH=/your/desired/path

in the terminal where you would run your Python script.
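
If you would rather set it from inside the script, a minimal sketch could look like the following. It assumes that PyTorch reads the variable from the process environment when it first needs the kernel cache, so the conservative choice is to set it before importing torch. The function name set_kernel_cache_path and the directory name my_torch_kernels are placeholders, not anything PyTorch defines.

```python
import os
import tempfile
from pathlib import Path


def set_kernel_cache_path(directory) -> str:
    """Create `directory` if needed and export it as PYTORCH_KERNEL_CACHE_PATH.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    target = Path(directory).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    if not os.access(target, os.W_OK | os.X_OK):
        raise PermissionError(f"cannot write to {target}")
    # Assigning to os.environ also updates the process environment that
    # native code sees.
    os.environ["PYTORCH_KERNEL_CACHE_PATH"] = str(target)
    return str(target)


# Placeholder location; replace it with any directory your process may write to.
preferred = Path(tempfile.gettempdir()) / "my_torch_kernels"
cache_dir = set_kernel_cache_path(preferred)

# Assumption: import torch (and run any CUDA code) only after this point.
# import torch
```

The export command above has the same effect for every script started from that terminal, while this variant only affects the one Python process.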

Yes, you are correct. If the cache isn’t accessible it would mean that the next script execution would need to recompile the kernel using NVRTC (NVIDIA runtime compiler) and would add a small overhead. The actual training workload will not change.

Andrew_Hollis · November 20, 2022, 3:49am

No problem, I am very new. Thanks for the clarification. Setting the environment variable to an accessible directory worked perfectly. Thank you again for the help!