Illegal memory access

I get an error after adding a new loss function.

class Latent_Classifier(nn.Module):
    def __init__(self):
        super(Latent_Classifier, self).__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(128, 750),
            nn.LeakyReLU(0.2),
            nn.Linear(750, 750),
            nn.Linear(750, 1)
        )
        
    def forward(self, latent_z):
        x1 = self.encoder(latent_z)
        print(x1.size())
        _eps = 1e-15
        loss = -(x1 + _eps).log().mean() - (1 - x1 + _eps).log().mean()
        
        return loss

I use this function as follows:

classifier = Latent_Classifier()

f_classifier = classifier(latent_f)
lm_classifier = classifier(latent_l)

loss = 4000 * (f_loss + m_loss) + 30 * (f_classifier + lm_classifier) + 2000 * lm_loss

loss.backward()

In loss.backward() I get this error message:

CUDA error: an illegal memory access was encountered

Before adding the classifier loss, there was no error message.

Is there an error in the Latent_Classifier function?

When I execute it with torch.device("cpu") instead of cuda:0, it works fine.

Could you rerun the script with:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace here, please?
The illegal memory access might have been created by a previous CUDA operation and your loss could be a red herring.
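If changing the launch command is inconvenient, the variable can also be set at the top of the script before anything touches CUDA; a minimal sketch (not specific to your code) would be:

import os

# Set this before any CUDA work happens; putting it at the very top of the
# script is the safest way to make sure the CUDA runtime picks it up.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch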

Here is my traceback message:

Traceback (most recent call last):
  File "train2.py", line 121, in
    loss.backward()
  File "/home/hhhoh/.local/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/hhhoh/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:200 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f130683e1e2 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1e63b08 (0x7f1308b0ab08 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xc282b9 (0x7f13424cf2b9 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xc25f28 (0x7f13424ccf28 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x44 (0x7f13424cf144 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::Tensor::copy_(at::Tensor const&, bool) const + 0x115 (0x7f1342bba095 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x37e647e (0x7f134508d47e in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::Tensor::copy_(at::Tensor const&, bool) const + 0x115 (0x7f1342bba095 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) + 0xb54 (0x7f134270b564 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x128850a (0x7f1342b2f50a in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2e749da (0x7f134471b9da in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x10ea412 (0x7f1342991412 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::Tensor::to(c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x146 (0x7f1342bedf56 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x336a970 (0x7f1344c11970 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x3fd (0x7f1344c173fd in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f1344c18fa1 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f1344c11119 in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f13523b14ba in /home/hhhoh/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0xbd6df (0x7f135350d6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #19: <unknown function> + 0x76db (0x7f13559496db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #20: clone + 0x3f (0x7f1355c82a3f in /lib/x86_64-linux-gnu/libc.so.6)

The illegal memory access was most likely triggered before the copy kernel, so the blocking launch is apparently not working.
Could you post an executable code snippet, which would reproduce this issue?

Do you mean all of the code related to the execution?

No, if possible narrow down the minimal code snippet, which reproduces the error.
I.e. remove all data loading, metric calculation etc., use random inputs and try to isolate the illegal memory access to a few lines.
What's currently hard to debug is that your code apparently runs fine on the CPU and that the blocking launch isn't working properly in your setup.
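For reference, a minimal sketch of the kind of snippet meant here, reusing the Latent_Classifier definition from above and replacing the real data with random inputs (the batch size of 16 and the variable names are made up; whether it errors out with the same illegal memory access or with a plain device-mismatch message will depend on the setup):

import torch
import torch.nn as nn

device = torch.device("cuda:0")

# random stand-ins for the real latents; 128 features to match the first Linear layer
latent_f = torch.randn(16, 128, device=device, requires_grad=True)
latent_l = torch.randn(16, 128, device=device, requires_grad=True)

classifier = Latent_Classifier()  # note: not moved to the GPU, mirroring the code above

loss = classifier(latent_f) + classifier(latent_l)
loss.backward()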

I changed classifier = Latent_Classifier() to classifier = Latent_Classifier().to(device) and the error is gone.
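For completeness, a sketch of the fixed setup under the assumption that device is cuda:0 and that the latents and the other loss terms already live on that device:

device = torch.device("cuda:0")

classifier = Latent_Classifier().to(device)  # module parameters now live on the same device as the latents

f_classifier = classifier(latent_f)
lm_classifier = classifier(latent_l)

loss = 4000 * (f_loss + m_loss) + 30 * (f_classifier + lm_classifier) + 2000 * lm_loss
loss.backward()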