Hello all,
I get an error when loading a network on the GPU. I link libTorch 2.5.1 (built against CUDA 12.4) into my code. It works fine on my local machine (GTX 1070) and on a testing machine (RTX 4070 Ti). However, when I move the code to a computing node with an A100, the solver throws an error while loading the network. The error looks like the following:
terminate called after throwing an instance of 'c10::Error'
what(): _ivalue_ INTERNAL ASSERT FAILED at "XXPath_To_Codes/ThirdParty/libtorchCUDA/include/torch/csrc/jit/api/object.h":38, please report a bug to PyTorch.
Exception raised from _ivalue at XXPath_to_Codes/ThirdParty/libtorchCUDA/include/torch/csrc/jit/api/object.h:38 (most recent call first):
I set up the same environment (CUDA driver) on the computing node and have no idea how to address this issue. Do you have any suggestions?
If I load the network on the CPU instead, there is no problem.
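For reference, the loading code follows the standard TorchScript pattern (a simplified sketch; the file name and error handling here are placeholders, not my exact code):

```cpp
#include <torch/script.h>
#include <iostream>

int main() {
  torch::jit::script::Module module;
  try {
    // Load the serialized TorchScript network from disk,
    // then move its parameters and buffers to the GPU.
    module = torch::jit::load("model.pt");
    module.to(torch::kCUDA);  // works on GTX 1070 / RTX 4070 Ti, fails on the A100 node
  } catch (const c10::Error& e) {
    std::cerr << "Error loading the model:\n" << e.what() << std::endl;
    return -1;
  }
  return 0;
}
```

Replacing `torch::kCUDA` with `torch::kCPU` (or skipping the `to()` call) avoids the error, as mentioned above.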
Note that the A100 is partitioned into 7 MIG instances; I don't know whether that could be the issue.
Thanks for your time.
Best, Weitao.