Libtorch cpu incompatible with libtorch cuda

Robin_Lobel · March 6, 2020, 12:52pm

For training, I use the libtorch CUDA 10.1 package as it’s way faster than training on the CPU.
For deployment, I want to use the libtorch No CUDA (CPU only) package as it’s lighter and I don’t need GPU acceleration.
Official 1.4.0 packages in both cases, Windows.

However I just ran into an incompatibility issue. All the models I’ve trained using the CUDA package, and saved using torch::save(model,path) can’t be loaded with torch::load(model,path) in the CPU-only package. But the opposite is not true:

- Models trained with the CUDA package can be saved and loaded with the CUDA package, but can’t be loaded by the CPU-only package: the software crashes without any explanation when torch::load is called.
- Models trained with the CPU-only package can be saved with it and loaded by both the CPU-only package and the CUDA package.

I’ve uploaded 2 files, one trained with the CPU-only package, one with the GPU package:
https://drive.google.com/open?id=1U0-5pto40Tn9n4EsoW1P56E9PI1dyEKP
Both use the exact same model and training data, saved after only one epoch. They are exactly the same size, and looking at the content in a hex editor, they seem to be structured the same way.

This is quite problematic. Are you aware of this issue? Loading also failed with the nightly build of CPU-only 1.5.

tom · March 6, 2020, 12:57pm

What’s the error message you are getting?
Were the models you serialized on GPU or CPU?

Best regards

Thomas

Robin_Lobel · March 6, 2020, 1:24pm

No error, the software just stops running. No crash dialog.

However, I finally found the solution! When I train a model using CUDA, I should move the model back to the CPU using model.to(kCPU) before saving it, so that the file is compatible with the CPU-only package as well.

So I was able to convert my CUDA-trained models to CPU-trained models using the CUDA package by doing load(model, oldfile); model.to(kCPU); save(model,newfile);

Here’s the hex difference between the CUDA file and the CPU file, towards the end of the file:

Everything else in the file is strictly identical. Only that block differs (notice “cpuq” vs “cuda” at the top right of the green block), and it makes the CPU-only package crash (with no error message) when trying to load the CUDA-saved file.

I’m a little surprised this is happening, as I thought the model file only contained float values, with no affinity to the processing unit?