Can we have a utility that moves an ExportedProgram from GPU to CPU? My use case is that I want to run exp_program = torch.export.export(model, inputs)
on GPU (since it is faster) and later move the exp_program to CPU to preserve the GPU memory for other tasks (TensorRT compilation). Something like exp_program.to("cpu")
would help. If there's another way to do this currently, please share. Thank you!
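To make the request concrete, here is a rough sketch of the workflow I have in mind (the model class and input shapes are placeholders, and the exp_program.to("cpu") call is the part that does not exist today):

import torch
import torch_tensorrt  # Torch-TensorRT, which I use for the compilation step

model = FluxTransformer().to("cuda")                  # placeholder for my actual 12B model
inputs = (torch.randn(1, 4096, 64, device="cuda"),)   # placeholder input

exp_program = torch.export.export(model, inputs)      # noticeably faster with the model on GPU

# The utility I'm asking for -- this does not exist today:
exp_program = exp_program.to("cpu")

# With the exported weights off the GPU, the memory would be free for
# building the TensorRT engine.
trt_module = torch_tensorrt.dynamo.compile(exp_program, inputs=inputs)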
Isn't exporting your model something you only do once, and then you run the exported model many times? If that is the case, maybe it's just better to trace on CPU even if it is slower.
How much slower is exporting on CPU? Taking much longer would be unexpected to me, since you're tracing with real tensors either way.
I have a script which initializes a PyTorch model, exports it, and then converts it to TensorRT. In my case, I have to run this script and export the model multiple times for debugging purposes. This could in theory be split into two scripts, but that's not very convenient.
How much slower depends on the model. In my case of the Flux 12B model, I see a considerable difference. Having a .to() utility on ExportedPrograms to move them to CPU and release the GPU memory would be helpful in general.
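For reference, the closest thing I've been hacking together looks roughly like the sketch below. It assumes ExportedProgram exposes its tensors through state_dict and constants (true in recent PyTorch versions), and it probably doesn't catch every tensor reference the program holds, which is part of why a first-class utility would help:

import torch

def move_exported_program_to_cpu(exp_program):
    # Rough sketch: move the tensors the ExportedProgram holds to CPU in place.
    # This may miss other tensor references inside the program.
    for name, tensor in exp_program.state_dict.items():
        if isinstance(tensor, torch.Tensor):
            exp_program.state_dict[name] = tensor.to("cpu")
    for name, value in exp_program.constants.items():
        if isinstance(value, torch.Tensor):
            exp_program.constants[name] = value.to("cpu")
    torch.cuda.empty_cache()  # release the now-unreferenced cached GPU memory
    return exp_program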
Is there a reason you expect the export process to be faster on the GPU than the CPU?
Export doesn't actually do any compute (it traces your program into a graph with symbolic execution, but doesn't need to actually run any kernels), so I would be surprised if export ran faster when your model is on the GPU.
One reason that adding an ExportedProgram.to() API is a bit sketchy is that we don’t promise to trace your model in a device-generic way. For example, given a function like this:
def f(x):
    y = x.to(device="cuda")
    y.mul_(2)
    return x + 1
This function will end up mutating its input if you pass in a cuda tensor, but it won’t if you pass in a cpu tensor (.to is a “maybe aliasing” operation, since it returns the input directly if no conversion is needed). Export will end up specializing on the aliasing behavior when it traces with the particular input tensors you passed in.
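Concretely, you can see the divergence in eager mode, before export even enters the picture (this snippet needs a CUDA device):

import torch

def f(x):  # same function as above
    y = x.to(device="cuda")
    y.mul_(2)
    return x + 1

# With a CUDA input, .to(device="cuda") is a no-op that returns x itself,
# so the in-place mul_ mutates the caller's tensor.
x_cuda = torch.ones(3, device="cuda")
f(x_cuda)
print(x_cuda)  # tensor([2., 2., 2.], device='cuda:0')

# With a CPU input, .to(device="cuda") allocates a new tensor,
# so the original input is left untouched.
x_cpu = torch.ones(3)
f(x_cpu)
print(x_cpu)  # tensor([1., 1., 1.])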
One thing to call out is that if you want to export a model for GPU but you can't do it from a GPU-enabled machine, you can instantiate your model with cuda FakeTensors and export that.
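For example, something along these lines should get you a cuda ExportedProgram without touching a real GPU (a rough sketch; FakeTensorMode lives in torch._subclasses, and the exact export-with-fake-tensors flow can vary a bit across PyTorch versions):

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

fake_mode = FakeTensorMode()

with fake_mode:
    # Everything created here is a FakeTensor: no real GPU memory is
    # allocated, even though the tensors report device="cuda".
    model = MyModel().to("cuda")                        # MyModel is a placeholder for your model
    example_input = torch.randn(8, 128, device="cuda")

# Export picks up the fake tensors and traces without running real kernels.
exp_program = torch.export.export(model, (example_input,))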