I am using CUDA Graphs to train multiple models. I believe the graph captures specific memory addresses, so if I want to train another network with the same graph, I have to reuse the memory of the original model and optimizer. My question is: how can I create a new model/optimizer in that same memory, or is there a way to reset the model and optimizer without changing their memory locations?
Here is how I created the model and optimizer:
std::shared_ptr<UNet> unet_model = std::make_shared<UNet>(in_chn, out_chn, fea);
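The optimizer is built from the model's parameters in the usual libtorch way, roughly like this (the optimizer type and learning rate here are just placeholders, not my exact settings):

// placeholder optimizer; the actual type and hyperparameters differ
torch::optim::Adam optimizer(unet_model->parameters(), torch::optim::AdamOptions(1e-3));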
I don’t understand what exactly this means. The graph capture will use its own internal memory pool. You can capture as many models as you want, as long as you have enough memory. If previous models are not needed anymore, you can delete them.
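For reference, the general capture/replay pattern in the C++ API looks roughly like the following untested sketch. The tensors and the single matmul are placeholders standing in for a real training step, which would additionally need a capture-safe optimizer:

#include <torch/torch.h>
#include <ATen/cuda/CUDAGraph.h>
#include <c10/cuda/CUDAStream.h>

int main() {
  // Static tensors: the captured graph will keep referencing exactly these addresses.
  auto x = torch::randn({64, 16}, torch::kCUDA);
  auto w = torch::randn({16, 16}, torch::kCUDA);
  auto y = torch::empty({64, 16}, torch::kCUDA);

  // Capture has to run on a non-default stream.
  auto stream = c10::cuda::getStreamFromPool();
  torch::cuda::synchronize();
  c10::cuda::setCurrentCUDAStream(stream);

  // Warm up before capturing (lazy inits, workspace allocations, etc.).
  torch::mm_out(y, x, w);
  torch::cuda::synchronize();

  at::cuda::CUDAGraph graph;
  graph.capture_begin();
  torch::mm_out(y, x, w);   // the work recorded into the graph
  graph.capture_end();

  // Replay reruns the captured kernels on the same memory locations,
  // so fresh inputs have to be copied into x in place before each replay.
  x.copy_(torch::randn({64, 16}, torch::kCUDA));
  graph.replay();
  torch::cuda::synchronize();
  return 0;
}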
Hi @ptrblck, I have a question about reusing a recorded training graph. If I record the training process of model_0 to build a CUDA graph, can I reuse the same graph to train model_1?
From my understanding, the graph records the memory addresses associated with model_0, so during replay it always references those specific addresses. To train model_1, would I need to copy model_1 into the same GPU memory addresses used by model_0 for the graph to work? Is this understanding correct?
No, since your understanding of the captured memory addresses is correct: replaying the graph will run the same operations on the same memory locations, which point to model_0.
A safe option would be to delete model_0, including its capture, and to recapture model_1. However, you could also experiment with in-place copies (via tensor.copy_) to copy all parameters from model_1 into model_0 (I haven’t tried it and don’t know if I’m missing any obvious limitation).
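To make the idea concrete, here is a rough, untested sketch; the helper name is made up and it assumes both models expose their parameters in the same order (buffers and any optimizer state would need the same treatment):

#include <torch/torch.h>

// Copy src's weights into dst's existing parameter storage so that a graph
// captured against dst keeps pointing at valid addresses.
void copy_params_inplace(torch::nn::Module& dst, const torch::nn::Module& src) {
  torch::NoGradGuard no_grad;
  auto dst_params = dst.parameters();
  auto src_params = src.parameters();
  TORCH_CHECK(dst_params.size() == src_params.size(),
              "both models must share the same architecture");
  for (size_t i = 0; i < dst_params.size(); ++i) {
    // copy_ writes into the existing storage, so the GPU addresses do not change.
    dst_params[i].copy_(src_params[i]);
  }
}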
Thanks, I forgot to mention that model_0 and model_1 share the same architecture; the only difference is their weights. Currently, I am training 4096 identical models, each with different weights. Since each model is relatively small (just a few hundred parameters), I believe reusing the graph could significantly improve performance, as capturing a separate graph for each of the 4096 models would add a lot of overhead.
My initial thought was to reuse the captured GPU addresses by creating each new model in that memory location, training it, copying the result elsewhere, and then loading the next model to repeat the process. However, if this approach isn’t feasible, I will explore the idea of copying tensors as you suggested, along the lines of the sketch below.
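For what it’s worth, the loop I have in mind would look roughly like this (untested; it uses the copy_params_inplace helper sketched above, and models, steps_per_model, and graph — the training graph captured once for unet_model — are placeholders; the graph’s static input tensors and the optimizer state would also have to be refreshed in place):

for (size_t i = 0; i < models.size(); ++i) {             // e.g. 4096 small models
  copy_params_inplace(*unet_model, *models[i]);          // weights in; addresses unchanged
  for (int step = 0; step < steps_per_model; ++step) {
    graph.replay();                                      // replays the captured training step
  }
  copy_params_inplace(*models[i], *unet_model);          // trained weights out
}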