I am trying to use DeepSpeed for distillation. I initialize the student model with the DeepSpeed parameters, and I want a separate GPU for the teacher model (the teacher is big enough that I can't fit both student and teacher on one GPU). The issue is that I can't place the teacher model on a GPU DeepSpeed is already using, because DeepSpeed runs its own processes to split the model across those GPUs.

I want to limit DeepSpeed to a set of GPUs by setting CUDA_VISIBLE_DEVICES=0,1,2, for example, and then reset os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' later in the code so the teacher model can be placed on a fresh, empty GPU 3. But this gives me an error: "DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at" or "CUDA invalid device ordinal".

So is it possible to reset CUDA_VISIBLE_DEVICES later in the code?

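I don't think it's safe or supported to reset CUDA_VISIBLE_DEVICES in the middle of script execution, i.e. after CUDA has been initialized. CUDA reads the variable once at initialization, so later changes to os.environ don't make additional devices visible to that process. See the thread "Os.environ ["CUDA_VISIBLE_DEVICES"] not functioning" (post #4 by albanD).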
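Since the variable has to be in place before CUDA initializes, one option is to set it for a child process at launch time instead of editing os.environ mid-script. Below is a minimal sketch using only the standard library. The helper name run_with_visible_gpus is hypothetical, and the one-line child command is just a stand-in for a real training script:

```python
import os
import subprocess
import sys


def run_with_visible_gpus(script_args, gpu_ids):
    """Run a Python child process that only sees the given GPU ids.

    The variable is set in the child's environment before the child starts,
    so it is already in place when the child first initializes CUDA.
    """
    env = dict(os.environ)  # copy, so the parent's environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in child: print the value it was started with.
    child = ["-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    result = run_with_visible_gpus(child, [0, 1, 2])
    print(result.stdout.strip())
```

Inside the child, the visible devices are renumbered from 0, so with CUDA_VISIBLE_DEVICES=0,1,2 there is no device index 3 to address. This matches the "invalid device ordinal" error in the question.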

If you want to restrict the number of GPUs DeepSpeed is using, I would check whether setting the world_size parameter, or a launcher option such as nproc_per_node, could achieve this.
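If the launcher starts fewer processes than there are GPUs and leaves all GPUs visible, each process could pick its own student device from its local rank and address the leftover GPU for the teacher. The sketch below assumes the torchrun-style environment variables LOCAL_RANK and LOCAL_WORLD_SIZE. The function name pick_devices is hypothetical, and whether DeepSpeed's own launcher leaves the extra GPU visible is an assumption you would need to verify:

```python
import os
from typing import Mapping, Tuple


def pick_devices(total_gpus: int, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (student_device, teacher_device) for the current process.

    Assumes the launcher exports LOCAL_RANK and LOCAL_WORLD_SIZE and that
    all `total_gpus` devices are visible to every process.
    """
    local_rank = int(env.get("LOCAL_RANK", "0"))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", "1"))
    if local_world_size >= total_gpus:
        raise ValueError("no spare GPU left for the teacher model")
    # Students use GPUs 0..local_world_size-1; the teacher gets the last GPU.
    return f"cuda:{local_rank}", f"cuda:{total_gpus - 1}"


if __name__ == "__main__":
    # Example: 4 GPUs on the node, launched with 3 processes per node.
    print(pick_devices(4, {"LOCAL_RANK": "1", "LOCAL_WORLD_SIZE": "3"}))
```

Note that with this layout every rank would load its own copy of the teacher onto the same spare GPU, so the teacher's memory footprint there is multiplied by the number of processes.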