I am trying to use DeepSpeed for distillation: I initialize the student model with the DeepSpeed parameters, and I want a separate GPU to load the teacher model (the teacher is big enough that student and teacher don't both fit on one GPU). The issue is that I can't place the teacher on a GPU DeepSpeed is already using, because DeepSpeed spawns its own processes to shard the model.
My idea was to limit DeepSpeed to a subset of GPUs by setting `CUDA_VISIBLE_DEVICES=0,1,2`, for example, and then reset `os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'` later in the code so the teacher model can be placed on the fresh, empty GPU 3. But this fails with either `DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at ...` or `CUDA error: invalid device ordinal`.
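For reference, here is a minimal sketch of the pattern I'm attempting; the `torch`/`deepspeed` calls are commented out and only indicative, and the exact `deepspeed.initialize` arguments are omitted:

```python
import os

# Restrict the DeepSpeed run to GPUs 0-2 before any CUDA initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

# import torch, deepspeed                       # CUDA device count is read here
# model_engine, *_ = deepspeed.initialize(...)  # student sharded over GPUs 0-2

# Later: try to expose GPU 3 so the teacher can live there on its own.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
# teacher = teacher.to("cuda:3")  # -> "CUDA error: invalid device ordinal"
```

The environment variable itself is updated as expected; the error only appears when I actually try to move the teacher onto `cuda:3`.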
So, is it possible to reset `CUDA_VISIBLE_DEVICES` later in the code, or is the visible-device set fixed once CUDA has been initialized?