I want to train many models on my GPUs.
Each model has a large, fixed embedding matrix, and each model is trained independently (I am not training a single model across multiple GPUs).
To fit more models on one card, I need to reduce the memory cost on each card.
So I am wondering: is there a way to share the biggest CUDA tensor (my embedding matrix) across processes, so that only one copy of the matrix exists on each card?
Looking forward to any solutions or suggestions.