Hi, I have a weird question about some crazy idea I want to materialize using PyTorch.
The problem: I want to train a re-id model using multiple datasets. Such a solution already exists in https://github.com/KaiyangZhou/deep-person-reid and it works.
However, I believe it is not optimal, as this repo combines the datasets, sums all unique entities and relabels all of them. Example: I have 3 datasets with 50,000 unique entities each. When using
deep-person-reid I will get 150,000 unique entities at the end and the last softmax layer will have a size of 150,000 neurons. The problem is that such an approach makes the dimension of the softmax layer explode.
In my case I have almost 1 million unique entities, which causes a huge GPU RAM overhead just to store the model! (no worries, my entities are not human people)
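To put a rough number on that overhead, here is a back-of-the-envelope sketch. It assumes the "softmax layer" is really a linear classifier (weight matrix plus bias) in float32 feeding the softmax; the function name classifier_memory_gib and the feature dimension of 2048 are made up for illustration and are not taken from deep-person-reid.

```python
def classifier_memory_gib(num_classes, feat_dim, bytes_per_param=4, copies=1):
    """Approximate memory of a linear classifier in GiB.

    copies=1 counts only the weights; copies=4 also counts gradients and
    the two moment buffers an Adam-style optimizer would keep.
    """
    params = num_classes * feat_dim + num_classes  # weight matrix + bias
    return params * bytes_per_param * copies / 1024**3


feat_dim = 2048  # assumption: size of the embedding fed to the classifier
for n in (50_000, 150_000, 1_000_000):
    weights_only = classifier_memory_gib(n, feat_dim)
    with_adam = classifier_memory_gib(n, feat_dim, copies=4)
    print(f"{n:>9} classes: {weights_only:.2f} GiB weights, "
          f"{with_adam:.2f} GiB with grads + Adam state")
```

With these assumed numbers the 1 million class case already needs several GiB for the weights alone, before any activations are stored.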
To alleviate the problem I came up with a crazy idea (I think it is a little bit crazy).
(For all examples let’s assume I have 3 different datasets, with 50,000 entities and image sizes etc. are consistent across all datasets)
Create separate dataloaders, which will be used in turns during training. (3 separate dataloaders)
Create a removable softmax layer that would be used in conjunction with a specific dataset/dataloader. (3 separate softmax layers)
Create separate (?) optimizers that would be used in conjunction with a specific softmax layer. (3 separate optimizers)
In such a setup the training would look like this:
0. Move model (without softmax layer), criterion etc. to GPU.
1. Get 1st dataloader & optimizer.
2. Append 1st softmax layer to the model.
3. Remove 1st softmax layer from model and from GPU
4. Get 2nd dataloader & optimizer
5. Append 2nd softmax layer to the model.
6. Remove 2nd softmax … and so on.
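A minimal sketch of the loop described above, under a few assumptions: the "softmax layer" is a linear classifier trained with CrossEntropyLoss, the datasets are tiny random stand-ins, and the names (num_ids, heads, head_opts, backbone_opt) are hypothetical. It uses plain SGD without momentum, so there is no optimizer state to worry about when a layer changes device.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins for 3 datasets with different numbers of entities.
num_ids = {"ds1": 5, "ds2": 7, "ds3": 4}
loaders = {
    name: DataLoader(
        TensorDataset(torch.randn(32, 10), torch.randint(0, n, (32,))),
        batch_size=8,
        shuffle=True,
    )
    for name, n in num_ids.items()
}

# Step 0: shared model (without classifier) and criterion go to the GPU.
backbone = nn.Sequential(nn.Linear(10, 16), nn.ReLU()).to(device)
criterion = nn.CrossEntropyLoss()

# One classifier per dataset; they start (and rest) on the CPU.
heads = {name: nn.Linear(16, n) for name, n in num_ids.items()}

backbone_opt = torch.optim.SGD(backbone.parameters(), lr=0.01)
head_opts = {name: torch.optim.SGD(h.parameters(), lr=0.01)
             for name, h in heads.items()}

steps = 0
for epoch in range(2):
    for name, loader in loaders.items():
        head = heads[name].to(device)        # "append" this dataset's layer
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            backbone_opt.zero_grad()
            head_opts[name].zero_grad()
            loss = criterion(head(backbone(x)), y)
            loss.backward()
            backbone_opt.step()              # shared weights see every dataset
            head_opts[name].step()           # only the active classifier is updated
            steps += 1
        head.to("cpu")                       # "remove" it from the GPU again
```

Whether to switch datasets per epoch (as here) or per batch is a design choice; switching per batch mixes the datasets more evenly but pays the transfer cost more often.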
I see here 2 main challenges:
I could not find a clear explanation of how optimizers work under the hood, but I heard that they are somehow bound to all modules in the model.
I assume there should also be 3 separate optimizers, one dedicated to each softmax layer. Moreover, each optimizer should be disabled outside its own training cycle
(so that it does not accumulate gradients when its dedicated softmax layer is not used in the current training cycle).
Maybe write a custom optimizer that would allow freezing its parameters when needed?
What would be a sensible way to approach this challenge?
I am not sure how to deal with the problem of replacing the softmax layer during training.
My first idea was to just choose the desired softmax layer in the forward function, but then the other softmax layers would still be sitting in GPU RAM,
which would effectively be the same as what deep-person-reid implements, so it would not solve the main problem of consuming too much GPU RAM.
My second thought was to modify the model architecture instead - replace only the softmax layer (with the corresponding weights ofc), however I am not
sure how doable it is on the fly.
Moreover, there will probably be some performance overhead with this approach.
Hope I explained the idea clearly enough. I am more than happy to explain it more if needed.
Most of all I hope someone can direct me to docs/tutorial/something else that would give me more insight into the inner workings of PyTorch, which would help me solve my problem.
You pass (some) parameters to the optimizer, which are updated using their gradients (stored in the .grad attribute of each parameter). If you don’t pass certain parameters to an optimizer, they won’t be updated in any way. Since softmax does not contain any parameters, your current idea won’t work unfortunately.
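A small sketch of that behaviour, assuming the per-dataset layers are linear classifiers (nn.Linear) in front of the softmax; backbone, head_a and head_b are hypothetical toy modules used only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# nn.Softmax itself holds no parameters, so there is nothing to optimize in it.
print(len(list(nn.Softmax(dim=1).parameters())))  # 0

backbone = nn.Linear(8, 4)   # stand-in for the shared feature extractor
head_a = nn.Linear(4, 3)     # classifier for a hypothetical dataset A
head_b = nn.Linear(4, 5)     # classifier for a hypothetical dataset B

# Only backbone and head_a are handed to the optimizer.
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(head_a.parameters()), lr=0.1
)

head_a_before = head_a.weight.detach().clone()
head_b_before = head_b.weight.detach().clone()

x = torch.randn(16, 8)
target = torch.randint(0, 3, (16,))
loss = nn.functional.cross_entropy(head_a(backbone(x)), target)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# head_b was neither used in the forward pass nor passed to the optimizer.
print(head_b.weight.grad is None)
print(torch.equal(head_b.weight, head_b_before))
# head_a was used and passed, so its weights moved.
print(torch.equal(head_a.weight, head_a_before))
```

The optimizer only knows about the tensors it was given; it has no link to the model as a whole.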
If a layer is not used during the forward pass, it won’t get any gradients in the backward pass. You could therefore use an if condition and use different layers in the forward pass. Only the selected (and used) layer will be included in the backward pass. This will add some overhead, but you could transfer all unused layers back to the CPU to free the GPU memory for the current one.
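A rough sketch of that idea, again assuming linear classifiers as the per-dataset layers. The class MultiHeadReID, its toy backbone, and the activate_head helper are hypothetical names, not part of PyTorch or deep-person-reid.

```python
import torch
import torch.nn as nn


class MultiHeadReID(nn.Module):
    def __init__(self, feat_dim, num_classes_per_dataset):
        super().__init__()
        # Toy backbone; a real re-id model would use a CNN here.
        self.backbone = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU())
        # One classifier per dataset, each with its own number of classes.
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n)
            for name, n in num_classes_per_dataset.items()
        })

    def forward(self, x, dataset):
        feats = self.backbone(x)
        # Only the selected head takes part in forward and backward.
        return self.heads[dataset](feats)

    def activate_head(self, dataset, device):
        # Keep the selected head on `device`, park all others on the CPU.
        for name, head in self.heads.items():
            head.to(device if name == dataset else "cpu")


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MultiHeadReID(feat_dim=16,
                      num_classes_per_dataset={"a": 50, "b": 70, "c": 30})
model.backbone.to(device)

model.activate_head("b", device)
x = torch.randn(4, 32, device=device)
logits = model(x, "b")
logits.sum().backward()

print(logits.shape)                             # one logit per class of dataset "b"
print(model.heads["a"].weight.grad is None)     # unused head received no gradient
print(model.heads["b"].weight.grad is not None)
```

Calling activate_head once per dataset switch rather than once per batch keeps the transfer overhead small.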
I would generally “Go for it!”, especially after reading your first sentence
However, even if you skip the idea of using different softmax layers (due to the missing parameters) and instead use e.g. different linear layers, I’m not sure how you would choose between them in the test case, i.e. if you are dealing with new samples, which are not coming from the three predefined Datasets.
Thanks for the quick reply. I was really counting on a reply from you, ptrblck
Maybe I did not explain this part well enough or I do not understand your reply properly.
Moreover, maybe using 3 datasets with 50,000 entities each in my example was the wrong decision. Datasets may have any number of entities, so the size of the softmax layers can vary a lot - that’s why using multiple softmax layers seems to be a must.
Of course softmax has no parameters, so I do not intend to update the softmax layer with backprop. I assumed that optimizers are ‘bound’ to all modules in the model, so I reckoned that each model version (the versions differ only in the softmax layer) should have a separate optimizer, as the model would be different due to the different softmax layers.
However, after reading your reply it occurred to me that maybe there is an easier approach. As softmax has no parameters, can I have one optimizer bound to all layers of the model except the softmax layer, which would optimize the parameters of all model versions? So one optimizer for all models with different softmax layers?
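A minimal sketch of the single-optimizer variant, assuming the per-dataset layers are linear classifiers that do have parameters (toy sizes, hypothetical names). It relies on the optimizer skipping any parameter whose .grad is None, which is why zero_grad is called with set_to_none=True.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

backbone = nn.Linear(10, 16)
heads = nn.ModuleDict({"ds1": nn.Linear(16, 5), "ds2": nn.Linear(16, 7)})

# One optimizer over the shared layers and every classifier.
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(heads.parameters()),
    lr=0.1,
    momentum=0.9,
)

ds2_before = heads["ds2"].weight.detach().clone()

# Train only on "ds1" for a few steps; "ds2" is never used in forward.
for _ in range(3):
    x = torch.randn(8, 10)
    y = torch.randint(0, 5, (8,))
    optimizer.zero_grad(set_to_none=True)   # unused params keep grad=None
    loss = nn.functional.cross_entropy(heads["ds1"](backbone(x)), y)
    loss.backward()
    optimizer.step()                        # params with grad=None are skipped

unchanged = torch.equal(heads["ds2"].weight, ds2_before)
print(unchanged)
```

If gradients were zeroed to actual zero tensors instead of None, an optimizer with momentum or weight decay could still move the unused classifier, so the set_to_none detail matters here.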
This gives me a bit of hope, but I am confused about how the GPU memory can be freed with this ‘forward approach’. If the model is loaded onto the GPU, all softmax layers will be sitting there as well all the time, even if they are not used in the current forward pass, am I right?
The problem is that during the forward pass only the softmax layer would be changing, which does not have any parameters anyway. I can’t see any savings here.
Thanks! I will try, and maybe ask a couple more questions along the way.