Performance: several runs

Hello,

I have a bash script in which I train and test several networks/models, such as:

python model_code.py 1 
python model_code.py 2
...
python model_code.py 10
 

The parameter value (1, 2, …) identifies a different type of CNN. I start with CNNs with fewer layers (e.g. ResNet-18) and go on increasing the number of layers. I have noticed that as the networks/models are called in sequence, training gets much slower than usual. In other words, if I run model 6 alone (not after the 5 previous networks have been called), it trains faster. Has anyone experienced this problem? Is the procedure above (calling multiple networks in sequence in a script) not suitable?

Thank you.

Since the scripts are executed sequentially, the order shouldn’t make a difference.
However, your machine might be running into e.g. thermal issues and might thus be lowering its performance to avoid overheating. Check the health status of your workstation, especially the temperature sensors and the clock frequencies of the CPU, GPU, etc.
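For example, a minimal polling sketch, assuming an NVIDIA GPU with nvidia-smi on the PATH (the exact query fields can differ across driver versions):

```python
import subprocess
import time

# Fields to sample; names may vary slightly between driver versions.
QUERY = "temperature.gpu,clocks.sm,utilization.gpu"

def sample_gpu_health() -> str:
    # Ask nvidia-smi for the current temperature, SM clock, and utilization.
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    # Run this alongside the training script; a dropping clock at a high
    # temperature would point to thermal throttling.
    while True:
        print(sample_gpu_health())
        time.sleep(10)
```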

Thank you for your answer, I will check that. Another problem related to this context is the following. I now have two of these scripts, and each one calls 10 models in sequence. I am running one script on one GPU and the other on another GPU. Let us say that script/GPU1 is running model 5 and script/GPU2 is running model 3 (I started GPU1 first). Since I save the best model during training to load later in the inference/test phase (as is usual), a problem like this happens:

size mismatch for fc1.0.weight: copying a param with shape torch.Size([2, 3072]) from checkpoint, the shape in current model is torch.Size([2, 3232]).

It is clear to me that the problem is that, after finishing training model 5 (GPU1), it loads the checkpoint from network 3 (GPU2). I am just wondering how to handle this checkpoint issue. Do I need a different approach to run these in parallel?

Thank you again.

I don’t know how your workload is supposed to work exactly, but based on the description it seems that you would like to train multiple models and load the most recent checkpoint?
If so, it seems that you are using different models, which might be causing the shape mismatch error. In that case I would store the checkpoints for each model separately, using a different name or folder.
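A minimal sketch of that idea, assuming each run receives its model id as a CLI argument (as in the original bash script); the build_model factory here is a hypothetical stand-in for however the real script picks the CNN:

```python
import sys
from pathlib import Path

import torch
import torch.nn as nn

model_id = sys.argv[1]  # e.g. "1".."10", passed by the bash script

def build_model(model_id: str) -> nn.Module:
    # Hypothetical stand-in; the real code would select the CNN by id.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))

# Keep each model's checkpoint in its own folder so runs on different
# GPUs (or sequential runs) never overwrite each other's files.
ckpt_dir = Path("checkpoints") / f"model_{model_id}"
ckpt_dir.mkdir(parents=True, exist_ok=True)
best_path = ckpt_dir / "best.pt"

model = build_model(model_id)
# ... during training, whenever the validation metric improves:
torch.save(model.state_dict(), best_path)

# ... in the test phase, rebuild the SAME architecture and load its own file:
model = build_model(model_id)
model.load_state_dict(torch.load(best_path))
model.eval()
```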


Thank you again for your answer. Yes, the idea is to train two different models (A, B) on two GPUs (1, 2) at the same time but load the best model obtained during training. Hence, I want to load the best model of A into a new instance of A on GPU1, and the same applies to B. I saved and loaded models as suggested here, but I did not save and load a general checkpoint.
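For reference, a general checkpoint bundles the weights together with the training state; a self-contained sketch, where the tiny model and the epoch/loss values are purely illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # illustrative stand-in for the real CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Save weights plus optimizer state and bookkeeping in one dict.
torch.save({
    "epoch": 5,                                   # illustrative value
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "best_val_loss": 0.42,                        # illustrative value
}, "best_A.pt")

# Load it back into fresh instances of the same architecture.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
checkpoint = torch.load("best_A.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
```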

The models (A, B) were saved into different folders to avoid issues. After analysing it a little more, I think the problem is likely related to a configuration file which holds all the parameter values the programs need (collected at runtime).
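If both runs do share one such file, a simple fix is to key it by model id as well, just like the checkpoints; a sketch assuming a hypothetical JSON config written at runtime (the parameter names and values below are placeholders):

```python
import json
import sys
from pathlib import Path

model_id = sys.argv[1]

# Write runtime parameters to a per-model file instead of a shared one,
# so two concurrent runs cannot read each other's layer sizes.
config = {"fc1_in_features": 3072, "num_classes": 2}  # placeholder values

cfg_path = Path("configs") / f"model_{model_id}.json"
cfg_path.parent.mkdir(parents=True, exist_ok=True)
cfg_path.write_text(json.dumps(config))

# In the test phase, read back only the config that matches this run.
config = json.loads(cfg_path.read_text())
```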