Training waits a long time after loading the model

chunchun · December 10, 2019, 9:03am

It stays in this state for a long time.
Could you please tell me why?

ptrblck · December 10, 2019, 6:43pm

We would need some more information about your use case to help you out.
What kind of model are you loading?
How long does the loading take?
Are you pushing the model to the GPU or is it still on the CPU?

chunchun · December 11, 2019, 12:19am

Thanks a lot for your attention.

This is the code I am using to reproduce the training process. The model is SiamRPN, one of the trackers in pysot. The loading takes many hours, and when I press Ctrl+C, it shows the following:

Even after I press Ctrl+C, the GPU still shows the following:

The GPU-Util remains at 100%.
The training process is running on these three GPUs, and I am just stuck here.
I tried setting num_workers to 0 and decreasing the batch size, neither of which worked.
Could you please tell me why?
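
One way to narrow down a hang like this is to have Python print the stack of every thread if the process is still running after a timeout, so the blocking call becomes visible without pressing Ctrl+C. The following is a minimal sketch using the standard-library `faulthandler` module; the helper name `enable_hang_dump` and the 10-minute default are assumptions for illustration and are not part of pysot or PyTorch.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=600.0, file=None):
    """Periodically dump all thread stacks while the process keeps running.

    Hypothetical helper: call it once at the top of the training script.
    `file` must be a real file object (it needs a file descriptor).
    """
    if file is None:
        file = sys.stderr
    # After `timeout_s` seconds, write the traceback of every thread to
    # `file`, and repeat every `timeout_s` seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disable_hang_dump():
    """Cancel the pending dump, e.g. once training has actually started."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` at the start of each worker process would show which Python line each rank is waiting on. It only reports Python frames, so a stall inside a native library shows up as the Python call that entered it.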

ptrblck · December 11, 2019, 12:31am

Is this issue specific to this particular repository? Can you run some of our tutorials, or do you see the same error there?

chunchun · December 11, 2019, 12:49am

I think this issue is specific to this repository.
I have seen someone else ask the same question on GitHub.
Could it be a problem with distributed training? I use 3 GPUs instead of 8.
Sorry, I have no idea.
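
Whether the GPU count is the cause is not confirmed in this thread, but one thing that can be checked is that the launch settings agree with each other. A process group waits for all expected ranks to join, so if more ranks are expected than processes were started, initialization can block. The sketch below assumes a single-machine job started by a launcher that exports `WORLD_SIZE` and `RANK` (as `torch.distributed.launch` does); the helper name `check_single_node_launch` is hypothetical.

```python
import os


def check_single_node_launch(env=None):
    """Return a list of human-readable problems found in the launch settings.

    Assumption: one machine, one process per GPU. For multi-node jobs
    WORLD_SIZE is legitimately larger than the local GPU count.
    """
    env = os.environ if env is None else env
    problems = []

    world_size = int(env.get("WORLD_SIZE", "1"))

    # CUDA_VISIBLE_DEVICES is a comma-separated list of device ids.
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        n_visible = len([d for d in visible.split(",") if d.strip()])
        if world_size > n_visible:
            problems.append(
                f"WORLD_SIZE={world_size} but only {n_visible} GPU(s) are visible"
            )

    # Every rank must lie in the range [0, WORLD_SIZE).
    rank = env.get("RANK")
    if rank is not None and not 0 <= int(rank) < world_size:
        problems.append(f"RANK={rank} is outside [0, {world_size})")

    return problems
```

For example, `check_single_node_launch({"WORLD_SIZE": "8", "CUDA_VISIBLE_DEVICES": "0,1,2"})` reports one problem, while a consistent 3-GPU setting returns an empty list.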

chunchun · December 11, 2019, 3:14am

I tried running on a single GPU, and it works.
Maybe the issue comes from distributed training.
I still haven’t found the real cause.

ptrblck · December 11, 2019, 4:07pm

Are you able to run the distributed example without any error?

chunchun · December 12, 2019, 12:12am

I will try it and then reply.
Thank you again for your attention.

chunchun · December 27, 2019, 7:17am

I still can’t resolve it.
Others in my lab run into the same situation,
and we don’t know why.