Training runs indefinitely without any new epochs being created

Hello everyone,
I’m trying to train a Mask R-CNN (ResNet-50) model on ~5000 images with torchvision, based on this tutorial: [TorchVision Object Detection Finetuning Tutorial — PyTorch Tutorials 1.10.1+cu102 documentation]. After one day with no further output, this is all I get:

Epoch: [0]  [   0/1643]  eta: 1:02:35  lr: 0.000010  loss: 3.7346 (3.7346)  loss_classifier: 1.6762 (1.6762)  loss_box_reg: 0.1342 (0.1342)  loss_mask: 1.7567 (1.7567)  loss_objectness: 0.1513 (0.1513)  loss_rpn_box_reg: 0.0163 (0.0163)  time: 2.2856  data: 0.1504  max mem: 6379
Epoch: [0]  [ 200/1643]  eta: 0:52:33  lr: 0.001009  loss: 0.8649 (1.2449)  loss_classifier: 0.1609 (0.2664)  loss_box_reg: 0.1137 (0.1101)  loss_mask: 0.5193 (0.7439)  loss_objectness: 0.0410 (0.1019)  loss_rpn_box_reg: 0.0137 (0.0225)  time: 2.3259  data: 0.0890  max mem: 10659
Epoch: [0]  [ 400/1643]  eta: 0:47:08  lr: 0.002008  loss: 0.8563 (1.0653)  loss_classifier: 0.1598 (0.2176)  loss_box_reg: 0.1226 (0.1216)  loss_mask: 0.5100 (0.6313)  loss_objectness: 0.0387 (0.0746)  loss_rpn_box_reg: 0.0077 (0.0201)  time: 2.5439  data: 0.0874  max mem: 10659
Epoch: [0]  [ 600/1643]  eta: 1 day, 4:12:41  lr: 0.003007  loss: 0.7584 (0.9834)  loss_classifier: 0.1494 (0.1987)  loss_box_reg: 0.1150 (0.1239)  loss_mask: 0.4441 (0.5806)  loss_objectness: 0.0200 (0.0611)  loss_rpn_box_reg: 0.0075 (0.0191)  time: 443.5763  data: 0.1023  max mem: 10659

Then nothing happens (I tried waiting for several days): there is no error and no further training steps are computed.

When I type nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.156.00   Driver Version: 450.156.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID T4-16C         On   | 00000000:02:01.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |  16274MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2324      C   ...v/envs/default/bin/python    14054MiB |
+-----------------------------------------------------------------------------+

I tried reducing the batch size and the number of images, but I run into the same problem each time.

I’m using torch 1.10.1 and torchvision 0.11.2, which is also what torch itself reports (see the attached screenshot).

Any idea what causes the problem?
Thanks

Could you use just a single image in the Dataset (keeping the Dataset, DataLoader, etc.) and see if the code still gets stuck at the end of the epoch?
If so, could you check whether you are using any callbacks etc. at the end of an epoch and whether these could be causing the hang?
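
For example, something along these lines — a rough sketch assuming your setup follows the finetuning tutorial (`dataset` and the collate lambda are placeholders for whatever you already build):

from torch.utils.data import Subset, DataLoader

# keep only the first sample so one "epoch" is a single iteration
single_sample_dataset = Subset(dataset, [0])

data_loader = DataLoader(
    single_sample_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,                                # also rules out worker-related hangs
    collate_fn=lambda batch: tuple(zip(*batch)),  # same behavior as the tutorial's utils.collate_fn
)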

I quickly got these logs:

/opt/pythenv/envs/default/lib64/python3.6/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Epoch: [0]  [0/1]  eta: 0:00:00  lr: 0.005000  loss: 4.6190 (4.6190)  loss_classifier: 1.8476 (1.8476)  loss_box_reg: 0.0121 (0.0121)  loss_mask: 2.6329 (2.6329)  loss_objectness: 0.1016 (0.1016)  loss_rpn_box_reg: 0.0247 (0.0247)  time: 0.6683  data: 0.0355  max mem: 1506
Epoch: [0] Total time: 0:00:00 (0.6757 s / it)
creating index...
index created!

Then nothing happens, but I think that’s expected: the next step is the evaluation part, engine.evaluate(), and I have nothing in my data_loader_val. I’m going to check by adding one image to data_loader_val.
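
For reference, my loop roughly follows the tutorial’s structure (just a sketch with my loader names; train_one_epoch and evaluate come from the tutorial’s engine.py):

from engine import train_one_epoch, evaluate

for epoch in range(num_epochs):
    # train for one epoch, printing every 200 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=200)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the validation dataset
    evaluate(model, data_loader_val, device=device)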

So it works perfectly fine with one image: the training started immediately and finished within a minute.
Could it be a problem with the images/annotations I give as input? How could I check that?

I don’t know what might be causing the issue, but based on your explanation it seems the only difference between the successful run and the hanging one is the size of the datasets?
If so, are you able to iterate both DataLoaders (without training), or does that also hang?
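
Something like this rough sketch could work, assuming the tutorial’s target format (targets are dicts with "boxes" as [x1, y1, x2, y2] tensors); data_loader and data_loader_val stand for the loaders you already create:

def check_loader(loader, name):
    # iterate the loader exactly as the training loop would, but without a model,
    # and flag empty or degenerate boxes along the way
    for i, (images, targets) in enumerate(loader):
        for t in targets:
            boxes = t["boxes"]
            degenerate = boxes.numel() > 0 and (boxes[:, 2:] <= boxes[:, :2]).any()
            if boxes.numel() == 0 or degenerate:
                print(f"{name}: suspicious target in batch {i}: {t}")
        if i % 100 == 0:
            print(f"{name}: reached batch {i}")
    print(f"{name}: finished without hanging")

check_loader(data_loader, "train")
check_loader(data_loader_val, "val")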

So I think I figured out what the issue was. The NVIDIA license I was using had expired, and after renewing it, the training works again. However, I still don’t know how I was able to train a model with just a few images…

That’s interesting. What kind of “NVIDIA license” are you referring to?