ZeroDivisionError: float division by zero (RetinaNet on custom dataset)

Intro

Hi, I want to train RetinaNet from the torchvision detection reference scripts (https://github.com/pytorch/vision/tree/main/references/detection) on a custom dataset in COCO format.

Structure

I have the following folder and file structure:

DATASET FOLDER/
├── annotations/
│   ├── instances_train.json
│   └── instances_val.json
├── train/
│   ├── images/
│   │   ├── 1.png
│   │   ├── 3.png
│   │   ├── 7d938bce-cda0-4814-b938-6d0467f51afb_50.png
│   │   └── 7d938bce-cda0-4814-b938-6d0467f51afb_59.png
│   └── labels/
│       ├── 1.txt
│       ├── 3.txt
│       ├── 7d938bce-cda0-4814-b938-6d0467f51afb_50.txt
│       └── 7d938bce-cda0-4814-b938-6d0467f51afb_59.txt
└── val/
    ├── images/
    │   ├── 15.png
    │   ├── 24.png
    │   ├── 7d93fb50.png
    │   └── 7d9359.png
    └── labels/
        ├── 15.txt
        ├── 24.txt
        ├── 7d93fb50.txt
        └── 7d9359.txt
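
One thing that seems worth checking is whether the file_name entries in the annotation JSON actually resolve to files on disk. As far as I understand, the reference loader looks for train images under <data_path>/train and joins that folder with file_name, so I put together this small pycocotools sanity check (paths are my own):

import os
from pycocotools.coco import COCO

root = "/home/egorundel/data/NAMI_data_coco_without_subfolders"  # my data_path
coco = COCO(os.path.join(root, "annotations", "instances_train.json"))
print("images in JSON:", len(coco.getImgIds()))
print("annotations in JSON:", len(coco.getAnnIds()))

# My images live in train/images/, so file_name would have to be e.g.
# "images/1.png" (not just "1.png") for <root>/train/<file_name> to exist.
missing = [
    img["file_name"]
    for img in coco.loadImgs(coco.getImgIds())
    if not os.path.exists(os.path.join(root, "train", img["file_name"]))
]
print("image files not found on disk:", len(missing), missing[:5])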

Command

When I run the following command:
torchrun --nproc_per_node=1 train.py --dataset coco --model retinanet_resnet50_fpn --epochs 3 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --weights-backbone ResNet50_Weights.IMAGENET1K_V1

Problem

I am running into this error: ZeroDivisionError: float division by zero. Full output:

(retinaNet) egorundel@egorundel-B560M-H:~/projects/vision/references/detection$ torchrun --nproc_per_node=1 train.py --dataset coco --model retinanet_resnet50_fpn --epochs 3 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --weights-backbone ResNet50_Weights.IMAGENET1K_V1
| distributed init (rank 0): env://
Namespace(amp=False, aspect_ratio_group_factor=3, backend='pil', batch_size=2, data_augmentation='hflip', data_path='/home/egorundel/data/NAMI_data_coco_without_subfolders/', dataset='coco', device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, epochs=3, gpu=0, lr=0.01, lr_gamma=0.1, lr_scheduler='multisteplr', lr_step_size=8, lr_steps=[16, 22], model='retinanet_resnet50_fpn', momentum=0.9, norm_weight_decay=None, opt='sgd', output_dir='.', print_freq=20, rank=0, resume='', rpn_score_thresh=None, start_epoch=0, sync_bn=False, test_only=False, trainable_backbone_layers=None, use_copypaste=False, use_deterministic_algorithms=False, use_v2=False, weight_decay=0.0001, weights=None, weights_backbone='ResNet50_Weights.IMAGENET1K_V1', workers=4, world_size=1)
Loading data
loading annotations into memory...
Done (t=0.97s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: []
Creating model
Start training
Traceback (most recent call last):
  File "train.py", line 334, in <module>
    main(args)
  File "train.py", line 309, in main
    train_one_epoch(model, optimizer, data_loader, device, epoch, args.print_freq, scaler)
  File "/home/egorundel/projects/vision/references/detection/engine.py", line 27, in train_one_epoch
    for images, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/egorundel/projects/vision/references/detection/utils.py", line 200, in log_every
    print(f"{header} Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")
ZeroDivisionError: float division by zero
[2023-11-07 16:21:15,565] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 30250) of binary: /home/egorundel/venvs/retinaNet/bin/python
Traceback (most recent call last):
  File "/home/egorundel/venvs/retinaNet/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-07_16:21:15
  host      : egorundel-B560M-H
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30250)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
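
If I read the traceback correctly, the division that fails is total_time / len(iterable) in utils.log_every, so the training data loader apparently has length 0; together with the "Count of instances per bin: []" line above, it looks like no training samples are being picked up at all. A stripped-down sketch of the failing print (my own simplification, not the exact reference code):

import datetime

def print_epoch_summary(iterable, total_time):
    # Mirrors the final print in log_every: dividing by len(iterable)
    # raises ZeroDivisionError when the data loader yields zero batches.
    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
    print(f"Epoch: [0] Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")

print_epoch_summary([], 0.0)  # ZeroDivisionError: float division by zero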

System

OS: Ubuntu 20.04
GPU: RTX 3060
torch: 2.2.0.dev20231107+cu118
torchvision: 0.17.0.dev20231107+cu118
torchaudio: 2.2.0.dev20231107+cu118

Question

What could be the problem?

In the source code, I already made the changes needed for my custom dataset:

My dataset has 13 classes, so I changed 91 to 13 here.
vision/references/detection/train.py:

...
def get_dataset(is_train, args):
    image_set = "train" if is_train else "val"
    num_classes, mode = {"coco": (13, "instances"), "coco_kp": (2, "person_keypoints")}[args.dataset]
    with_masks = "mask" in args.model
    ds = get_coco(
        root=args.data_path,
        image_set=image_set,
        transforms=get_transform(is_train, args),
        mode=mode,
        use_v2=args.use_v2,
        with_masks=with_masks,
    )
    return ds, num_classes
...
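
To make sure the 13 actually matches the annotations, I can also print the categories from the JSON. If I understand the reference code correctly, ConvertCocoPolysToMask uses category_id directly as the label, so num_classes has to be greater than the largest category id. A small pycocotools sketch (path is my own):

from pycocotools.coco import COCO

coco = COCO("/home/egorundel/data/NAMI_data_coco_without_subfolders/annotations/instances_train.json")
cats = coco.loadCats(coco.getCatIds())
print(len(cats), "categories")
print(sorted((c["id"], c["name"]) for c in cats))
# If the largest id printed here is >= 13, num_classes=13 would be too small.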

I also changed train2017 and val2017 to train and val.

vision/references/detection/coco_utils.py:

...
def get_coco(root, image_set, transforms, mode="instances", use_v2=False, with_masks=False):
    anno_file_template = "{}_{}.json"
    PATHS = {
        "train": ("train", os.path.join("annotations", anno_file_template.format(mode, "train"))),
        "val": ("val", os.path.join("annotations", anno_file_template.format(mode, "val"))),
        # "train": ("val", os.path.join("annotations", anno_file_template.format(mode, "val")))
    }
...

But even with these changes, the error remains.
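
To narrow this down further, I can also load the train split directly with torchvision.datasets.CocoDetection, outside the training script, just to see how many samples the plain dataset reports (as far as I can tell, the reference get_coco additionally filters out images without annotations for the train split). A standalone sketch with my own paths:

import os
import torchvision

root = "/home/egorundel/data/NAMI_data_coco_without_subfolders"
ds = torchvision.datasets.CocoDetection(
    root=os.path.join(root, "train"),
    annFile=os.path.join(root, "annotations", "instances_train.json"),
)
print("raw dataset length:", len(ds))  # number of images listed in the JSON

# Reading one sample shows whether file_name resolves against <root>/train;
# a FileNotFoundError here would point at the images/ subfolder layout.
img, target = ds[0]
print(img.size, "annotations for the first image:", len(target))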