Intro
Hi, I want to train RetinaNet with the PyTorch detection reference scripts (https://github.com/pytorch/vision/tree/main/references/detection) on a custom dataset in COCO format.
Structure
I have the following folder and file structure:
DATASET FOLDER/
  annotations/
    instances_train.json
    instances_val.json
  train/
    images/
      1.png
      3.png
      7d938bce-cda0-4814-b938-6d0467f51afb_50.png
      7d938bce-cda0-4814-b938-6d0467f51afb_59.png
      …
    labels/
      1.txt
      3.txt
      7d938bce-cda0-4814-b938-6d0467f51afb_50.txt
      7d938bce-cda0-4814-b938-6d0467f51afb_59.txt
      …
  val/
    images/
      15.png
      24.png
      7d93fb50.png
      7d9359.png
      …
    labels/
      15.txt
      24.txt
      7d93fb50.txt
      7d9359.txt
      …
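To rule out unreadable or empty annotation files, a check along these lines can be run with pycocotools (the dataset root in the snippet is just a placeholder for the folder shown above):

from pycocotools.coco import COCO

# Placeholder path; substitute the real dataset root from the layout above.
ann_file = "DATASET_FOLDER/annotations/instances_train.json"

coco = COCO(ann_file)  # parses the JSON and builds the index
print("images     :", len(coco.getImgIds()))
print("annotations:", len(coco.getAnnIds()))
print("categories :", [c["name"] for c in coco.loadCats(coco.getCatIds())])

# Spot-check that the file_name entries match the files under train/images
for img in coco.loadImgs(coco.getImgIds()[:3]):
    print(img["file_name"])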
Command
When I run the following command:
torchrun --nproc_per_node=1 train.py --dataset coco --model retinanet_resnet50_fpn --epochs 3 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --weights-backbone ResNet50_Weights.IMAGENET1K_V1
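For context, this is roughly how I understand the reference script resolves --data-path into an image folder and annotation file (my own paraphrase of get_coco in references/detection/coco_utils.py, which may differ between torchvision versions; it is not a verbatim quote):

import os

# Rough paraphrase of the path logic in coco_utils.py as I read it;
# folder and file names may differ depending on the torchvision version.
def expected_coco_paths(root, image_set, mode="instances"):
    anno_file_template = "{}_{}2017.json"
    paths = {
        "train": ("train2017", os.path.join("annotations", anno_file_template.format(mode, "train"))),
        "val": ("val2017", os.path.join("annotations", anno_file_template.format(mode, "val"))),
    }
    img_folder, ann_file = paths[image_set]
    return os.path.join(root, img_folder), os.path.join(root, ann_file)

print(expected_coco_paths("/home/egorundel/data/NAMI_data_coco_without_subfolders/", "train"))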
Problem
The run fails with ZeroDivisionError: float division by zero. Full output:
(retinaNet) egorundel@egorundel-B560M-H:~/projects/vision/references/detection$ torchrun --nproc_per_node=1 train.py --dataset coco --model retinanet_resnet50_fpn --epochs 3 --lr-steps 16 22 --aspect-ratio-group-factor 3 --lr 0.01 --weights-backbone ResNet50_Weights.IMAGENET1K_V1
| distributed init (rank 0): env://
Namespace(amp=False, aspect_ratio_group_factor=3, backend='pil', batch_size=2, data_augmentation='hflip', data_path='/home/egorundel/data/NAMI_data_coco_without_subfolders/', dataset='coco', device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, epochs=3, gpu=0, lr=0.01, lr_gamma=0.1, lr_scheduler='multisteplr', lr_step_size=8, lr_steps=[16, 22], model='retinanet_resnet50_fpn', momentum=0.9, norm_weight_decay=None, opt='sgd', output_dir='.', print_freq=20, rank=0, resume='', rpn_score_thresh=None, start_epoch=0, sync_bn=False, test_only=False, trainable_backbone_layers=None, use_copypaste=False, use_deterministic_algorithms=False, use_v2=False, weight_decay=0.0001, weights=None, weights_backbone='ResNet50_Weights.IMAGENET1K_V1', workers=4, world_size=1)
Loading data
loading annotations into memory...
Done (t=0.97s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: []
Creating model
Start training
Traceback (most recent call last):
  File "train.py", line 334, in <module>
    main(args)
  File "train.py", line 309, in main
    train_one_epoch(model, optimizer, data_loader, device, epoch, args.print_freq, scaler)
  File "/home/egorundel/projects/vision/references/detection/engine.py", line 27, in train_one_epoch
    for images, targets in metric_logger.log_every(data_loader, print_freq, header):
  File "/home/egorundel/projects/vision/references/detection/utils.py", line 200, in log_every
    print(f"{header} Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")
ZeroDivisionError: float division by zero
[2023-11-07 16:21:15,565] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 30250) of binary: /home/egorundel/venvs/retinaNet/bin/python
Traceback (most recent call last):
  File "/home/egorundel/venvs/retinaNet/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/egorundel/venvs/retinaNet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-07_16:21:15
  host      : egorundel-B560M-H
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30250)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
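From the output, "Count of instances per bin: []" together with the total_time / len(iterable) division in log_every suggests the data loader ends up with zero batches, i.e. the dataset object is created but contains no usable samples. A quick way to check this outside of torchrun would be something like the following (the paths are my guess at how the layout above maps onto --data-path, based on the Namespace output):

import torchvision

# Paths assumed from the Namespace output and the folder layout above.
ds = torchvision.datasets.CocoDetection(
    root="/home/egorundel/data/NAMI_data_coco_without_subfolders/train/images",
    annFile="/home/egorundel/data/NAMI_data_coco_without_subfolders/annotations/instances_train.json",
)
print(len(ds))  # 0 here would be consistent with the empty aspect-ratio bins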
System
OS: Ubuntu 20.04
GPU: RTX 3060
torch: 2.2.0.dev20231107+cu118
torchvision: 0.17.0.dev20231107+cu118
torchaudio: 2.2.0.dev20231107+cu118
Question
What could be the problem?