I am currently training the Swin Transformer from the torchvision models from scratch, following the recipe here.
Our environment is modest but should be capable (i9-10980XE, 4x RTX 3090, 128 GB RAM, Ubuntu 22.04, PyTorch 2.4, cudatoolkit=12.1), yet training progressively slows down and GPU utilization drops as well. There are also enough CPU cores available.
GPU utilization starts at 100%, but after a few iterations it drops dramatically, and the training speed is not maintained. Could anyone suggest some solutions?
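For reference, here is a minimal timing sketch I could use to check whether the input pipeline is the bottleneck (the tiny synthetic dataset, model, and loader settings below are placeholders, not my actual Swin setup). It separates the time spent waiting on the DataLoader from the total step time; if the data-wait fraction grows over epochs, the slowdown is likely in the loader rather than the GPU compute:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset/model standing in for the real training setup.
data = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(data, labels), batch_size=64,
                    num_workers=0, pin_memory=False)

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

data_time, step_time = 0.0, 0.0
end = time.perf_counter()
for x, y in loader:
    # Time spent blocked on the DataLoader for this batch.
    data_time += time.perf_counter() - end
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Total time for this step, including the data wait above.
    step_time += time.perf_counter() - end
    end = time.perf_counter()

print(f"data wait: {data_time:.4f}s / total: {step_time:.4f}s")
```

If the data wait dominates, the usual knobs are `num_workers`, `pin_memory=True`, and `persistent_workers=True` on the DataLoader; if it does not, the issue is more likely thermal throttling or a growing GPU-side cost.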