As pointed out in https://github.com/pytorch/examples/issues/164, the ImageNet training example gets almost zero GPU utilization for me. I have 4 Titan V GPUs and the data is stored locally; the drive is not an SSD, but its read throughput should not make loading this slow. I posted an issue on GitHub, but there has been no response for several days, and the earlier issue was never resolved. I tried 8 workers and 20 workers, and GPU usage stayed low in both cases. For 8 workers:
```
Epoch: [20][100/5005] Time 0.283 (1.396) Data 0.001 (1.006) Loss 2.5632 (2.3716) Prec@1 48.438 (47.486) Prec@5 67.969 (72.424)
Epoch: [20][110/5005] Time 0.239 (1.390) Data 0.001 (1.013) Loss 2.4275 (2.3647) Prec@1 49.609 (47.646) Prec@5 71.484 (72.579)
Epoch: [20][120/5005] Time 3.725 (1.422) Data 3.495 (1.056) Loss 2.0656 (2.3711) Prec@1 53.906 (47.556) Prec@5 75.781 (72.382)
Epoch: [20][130/5005] Time 3.500 (1.427) Data 3.267 (1.069) Loss 2.4683 (2.3707) Prec@1 45.312 (47.442) Prec@5 68.750 (72.343)
Epoch: [20][140/5005] Time 0.228 (1.396) Data 0.001 (1.046) Loss 2.2713 (2.3637) Prec@1 50.781 (47.565) Prec@5 72.266 (72.407)
```

For 20 workers:

```
Epoch: [20][530/5005] Time 0.227 (0.781) Data 0.001 (0.507) Loss 2.4582 (2.3638) Prec@1 42.969 (47.641) Prec@5 68.359 (72.389)
Epoch: [20][540/5005] Time 0.317 (0.772) Data 0.001 (0.498) Loss 2.2743 (2.3646) Prec@1 48.438 (47.633) Prec@5 75.000 (72.362)
Epoch: [20][550/5005] Time 0.225 (0.786) Data 0.001 (0.511) Loss 2.1320 (2.3634) Prec@1 50.000 (47.668) Prec@5 76.172 (72.384)
Epoch: [20][560/5005] Time 0.290 (0.776) Data 0.003 (0.502) Loss 2.4872 (2.3635) Prec@1 44.141 (47.633) Prec@5 67.969 (72.380)
Epoch: [20][570/5005] Time 0.250 (0.782) Data 0.002 (0.507) Loss 2.3034 (2.3634) Prec@1 47.266 (47.608) Prec@5 74.609 (72.364)
Epoch: [20][580/5005] Time 2.115 (0.776) Data 1.873 (0.502) Loss 2.3284 (2.3650) Prec@1 45.312 (47.570) Prec@5 72.656 (72.340)
Epoch: [20][590/5005] Time 0.399 (0.782) Data 0.002 (0.508) Loss 2.4217 (2.3645) Prec@1 45.703 (47.591) Prec@5 70.703 (72.348)
Epoch: [20][600/5005] Time 3.144 (0.778) Data 2.857 (0.504) Loss 2.3866 (2.3629) Prec@1 48.828 (47.632) Prec@5 71.875 (72.362)
Epoch: [20][610/5005] Time 0.236 (0.784) Data 0.002 (0.510) Loss 2.3191 (2.3638) Prec@1 51.953 (47.630) Prec@5 73.047 (72.362)
Epoch: [20][620/5005] Time 0.231 (0.776) Data 0.001 (0.502) Loss 2.4194 (2.3634) Prec@1 50.000 (47.652) Prec@5 71.875 (72.359)
Epoch: [20][630/5005] Time 0.298 (0.788) Data 0.001 (0.514) Loss 2.3440 (2.3624) Prec@1 47.266 (47.674) Prec@5 69.922 (72.368)
Epoch: [20][640/5005] Time 1.156 (0.782) Data 0.841 (0.507) Loss 2.5047 (2.3640) Prec@1 46.094 (47.629) Prec@5 69.531 (72.345)
Epoch: [20][650/5005] Time 0.230 (0.787) Data 0.002 (0.513) Loss 2.4881 (2.3637) Prec@1 46.484 (47.629) Prec@5 73.438 (72.354)
Epoch: [20][660/5005] Time 0.733 (0.780) Data 0.385 (0.506) Loss 2.3043 (2.3642) Prec@1 48.828 (47.620) Prec@5 74.219 (72.355)
Epoch: [20][670/5005] Time 0.222 (0.791) Data 0.001 (0.517) Loss 2.4218 (2.3640) Prec@1 50.000 (47.635) Prec@5 70.312 (72.358)
Epoch: [20][680/5005] Time 0.726 (0.784) Data 0.497 (0.510) Loss 2.0819 (2.3638) Prec@1 53.906 (47.653) Prec@5 75.391 (72.349)
Epoch: [20][690/5005] Time 0.224 (0.795) Data 0.002 (0.521) Loss 2.2428 (2.3634) Prec@1 49.219 (47.669) Prec@5 75.000 (72.358)
Epoch: [20][700/5005] Time 0.278 (0.787) Data 0.003 (0.513) Loss 2.4094 (2.3639) Prec@1 44.141 (47.653) Prec@5 70.312 (72.346)
Epoch: [20][710/5005] Time 0.436 (0.798) Data 0.003 (0.523) Loss 2.3120 (2.3633) Prec@1 50.000 (47.665) Prec@5 71.484 (72.351)
Epoch: [20][720/5005] Time 0.234 (0.790) Data 0.001 (0.516) Loss 2.5496 (2.3646) Prec@1 44.922 (47.650) Prec@5 69.141 (72.336)
Epoch: [20][730/5005] Time 0.232 (0.800) Data 0.001 (0.526) Loss 2.1596 (2.3641) Prec@1 51.953 (47.666) Prec@5 76.562 (72.350)
Epoch: [20][740/5005] Time 0.226 (0.793) Data 0.001 (0.519) Loss 2.4315 (2.3641) Prec@1 45.703 (47.657) Prec@5 71.094 (72.357)
Epoch: [20][750/5005] Time 0.244 (0.803) Data 0.001 (0.529) Loss 2.2962 (2.3637) Prec@1 45.703 (47.650) Prec@5 72.266 (72.376)
Epoch: [20][760/5005] Time 0.316 (0.796) Data 0.001 (0.522) Loss 2.4111 (2.3642) Prec@1 50.781 (47.631) Prec@5 72.656 (72.366)
Epoch: [20][770/5005] Time 0.245 (0.802) Data 0.001 (0.529) Loss 2.4344 (2.3643) Prec@1 48.828 (47.611) Prec@5 71.875 (72.360)
Epoch: [20][780/5005] Time 0.346 (0.795) Data 0.001 (0.522) Loss 2.3858 (2.3640) Prec@1 45.703 (47.617) Prec@5 71.094 (72.362)
Epoch: [20][790/5005] Time 0.290 (0.802) Data 0.002 (0.529) Loss 2.5051 (2.3643) Prec@1 44.922 (47.622) Prec@5 72.656 (72.356)
Epoch: [20][800/5005] Time 0.224 (0.795) Data 0.001 (0.522) Loss 2.2296 (2.3641) Prec@1 48.047 (47.624) Prec@5 74.219 (72.347)
Epoch: [20][810/5005] Time 0.239 (0.800) Data 0.002 (0.527) Loss 2.2256 (2.3643) Prec@1 49.609 (47.622) Prec@5 74.219 (72.345)
```
In both runs the averaged `Data` time is a large fraction of the averaged batch time (e.g. 0.527 s out of 0.800 s with 20 workers), so most of each step is spent waiting for input. My `pin_memory` is set to `True`, and my dataloaders are configured as:
```python
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler)

val_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(valdir, transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])),
    batch_size=args.batch_size, shuffle=False,
    num_workers=args.workers, pin_memory=True)
```
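To check whether the input pipeline alone is the bottleneck, a loop like the one below can time the loader with no GPU work at all. This is just a minimal sketch on top of the `train_loader` above (not part of the example script; the warm-up and window sizes are arbitrary choices):

```python
import time

# Let the workers spin up and fill their prefetch queues first,
# so the measurement is not dominated by startup cost.
it = iter(train_loader)
for _ in range(5):
    next(it)

# Time the next 50 batches of pure data loading (no model, no GPU).
start = time.time()
for _ in range(50):
    next(it)
elapsed = time.time() - start
print(f"avg loader time per batch: {elapsed / 50:.3f} s")
```

If this average is close to the averaged `Data` time in the logs above, the stall is in disk reads / JPEG decoding rather than anything on the GPU side.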
The only change I made is adding my own resnet101 module for training; everything else is untouched.
@ptrblck, I saw you had some suggestions in GPU: high memory usage, low GPU volatile-util, but I don't know how to debug this further. Could you please help? Thanks!