Thought I should follow up. For context, I am using googlenet in 2024 because I am interested in playing with Deep Dream rather than making Deep Fakes.
I was measuring these metrics:
- GPU shader utilization using nvidia-smi or rocm-smi.
- The “Time” per minibatch value in the script output.
- The SSD read speed (rkB/s) and average queue length (aqu-sz) reported by iostat (example “iostat -p /dev/nvme2n1p1 -dx 1”).
When I originally asked, shader utilization was swinging between 0% and 100%, the script was reporting Time values ranging from a fraction of a second to several seconds, and iostat showed large swings in read speed.
The batch size was 64 for the Radeon Pro VII and L4, and 256 for the L40S.
I ran torch.utils.bottleneck on the Radeon Pro VII. ROCm was not very happy about that and spewed many errors, but I got enough out of it to suspect the image loader was the culprit.
I found the example script defaulted to four workers for the image loader. Changing that had an immediate effect.
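For reference, here is a minimal sketch of what that change looks like, assuming the usual torchvision ImageFolder / DataLoader arrangement; the path and transforms below are placeholders, not the example script's actual ones.

```
import torch
from torchvision import datasets, transforms

# Placeholder dataset path and transform; the real example script differs.
dataset = datasets.ImageFolder(
    "/path/to/train/images",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,     # 64 for the Radeon Pro VII and L4, 256 for the L40S
    shuffle=True,
    num_workers=8,     # the knob that mattered: the default of 4 was too low
    pin_memory=True,
)
```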
- Radeon Pro VII in a local machine: eight workers was the sweet spot, giving ~99% shader utilization and a rock-steady Time value.
- Quad L4 in AWS (g6.12xlarge): 16 workers were needed to get the metrics stable, but shader utilization would not go over ~90% on a sliding window.
- Single L40S in AWS (g6e.2xlarge): 32 workers were needed to get the metrics stable, but shader utilization would not go over ~90% on a sliding window.
- Single L40S in AWS (g6e.8xlarge): 32 workers were needed to get the metrics stable, and shader utilization held steady at 99%.
- Quad L40S in AWS (g6e.24xlarge): I was unable to get shader utilization over about 80%, though the Time value was stable.
I looked at whether the SSD used to store the images might be the culprit. The picture is not so clear. None of the setups generated bandwidth that should stress a decent SSD (326 MB/s at the highest), but I am not at all sure how to interpret the queue length data. Very short queues when the number of data workers was too low were expected. For the Radeon Pro VII, L4, and L40S, once I had enough workers to max out the shaders, the SSD queue depth sat in the 0.22 to 2.7 range. For the quad-L40S system that I just could not max out, the SSD queue depth was modest, around 0.8. But I have no idea what to make of AWS's SSDs, i.e. how many intervening DPUs or networking layers actually sit between the CPU/GPU and the SSD.
What could be clearly observed was that, for every GPU setup, the SSD read rates (r/s and rkB/s) scaled up as the number of workers increased, up to the point where shader utilization leveled off. A fully utilized Radeon Pro VII was sucking in 96 MB/s of images, while the ~66% utilized quad L40S sucked in 326 MB/s.
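As a back-of-envelope check on the SSD question, if read bandwidth scales roughly linearly with shader utilization (a naive assumption on my part), the quad L40S at full utilization would want somewhere around 500 MB/s:

```
# Naive linear extrapolation from the observed numbers above.
observed_rate_mb_s = 326   # quad L40S read rate at ~66% shader utilization
observed_util = 0.66

estimated_full_util_rate = observed_rate_mb_s / observed_util
print(f"~{estimated_full_util_rate:.0f} MB/s at 100% utilization")  # ~494 MB/s
```

Even that figure keeps me thinking raw SSD bandwidth is not the limiter on the quad-L40S box.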