Thought I should follow up. For context, I am using googlenet in 2024 because I am interested in playing with Deep Dream rather than making Deep Fakes.
I was measuring these metrics:
- GPU shader utilization using nvidia-smi or rocm-smi.
- The “Time” per minibatch value in the script output.
- The SSD read speed (rkB/s) and average queue length (aqu-sz) reported by iostat (example “iostat -p /dev/nvme2n1p1 -dx 1”).
When I originally asked, shader utilization was swinging between 0% and 100%, the script was reporting Time values ranging from a fraction of a second to several seconds, and iostat showed large swings in read speed.
The batch size was 64 for the Radeon Pro VII and L4, and 256 for the L40S.
I ran torch.utils.bottleneck on the Radeon Pro VII. ROCm was not very happy about that and spewed many errors, but I got enough out of it to suspect the image loader was the culprit.
I found the example script defaulted to four workers for the image loader. Changing that had an immediate effect.
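For reference, here is a minimal sketch of what that change looks like, assuming the usual torchvision ImageFolder / DataLoader arrangement; the path and transforms below are placeholders, not the example script's actual ones.

```
import torch
from torchvision import datasets, transforms

# Placeholder dataset path and transform; the real example script differs.
dataset = datasets.ImageFolder(
    "/path/to/train/images",
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,     # 64 for the Radeon Pro VII and L4, 256 for the L40S
    shuffle=True,
    num_workers=8,     # the knob that mattered: the default of 4 was too low
    pin_memory=True,
)
```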
- Radeon Pro VII in a local machine: eight workers was the sweet spot, giving ~99% shader utilization and a rock-steady Time value.
- Quad L4 in AWS (g6.12xlarge): 16 workers were needed to get the metrics stable, but shader utilization would not go over ~90% on a sliding window.
- Single L40S in AWS (g6e.2xlarge): 32 workers were needed to get the metrics stable, but shader utilization would not go over ~90% on a sliding window.
- Single L40S in AWS (g6e.8xlarge): 32 workers were needed to get the metrics stable, and shader utilization held steady at 99%.
- Quad L40S in AWS (g6e.24xlarge): I was unable to get shader utilization over about 80%, though the Time value was stable.
I looked at whether the SSD used to store the images might be the culprit. The picture is not so clear. None of the setups generated bandwidth that should stress a decent SSD (326 MB/s at the highest), but I am not at all sure how to interpret the queue length data. Very short queues when the number of data workers was too low were expected. For the Radeon Pro VII, L4, and L40S, once I had enough workers to max out the shaders, the SSD queue depth sat in the 0.22 to 2.7 range. For the quad-L40S system that I just could not max out, the SSD queue depth was modest, around 0.8. But I have no idea what to make of AWS's SSDs, i.e. how many intervening DPUs or networking layers actually sit between the CPU/GPU and the SSD.
What could be clearly observed was that, for every GPU setup, the SSD read rates (r/s and rkB/s) scaled up as the number of workers increased, up to the point where shader utilization leveled off. A fully utilized Radeon Pro VII was sucking in 96 MB/s of images, while the ~66% utilized quad L40S sucked in 326 MB/s.
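As a back-of-envelope check on the SSD question, if read bandwidth scales roughly linearly with shader utilization (a naive assumption on my part), the quad L40S at full utilization would want somewhere around 500 MB/s:

```
# Naive linear extrapolation from the observed numbers above.
observed_rate_mb_s = 326   # quad L40S read rate at ~66% shader utilization
observed_util = 0.66

estimated_full_util_rate = observed_rate_mb_s / observed_util
print(f"~{estimated_full_util_rate:.0f} MB/s at 100% utilization")  # ~494 MB/s
```

Even that figure keeps me thinking raw SSD bandwidth is not the limiter on the quad-L40S box.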