Any suggestions or tools to diagnose training runs on two different CUDA cards?

I’m training an SSD model on two machines, one with a V100s and the other with a 1080ti. I set a target mAP (e.g., 20%) and log how many epochs it takes to reach that target. I also keep all settings identical on both cards (e.g., batch size, RNG seeds). On the V100s it quite consistently takes 100 epochs to hit the target mAP, but on the 1080ti it sometimes takes more than 100 epochs and sometimes exactly 100.
So I want to analyze the possible reasons for this and find ways to verify them.
Here is what I think:
The V100s and 1080ti are from completely different Nvidia GPU series: they have different CUDA compute capabilities (7.x vs. 6.x) and different architectures, so some algorithms might be implemented differently for them. The minute differences (I would guess that for the same CUDA op, the outputs shouldn’t differ much given the same inputs) across all the ops in the model might eventually add up to a noticeable difference?
Methods to verify this?
I think I could use forward/backward hooks to record intermediate results, including parameters, layer outputs, and gradients w.r.t. inputs and weights, and then try to figure out where the divergence happens. But the dumped files are quite large and the layer results are not on the same magnitude, so I easily get lost. Any good practice for this?
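Here is roughly the hook setup I have in mind (a minimal sketch; the dump file names and the relative-error metric are just what I’m considering, not something I’ve settled on):

```python
import torch

def attach_hooks(model, store):
    """Record leaf-module outputs and grad-outputs into `store` (name -> CPU tensor)."""
    handles = []
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # leaf modules only

        def fwd_hook(mod, inp, out, name=name):
            if isinstance(out, torch.Tensor):
                store[f"{name}.out"] = out.detach().float().cpu()

        def bwd_hook(mod, grad_in, grad_out, name=name):
            if grad_out and isinstance(grad_out[0], torch.Tensor):
                store[f"{name}.grad_out"] = grad_out[0].detach().float().cpu()

        handles.append(module.register_forward_hook(fwd_hook))
        handles.append(module.register_full_backward_hook(bwd_hook))
    return handles

# Compare two dumps with a scale-free relative error so layers with very
# different magnitudes can be judged on the same footing.
def rel_err(a, b, eps=1e-6):
    return ((a - b).abs().max() / (b.abs().max() + eps)).item()

# e.g., after one identical forward/backward pass on each card:
# store_v100 = torch.load("dump_v100.pt"); store_1080 = torch.load("dump_1080.pt")
# for k in store_v100:
#     print(k, rel_err(store_v100[k], store_1080[k]))
```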
If it’s not an architecture issue, what are some other possible reasons, and how could I verify them? Thanks.

Just to rule out one source of architecture differences: is mixed precision (e.g., fp16 training) or some other form of reduced-precision being used? More variability across architectures would be expected with reduced precision.

Yes, I am using mixed precision in my training.

Interesting; naively I would have expected the V100 to be potentially “worse” in convergence at fp16 because of the possibility of reduced-precision reductions involving tensor cores.

Can you provide some more details about the model architecture? Since mAP is the metric, it sounds like a conv-heavy object detection model. In that case I would not expect the use of mixed precision to be the reason for slower convergence, but it might be useful to double-check what happens in single precision if you can spare the extra compute time.
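If you are on native AMP, one convenient way is to gate it behind a single flag so the exact same script can be rerun in full fp32. A sketch along those lines, where the toy model and random batch are just stand-ins for the real SSD training loop:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

use_amp = False  # flip to True for the mixed-precision run
device = "cuda"

# Stand-in model/optimizer; replace with the actual SSD training setup.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler(enabled=use_amp)  # no-op when disabled

for step in range(5):  # stand-in for the real data loader
    images = torch.randn(4, 3, 64, 64, device=device)
    targets = torch.randint(0, 10, (4,), device=device)

    optimizer.zero_grad()
    with autocast(enabled=use_amp):   # disabled -> plain fp32 math
        loss = nn.functional.cross_entropy(model(images), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```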

As for comparing results such as model weights and outputs, you might consider taking a look at Reproducibility — PyTorch 1.10.0 documentation to see if reproducibility can be improved, but as you correctly note there won’t be any guarantees across cards with different hardware architectures, especially in half-precision layers.
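For reference, the knobs from that page look roughly like this (they reduce run-to-run variance on a single card; they do not make two different GPU architectures bit-identical):

```python
import os
import random
import numpy as np
import torch

def seed_everything(seed: int = 0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # seeds CPU and all CUDA devices

    # Make cuDNN pick deterministic kernels instead of benchmarking.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Error out on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)

    # Needed for deterministic cuBLAS on CUDA >= 10.2 (see the docs page);
    # ideally set before any CUDA work happens.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

seed_everything(0)
```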

Thanks for your insights on tensor cores. I’m training an SSD (Single Shot MultiBox Detector) model. So on a 1080ti without tensor cores, how does it do fp16 calculations? And why would you expect it to be more accurate?

I think it would use the “conventional” SM hardware to do the fp16 computation, meaning there wouldn’t be extreme speedups expected over fp32, but also no strange rounding behavior (e.g., from reduced-precision accumulations).
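As a toy numeric illustration of what reduced-precision accumulation error looks like (a generic example, not a claim about what either card actually does internally):

```python
import torch

small = torch.tensor(0.01, dtype=torch.float16)

acc_fp16 = torch.tensor(0.0, dtype=torch.float16)   # accumulate in fp16
acc_fp32 = torch.tensor(0.0, dtype=torch.float32)   # accumulate in fp32
for _ in range(100_000):
    acc_fp16 += small
    acc_fp32 += small.float()

# Once the running sum grows large enough, adding 0.01 in fp16 no longer
# changes it, so acc_fp16 stalls far below the true sum of ~1000, while the
# fp32 accumulator stays close (and could still be cast to fp16 at the end).
print(acc_fp16, acc_fp32)
```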

There are likely litmus tests you can construct from this, e.g., train an fp32 model on one setup, create a mixed-precision copy on both setups, and see which one is closer to the “ground truth” in output.
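A rough sketch of that comparison (the `build_ssd()` helper, the file names, and the single-tensor output are placeholders; adapt them to the real SSD code, and compare outputs element-wise per head if the model returns several tensors):

```python
import torch

model = build_ssd()                                   # hypothetical model constructor
model.load_state_dict(torch.load("ssd_fp32_reference.pt"))
model.cuda().eval()

images = torch.load("fixed_batch.pt").cuda()          # the same saved batch on both machines

with torch.no_grad():
    # Step 1 (once, on the machine that trained the fp32 model): save the
    # full-precision outputs as the "ground truth".
    # ref = model(images); torch.save(ref.cpu(), "ref_outputs.pt")

    # Step 2 (on each card): run the same weights under autocast and measure
    # how far the mixed-precision outputs drift from the fp32 reference.
    ref = torch.load("ref_outputs.pt").cuda()
    with torch.cuda.amp.autocast():
        amp_out = model(images)

print((ref - amp_out.float()).abs().max())
```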