H100 vs A100 Memory Usage Difference

Hi, I have been training a transformer model on a dataset using an A100 SXM4 with 40GB of memory.

I decided to try training the exact same model with the same scripts on the same dataset, but on an H100 PCIe with 80GB of memory, hoping to roughly double the batch size and improve training efficiency.

However, I found that the exact same model uses more memory for the same batch size on the same dataset, almost double in fact, so I don't really see any improvement.
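For reference, here is roughly how I'm reading the memory numbers on each GPU (a minimal sketch; the tiny linear layer below just stands in for my actual model and data):

```python
import torch
import torch.nn as nn

device = "cuda"
torch.cuda.reset_peak_memory_stats(device)

# Tiny stand-in for the real model and batch, just to exercise the counters
model = nn.Linear(4096, 4096).to(device)
x = torch.randn(64, 4096, device=device)
loss = model(x).sum()
loss.backward()

gib = 1024 ** 3
# Peak memory actually occupied by live tensors
print(f"max allocated: {torch.cuda.max_memory_allocated(device) / gib:.2f} GiB")
# Peak memory reserved by PyTorch's caching allocator (allocated + cached blocks)
print(f"max reserved:  {torch.cuda.max_memory_reserved(device) / gib:.2f} GiB")
```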

Has anyone experienced this before? Or any ideas on what could cause the same model, batch size, and data to use 2x the memory on the H100 vs the A100?

No, I haven’t seen this effect before, so could you post a minimal and executable code snippet showing the 2x increase in memory usage for exactly the same workload, please?

I am unfortunately not able to share a minimal and executable code snippet as the model is very customised to the dataset and problem I’m working on.

However, after a bit more debugging I think I may have found the root cause of the issue.

It seems like a potential issue with how caching is handled. I've noticed that on the H100, at the start of training the memory shoots up to almost 2x the memory usage of the A100 on the same dataset and model. After a while, though, the memory usage on the H100 drops and stabilizes at around 1.2x that of the A100.
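To check whether that initial spike is real tensor allocations or just blocks held by PyTorch's caching allocator, I'm logging allocated vs. reserved memory during training, roughly like this (a minimal sketch; the loop shown in the comments is a placeholder for my actual training code):

```python
import torch

def log_memory(step: int) -> None:
    """Print allocated vs reserved memory; a large gap suggests cached blocks."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated() / gib  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / gib    # memory held by the allocator
    print(f"step {step:5d}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Hypothetical usage inside my training loop:
# for step, batch in enumerate(dataloader):
#     loss = model(batch).sum()
#     loss.backward()
#     optimizer.step(); optimizer.zero_grad()
#     if step % 100 == 0:
#         log_memory(step)
```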

I'll be skipping the H100 for now, until I can spend more time debugging what appear to be caching issues.

It might also help if you could generate a memory snapshot of both runs; there's a guide for it here: Understanding CUDA Memory Usage — PyTorch 2.5 documentation
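Roughly, the recording part of that workflow looks like this (a short sketch; the filename and `max_entries` value are just examples):

```python
import torch

# Start recording allocation events (stack traces are captured by default)
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the training steps you want to capture ...

# Dump the recorded history to a file you can open at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("h100_run_snapshot.pickle")

# Stop recording
torch.cuda.memory._record_memory_history(enabled=None)
```

Comparing the snapshots from the A100 and H100 runs should show whether the extra memory comes from larger allocations or from blocks kept around by the caching allocator.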


Wow, I had no clue this existed! Thank you for linking it. I'll rent another H100 and experiment some more; this looks like it will be very helpful in debugging the issue.