H100 vs A100 Memory Usage Difference

Ty4Reading · January 2, 2025, 5:02am

Hi, I have been training a transformer model on a dataset on an A100 SXM4 with 40GB memory.

I decided to try and train the exact same model with the same scripts on the same dataset, but using an H100 PCIE with 80GB memory, hoping to potentially double the batch size and increase training efficiency.

However, I found that the exact same model seems to take more memory usage for the same batch size on the same dataset. Almost double the size, so I don’t really see any improvement.

Has anyone experienced this before? Or any ideas on what could cause the same model & batch size & data to have 2x the memory used in H100 vs A100?

ptrblck · January 2, 2025, 3:52pm

No, I haven’t seen this effect before, so could you post a minimal and executable code snippet showing the 2x increase in memory usage for exactly the same workload, please?

Ty4Reading · January 5, 2025, 12:57am

I am unfortunately not able to share a minimal and executable code snippet as the model is very customised to the dataset and problem I’m working on.

However, I did a bit more debugging and I think I’ve found the root cause of the issue (possibly).

It seems like a potential issue with how caching is performed. I’ve noticed that on the H100, at the start of training the memory tends to shoot up to almost 2x the memory usage as the A100 on the same dataset & model. However, after a bit of time, the memory tends to decrease on the H100 and becomes more stable around 1.2x the memory usage of the A100.

I’ll be skipping the H100 for now, until I can spend more time debugging what appears to be the caching issues I’m experiencing.

bdhirsh · January 5, 2025, 1:05am

It might also help if you can generate a memory snapshot of both runs- there’s a guide for it here: Understanding CUDA Memory Usage — PyTorch 2.5 documentation

Ty4Reading · January 5, 2025, 2:30am

Wow, I had no clue this existed! Thank you for linking, I will try renting another H100 to experiment some more and try again. This looks like it will be very helpful in debugging the issue.

ekomp · February 4, 2025, 11:37pm

I am having the same issue, except A40 and H100.

Unfortunately, my “working” gpus are the H100s and the A40s are just on the interactive node I deved my script on… so cannot ignore it. If anyone has a solution would love to know. I will attempt to log the memory usage on both pieces of hardware as @bdhirsh suggested.

ekomp · February 5, 2025, 1:10am

Interestingly:

This does not happen for me with 1 gpu. Only when running data parallel. Additionally, I can run a full eval loop at the start of training no problem, only when trying to back propagate does the memory explode.
Oddly, running the script a second time (on a job) does not have this issue. If I end the job and restart it, it happens again.

I tried running the snapshot as suggested, but the pickle file blows up the visualizer and throws an error.

@Ty4Reading is any of this consistent?

ekomp · February 5, 2025, 9:15pm

This issue was tentatively solved by solving a stride mismatch error eg. UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 4, 12], strides() = [1, 12, 1] bucket_view.sizes() = [1, 4, 12], strides() = [48, 12, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:327.)

Not sure why this would be related or it it is coincidental, but I am no longer experiencing memory explosion on new h100 jobs.