2 Nvidia A40 GPUs, but I still get "torch.cuda.OutOfMemoryError"

When I run a 70B LLM in int8, which needs about 70 GB of GPU memory, the system loads about 46 GB onto one A40 and never starts using the second A40. It then fails with the following error:

torch.empty(self.output_size,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 0; 44.42 GiB total capacity; 43.11 GiB already allocated; 426.81 MiB free; 43.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Please help me!

Reduce the memory requirement, e.g. by decreasing the batch size, using checkpointing, or offloading to the CPU.
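As a rough sanity check of the numbers in the error message, here is a back-of-the-envelope estimate of the weight memory alone. It assumes exactly 70e9 parameters at 1 byte each (int8) and ignores activations, the KV cache, and allocator overhead, so the real requirement would be higher. The variable names are just for illustration.

```python
# Rough estimate of the weight memory for an int8 model (assumptions noted inline).
GIB = 1024 ** 3

num_params = 70e9          # assumption: "70B" taken as exactly 70e9 parameters
bytes_per_param = 1        # int8 stores one byte per parameter
gpu_capacity_gib = 44.42   # total capacity reported for GPU 0 in the error message
num_gpus = 2

weights_gib = num_params * bytes_per_param / GIB

fits_on_one_gpu = weights_gib < gpu_capacity_gib
fits_on_two_gpus = weights_gib < num_gpus * gpu_capacity_gib

print(f"weights alone: {weights_gib:.1f} GiB")
print(f"fits on one GPU:  {fits_on_one_gpu}")
print(f"fits on two GPUs: {fits_on_two_gpus}")
```

Under these assumptions the weights by themselves exceed what a single A40 reports, so on one card they would only fit with some form of offloading.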

Thank you! But why can't the two A40 cards work together to solve this problem? I have also installed NVLink, but the result is still the same.

You can use both GPUs, but you would need to implement it explicitly, e.g. via a pipeline-parallel approach, as PyTorch won't shard the model automatically.
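To illustrate what "explicitly" means, here is a minimal sketch of naive model parallelism: the first half of the layers is placed on one device, the second half on another, and the activation is moved between them in `forward`. This is not a full pipeline-parallel implementation (there is no micro-batching, so only one GPU is busy at a time), and `TwoStageModel`, `pick_devices`, and the layer sizes are hypothetical stand-ins for a real LLM. It falls back to the CPU when fewer than two GPUs are visible so that it still runs.

```python
import torch
import torch.nn as nn


def pick_devices():
    # Use two GPUs when available; otherwise fall back to the CPU for both stages.
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoStageModel(nn.Module):
    """Hypothetical toy model whose layers are split across two devices."""

    def __init__(self, dev0, dev1, hidden=64, layers_per_stage=2):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1

        def make_stage():
            blocks = []
            for _ in range(layers_per_stage):
                blocks += [nn.Linear(hidden, hidden), nn.ReLU()]
            return nn.Sequential(*blocks)

        # Each stage's parameters live on its own device.
        self.stage0 = make_stage().to(dev0)
        self.stage1 = make_stage().to(dev1)

    def forward(self, x):
        # Run the first half on dev0, then copy the activation to dev1.
        x = self.stage0(x.to(self.dev0))
        x = self.stage1(x.to(self.dev1))
        return x


dev0, dev1 = pick_devices()
model = TwoStageModel(dev0, dev1).eval()

with torch.no_grad():
    out = model(torch.randn(8, 64))

print(out.shape, out.device)
```

With this layout each GPU only holds its own stage's weights, which is what lets a model that is too large for one card fit across two. A pipeline-parallel setup builds on the same split and additionally feeds micro-batches through the stages so that both GPUs work concurrently.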