What is the fastest way to compute 5 million × 5 million dot products? I have two A100 GPUs and a 96-core CPU with 500 GB of memory

I need to multiply a 5 million × 768 matrix by its transpose. That is approximately (5 million)^2 = 2.5 × 10^13 dot products of 768-element vectors. As mentioned in the title, my resources are:

  • A 96-core CPU.
  • 500 GB of memory.
  • Two Nvidia A100 GPUs.

Please help: how do I do this as fast as possible, given my host memory and GPU memory constraints?
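
To make the constraints concrete, here is a rough sketch of the sizes involved and of a block-wise way to walk the product. It assumes float32 data (the post does not state a dtype), and `footprint_bytes` and `gram_blocks` are hypothetical helper names, not part of any library. It uses plain NumPy on the CPU only to illustrate the tiling; it is not a tuned GPU implementation.

```python
import numpy as np

def footprint_bytes(n_rows, n_cols, itemsize=4):
    """Return (input_bytes, full_output_bytes) for X (n_rows x n_cols) and X @ X.T.

    itemsize=4 assumes float32; this is an assumption, not stated in the post.
    """
    return n_rows * n_cols * itemsize, n_rows * n_rows * itemsize

def gram_blocks(x, block_rows):
    """Yield (i, j, tile) where tile == x[i:i+b] @ x[j:j+b].T.

    Only tiles with j >= i are produced, because X @ X.T is symmetric,
    which roughly halves the work. Each tile can be consumed (reduced,
    thresholded, written to disk) and then discarded.
    """
    n = x.shape[0]
    for i in range(0, n, block_rows):
        xi = x[i:i + block_rows]
        for j in range(i, n, block_rows):
            yield i, j, xi @ x[j:j + block_rows].T

# Sizes for the problem in the post, under the float32 assumption.
inp, out = footprint_bytes(5_000_000, 768)
print(f"input: {inp / 1e9:.2f} GB, full output: {out / 1e12:.0f} TB")
```

Under that assumption the input fits easily in host memory, but the full 5 million × 5 million result does not fit in 500 GB, so each tile would have to be reduced or filtered as it is produced rather than stored whole.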