torch.cuda.memory_snapshot() shows an unexpected 8519680 B memory allocation during matrix multiplication

I encountered an unexpected memory allocation in the output of torch.cuda.memory_snapshot() when I ran the following code:

import torch

W = torch.rand(1000, 10).cuda()  # weight matrix: 1000 * 10 * 4 B = 40000 B
b = torch.rand(10).cuda()        # bias: 10 * 4 B = 40 B
X = torch.rand(1000).cuda()      # input vector: 1000 * 4 B = 4000 B

y = X @ W + b                    # result: 10 * 4 B = 40 B
print(torch.cuda.memory_snapshot())

Since W requests 1000 * 10 * 4 B = 40000 B, b requests 10 * 4 B = 40 B, X requests 1000 * 4 B = 4000 B, and y requests 10 * 4 B = 40 B, and the caching allocator rounds each allocation up for alignment, the output should look something like this if everything works as expected (I did get a similar result in another environment; this output is from torch 1.10.0):

[
  {
    device: 0,
    address: 68780294144,
    total_size: 2097152,
    allocated_size: 45568,
    active_size: 45568,
    segment_type: "small",
    blocks: [
      { size: 40448, state: "active_allocated" },
      { size: 512, state: "active_allocated" },
      { size: 4096, state: "active_allocated" },
      { size: 512, state: "inactive" },
      { size: 512, state: "active_allocated" },
      { size: 2051072, state: "inactive" },
    ],
  },
];
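For reference, the block sizes in this snapshot are just the raw tensor sizes rounded up to the caching allocator's 512 B granularity. A minimal sketch of that arithmetic (assuming 512 B rounding for small allocations, which matches the numbers above):

import math

def rounded_size(numel, elem_bytes=4, granularity=512):
    # Round the raw tensor size up to the allocator's granularity.
    raw = numel * elem_bytes
    return math.ceil(raw / granularity) * granularity

print(rounded_size(1000 * 10))  # W: 40000 B -> 40448 B
print(rounded_size(1000))       # X: 4000 B  -> 4096 B
print(rounded_size(10))         # b or y: 40 B -> 512 B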

But this time, in addition to the output above, I got an extra segment like this:

{
  device: 0,
  address: 23007248515072,
  total_size: 20971520,
  allocated_size: 8519680,
  active_size: 8519680,
  requested_size: 8519680,
  stream: 0,
  segment_type: "large",
  blocks: [
    { size: 8519680, requested_size: 8519680, state: "active_allocated" },
    { size: 12451840, requested_size: 0, state: "inactive" },
  ],
}

This extra segment always has the same fixed size, 8519680 B, and it only appears when I do matrix multiplications or use an nn.Linear. If I only use nn.Conv2d or nn.RNN, the magic number disappears, just as expected.

The environment in which I encountered this problem: PyTorch 2.0.0 + CUDA 11.4 + Ubuntu 20.04.

The environment in which I didn't encounter it: PyTorch 1.10.0 + CUDA 11.6 + Windows 10.

Any information will be helpful.

This behavior is expected, as we are allocating a cuBLAS workspace.
You could free it via torch._C._cuda_clearCublasWorkspaces() and then rerun the memory snapshot.
@eqy also pointed out that removing the cuBLAS workspace, e.g. via:

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

before creating the handle should avoid the memory allocation.
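A minimal sketch combining both approaches (the env var must be set before the first cuBLAS call in the process, and torch._C._cuda_clearCublasWorkspaces() is an internal API, so treat it accordingly):

import os

# Option 1: disable the cuBLAS workspace entirely. This must happen before
# the first cuBLAS handle is created, i.e. before the first matmul on the GPU.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

import torch

W = torch.rand(1000, 10).cuda()
X = torch.rand(1000).cuda()
y = X @ W  # with ":0:0" the "large" workspace segment should not appear

# Option 2: if the workspace was already allocated, free it afterwards
# and take the snapshot again.
torch._C._cuda_clearCublasWorkspaces()
torch.cuda.empty_cache()  # optionally release the now-inactive segment
print(torch.cuda.memory_snapshot())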


That’s right, thank you!

Could you tell me what's behind the 8519680 B allocation? I'd like to understand why it is exactly this value.

The default cuBLAS workspace size is defined here as:

const size_t default_size = sm90 ? 4096 * 8 * 1024 : 4096 * 1024 * 2 + 16 * 1024 * 8;

Assuming you are not using a Hopper device (sm_90), you would thus get:

4096 * 1024 * 2 + 16 * 1024 * 8 = 8519680 bytes

which is split into different internal workspaces.
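Checking that arithmetic directly (the sm_90 branch is included just for comparison):

# Default cuBLAS workspace sizes chosen by PyTorch (see the C++ snippet above)
default_size_pre_hopper = 4096 * 1024 * 2 + 16 * 1024 * 8
default_size_sm90 = 4096 * 8 * 1024

print(default_size_pre_hopper)  # 8519680  -> the magic number in the snapshot
print(default_size_sm90)        # 33554432 -> 32 MiB on Hopper (sm_90)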


Thank you for your excellent reply :smiling_face_with_three_hearts:

Thank you for your patience.

I have another question: why was this magic number (i.e. 8519680 B on non-sm_90 devices) chosen as the default? Is it for performance reasons, for kernel-scheduling correctness, or something else?

For performance reasons, a larger workspace is used by default on Hopper GPUs (H100 in this case).
Since these devices also come with a large memory pool, the increased cuBLAS workspace will most likely not be noticed.

That's interesting. So is the workspace size an empirical value observed in practice, or is there some theoretical basis behind it?

From the docs:

Too small workspaceSizeInBytes may cause some routines to fail with CUBLAS_STATUS_ALLOC_FAILED error returned or cause large regressions in performance. Workspace size equal to or larger than 16KiB is enough to prevent CUBLAS_STATUS_ALLOC_FAILED error, while a larger workspace can provide performance benefits for some routines. Recommended size of user-provided workspace is at least 4MiB (to match cuBLAS’ default workspace pool).

The workspace is used by routines running in parallel streams to store their intermediates.

Thanks again for your kindness.
As the documentation says, the recommended size is at least 4 MiB. I'm just wondering what would happen if I used 4 MiB (rather than the current default_size) as the default workspace size. Would it cause performance degradation or non-determinism problems?
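In case it helps frame the question, this is the experiment I have in mind (assuming the ":SIZE:COUNT" format with SIZE in KiB from the cuBLAS docs, so ":4096:1" should request roughly one 4 MiB workspace):

import os
# Assumption: ":SIZE:COUNT" with SIZE in KiB, so ":4096:1" requests a single
# ~4 MiB cuBLAS workspace instead of the 8519680 B default.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:1"

import torch

W = torch.rand(1000, 10).cuda()
X = torch.rand(1000).cuda()
y = X @ W

# Inspect the snapshot to see how large the workspace segment ends up being.
for seg in torch.cuda.memory_snapshot():
    if seg["segment_type"] == "large":
        print(seg["allocated_size"])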