torch.cuda.memory_snapshot() shows an unexpected 8519680 B memory allocation during matrix multiplication

I encountered an unexpected memory allocation in the output of torch.cuda.memory_snapshot() when I ran the following code:

import torch

W = torch.rand(1000, 10).cuda()  # weight matrix: 1000 * 10 * 4 B = 40000 B
b = torch.rand(10).cuda()        # bias: 10 * 4 B = 40 B
X = torch.rand(1000).cuda()      # input vector: 1000 * 4 B = 4000 B

y = X @ W + b                    # result: 10 * 4 B = 40 B
print(torch.cuda.memory_snapshot())

Since W requests 1000 * 10 * 4 B = 40000 B, b requests 10 * 4 B = 40 B, X requests 1000 * 4 B = 4000 B, and y requests 10 * 4 B = 40 B, and the caching allocator rounds each allocation up for alignment, the output should look something like this if everything works as expected (I did get a similar result in another environment; this output is from torch 1.10.0):

[
  {
    device: 0,
    address: 68780294144,
    total_size: 2097152,
    allocated_size: 45568,
    active_size: 45568,
    segment_type: "small",
    blocks: [
      { size: 40448, state: "active_allocated" },
      { size: 512, state: "active_allocated" },
      { size: 4096, state: "active_allocated" },
      { size: 512, state: "inactive" },
      { size: 512, state: "active_allocated" },
      { size: 2051072, state: "inactive" },
    ],
  },
];
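For reference, the block sizes in this snapshot are just the raw tensor sizes rounded up to the caching allocator's 512 B granularity. A minimal sketch of that arithmetic (assuming 512 B rounding for small allocations, which matches the numbers above):

import math

def rounded_size(numel, elem_bytes=4, granularity=512):
    # Round the raw tensor size up to the allocator's granularity.
    raw = numel * elem_bytes
    return math.ceil(raw / granularity) * granularity

print(rounded_size(1000 * 10))  # W: 40000 B -> 40448 B
print(rounded_size(1000))       # X: 4000 B  -> 4096 B
print(rounded_size(10))         # b or y: 40 B -> 512 B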

But this time, in addition to the output above, I got an extra segment like this:

{
  device: 0,
  address: 23007248515072,
  total_size: 20971520,
  allocated_size: 8519680,
  active_size: 8519680,
  requested_size: 8519680,
  stream: 0,
  segment_type: "large",
  blocks: [
    { size: 8519680, requested_size: 8519680, state: "active_allocated" },
    { size: 12451840, requested_size: 0, state: "inactive" },
  ],
}

This extra segment always has the same fixed size, 8519680 B, and it only appears when I do matrix multiplications or use an nn.Linear. If I only use nn.Conv2d or nn.RNN, the magic number disappears, just as expected.

The environment in which I encountered this problem: PyTorch 2.0.0 + CUDA 11.4 + Ubuntu 20.04.

The environment in which I didn't encounter it: PyTorch 1.10.0 + CUDA 11.6 + Windows 10.

Any information will be helpful.

This behavior is expected, as we are allocating a cuBLAS workspace.
You could free it via torch._C._cuda_clearCublasWorkspaces() and then rerun the memory snapshot.
@eqy also pointed out that removing the cuBLAS workspace, e.g. via:

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

before creating the handle should avoid the memory allocation.
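A minimal sketch combining both approaches (the env var must be set before the first cuBLAS call in the process, and torch._C._cuda_clearCublasWorkspaces() is an internal API, so treat it accordingly):

import os

# Option 1: disable the cuBLAS workspace entirely. This must happen before
# the first cuBLAS handle is created, i.e. before the first matmul on the GPU.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":0:0"

import torch

W = torch.rand(1000, 10).cuda()
X = torch.rand(1000).cuda()
y = X @ W  # with ":0:0" the "large" workspace segment should not appear

# Option 2: if the workspace was already allocated, free it afterwards
# and take the snapshot again.
torch._C._cuda_clearCublasWorkspaces()
torch.cuda.empty_cache()  # optionally release the now-inactive segment
print(torch.cuda.memory_snapshot())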


That’s right, thank you!

Could you tell me what's behind the 8519680 B allocation? I'd like to understand why it is exactly this value.

The default cuBLAS workspace size is defined here as:

const size_t default_size = sm90 ? 4096 * 8 * 1024 : 4096 * 1024 * 2 + 16 * 1024 * 8;

Assuming you are not using a Hopper device (sm_90), you would thus get:

4096 * 1024 * 2 + 16 * 1024 * 8 = 8519680 bytes

which is split into different internal workspaces.
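Checking that arithmetic directly (the sm_90 branch is included just for comparison):

# Default cuBLAS workspace sizes chosen by PyTorch (see the C++ snippet above)
default_size_pre_hopper = 4096 * 1024 * 2 + 16 * 1024 * 8
default_size_sm90 = 4096 * 8 * 1024

print(default_size_pre_hopper)  # 8519680  -> the magic number in the snapshot
print(default_size_sm90)        # 33554432 -> 32 MiB on Hopper (sm_90)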


Thank you for your excellent reply :smiling_face_with_three_hearts:

Thank you for your patience.

I have another question: why was this magic number (i.e. 8519680 B on non-sm_90 devices) chosen as the default? Is it for performance reasons, for kernel-scheduling correctness, or something else?

For performance reasons, a larger workspace is used by default on Hopper GPUs (H100 in this case).
Since these devices also come with a large memory pool, the increased cuBLAS workspace will most likely not be noticed.

That's interesting. So is the workspace size an empirical value observed in practice, or is there some theoretical basis behind it?

From the docs:

Too small workspaceSizeInBytes may cause some routines to fail with CUBLAS_STATUS_ALLOC_FAILED error returned or cause large regressions in performance. Workspace size equal to or larger than 16KiB is enough to prevent CUBLAS_STATUS_ALLOC_FAILED error, while a larger workspace can provide performance benefits for some routines. Recommended size of user-provided workspace is at least 4MiB (to match cuBLAS’ default workspace pool).

The workspace is used by routines running in parallel streams to store their intermediates.

Thanks again for your kindness.
As the documentation says, the recommended size is at least 4 MiB. I'm just wondering what would happen if I used 4 MiB (rather than the current default_size) as the default workspace size. Would it cause performance degradation or non-determinism problems?
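In case it helps frame the question, this is the experiment I have in mind (assuming the ":SIZE:COUNT" format with SIZE in KiB from the cuBLAS docs, so ":4096:1" should request roughly one 4 MiB workspace):

import os
# Assumption: ":SIZE:COUNT" with SIZE in KiB, so ":4096:1" requests a single
# ~4 MiB cuBLAS workspace instead of the 8519680 B default.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:1"

import torch

W = torch.rand(1000, 10).cuda()
X = torch.rand(1000).cuda()
y = X @ W

# Inspect the snapshot to see how large the workspace segment ends up being.
for seg in torch.cuda.memory_snapshot():
    if seg["segment_type"] == "large":
        print(seg["allocated_size"])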