How do I check the GPU memory being used?

I am running a model in eval mode, and I wrote these lines of code after the forward pass to look at the memory in use:

print(f"torch.cuda.memory_allocated: {torch.cuda.memory_allocated(0)/1024**3:f}GB")
print(f"torch.cuda.memory_reserved: {torch.cuda.memory_reserved(0)/1024**3:f}GB")
print(f"torch.cuda.max_memory_reserved: {torch.cuda.max_memory_reserved(0)/1024**3:f}GB")

which prints out

torch.cuda.memory_allocated: 0.004499GB
torch.cuda.memory_reserved: 0.007812GB
torch.cuda.max_memory_reserved: 0.007812GB

However, running nvidia-smi tells me that the python process is using 1349 MiB. What causes the difference?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   57C    P0    33W /  N/A |   2392MiB /  7982MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1103      G   /usr/lib/xorg/Xorg                106MiB |
|    0   N/A  N/A      1702      G   /usr/lib/xorg/Xorg                476MiB |
|    0   N/A  N/A      1874      G   /usr/bin/gnome-shell               87MiB |
|    0   N/A  N/A      2331      G   ...AAAAAAAAA= --shared-files       51MiB |
|    0   N/A  N/A      4307      G   /usr/lib/firefox/firefox          175MiB |
|    0   N/A  N/A      4569      G   /usr/lib/firefox/firefox           37MiB |
|    0   N/A  N/A     21370      G   ...AAAAAAAAA= --shared-files       33MiB |
|    0   N/A  N/A     24668      G   ...AAAAAAAAA= --shared-files       56MiB |
|    0   N/A  N/A     25867      C   python                           1349MiB |
+-----------------------------------------------------------------------------+

The CUDA context needs approx. 600-1000MB of GPU memory, depending on the CUDA version as well as the device. I don't know whether your prints are correct, since ~4MB would be quite small for an entire training script (assuming you are not using a tiny model).
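To make the distinction concrete, here is a small sketch (assuming a CUDA-capable machine and a reasonably recent PyTorch, which provides `torch.cuda.mem_get_info`): PyTorch's counters only track tensors that went through its caching allocator, while a driver-level query sees the context overhead as well, just like nvidia-smi does.

```python
import torch

if torch.cuda.is_available():
    torch.cuda.init()  # creates the CUDA context, but allocates no tensors
    # Only tensors allocated through PyTorch's caching allocator are
    # counted here, so this is still 0 bytes:
    print(torch.cuda.memory_allocated(0))
    # mem_get_info asks the driver (cudaMemGetInfo), so the several
    # hundred MB of context overhead *is* visible here, as in nvidia-smi:
    free, total = torch.cuda.mem_get_info(0)
    print(f"{(total - free) / 1024**2:.0f} MiB in use on the device")
```

The gap between `memory_allocated` and the `mem_get_info` figure is essentially the context plus whatever other processes hold on the device.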

Thank you for responding. I am indeed using a very small model, with only around 200k total parameters during training. The values above are with the model in eval mode and a batch size of 128. I also called torch.cuda.memory_summary(), which printed the following:

Validation batch  54  of  298
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    4717 KB |    8050 KB |  159767 KB |  155050 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    4717 KB |    8050 KB |  159767 KB |  155050 KB |
|---------------------------------------------------------------------------|
| Active memory         |    4717 KB |    8050 KB |  159767 KB |  155050 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    4717 KB |    8050 KB |  159767 KB |  155050 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    8192 KB |    8192 KB |    8192 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    8192 KB |    8192 KB |    8192 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |    3474 KB |    3592 KB |  162857 KB |  159382 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    3474 KB |    3592 KB |  162857 KB |  159382 KB |
|---------------------------------------------------------------------------|
| Allocations           |     174    |     280    |    3635    |    3461    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |     174    |     280    |    3635    |    3461    |
|---------------------------------------------------------------------------|
| Active allocs         |     174    |     280    |    3635    |    3461    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |     174    |     280    |    3635    |    3461    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       4    |       4    |       4    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       4    |       4    |       4    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       5    |       8    |    1273    |    1268    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       5    |       8    |    1273    |    1268    |
|===========================================================================|

which also shows that my model is using 4717 KB of allocated memory at test time, which is equivalent to the 0.004499 GB printed above.

I have two last questions please.

  1. A batch size of 128 prints torch.cuda.memory_allocated: 0.004499GB whereas increasing it to 1024 prints torch.cuda.memory_allocated: 0.005283GB. Can I confirm that the difference of approximately 1MB is only due to the increased batch size?

  2. And is there a reason why nvidia-smi goes from 1349MiB to 1355MiB when going from a batch size of 128 to 1024? The increase is not consistent with the 1MB I got from the previous point. Does the memory used by CUDA context also vary based on the batch size?

PyTorch allocates memory from a large or a small pool, each of which uses fixed block sizes, so the reserved memory can be larger than the exact number of bytes needed to store the tensors.
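As an illustration, the caching allocator rounds every allocation up to a multiple of 512 bytes (an implementation detail of current PyTorch versions, so it may change); this mimics that rounding in plain arithmetic, with no GPU needed:

```python
def round_to_block(nbytes, block=512):
    # PyTorch's caching allocator rounds every allocation up to a
    # multiple of 512 bytes (implementation detail; may change).
    return -(-nbytes // block) * block  # ceiling division

# A single float32 scalar needs 4 bytes but occupies a full block:
print(round_to_block(4))            # 512

# A (128, 63) float32 batch: 128 * 63 * 4 = 32256 bytes, which is
# already a multiple of 512, so nothing extra is added:
print(round_to_block(128 * 63 * 4)) # 32256
```

This rounding, plus the 2 MB segments the small pool reserves, explains why `memory_reserved` (8192 KB above) exceeds `memory_allocated` (4717 KB).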

Your current description of the model doesn't match the memory reported by nvidia-smi, so could you post the model definition as well as the input shape?

This is the model definition; the layer sizes are printed below it. The model is made up of 2 VAEs. The inputs inp_pose and key_pose each have shape torch.Size([128, 63]).

import torch
import torch.nn as nn
import torch.distributions as tdist

# make_mlp (defined elsewhere in the project) builds an nn.Sequential
# from lists of layer sizes and activation names.

class model(nn.Module):
    def __init__(self, args):
        super(model, self).__init__()

        # copy every entry of the args namespace onto the module
        for key, value in args.__dict__.items():
            setattr(self, key, value)

        """
        Pose
        """
          
        self.inp_pose_encoder = make_mlp([self.pose_encoder_units[0]+3]+self.pose_encoder_units[1:],self.pose_encoder_activations)
        self.key_pose_encoder = make_mlp(self.pose_encoder_units,self.pose_encoder_activations)
        self.pose_mu          = make_mlp(self.pose_mu_var_units,self.pose_mu_var_activations)
        self.pose_log_var     = make_mlp(self.pose_mu_var_units,self.pose_mu_var_activations)
        self.key_pose_decoder = make_mlp(self.pose_decoder_units,self.pose_decoder_activations)
        
        """
        Time
        """
        
        self.delta_pose_encoder = make_mlp(self.delta_pose_encoder_units,self.delta_pose_encoder_activations)
        self.time_encoder       = make_mlp(self.time_encoder_units,self.time_encoder_activations)
        self.time_mu            = make_mlp(self.time_mu_var_units,self.time_mu_var_activations)
        self.time_log_var       = make_mlp(self.time_mu_var_units,self.time_mu_var_activations)
        self.time_decoder       = make_mlp(self.time_decoder_units,self.time_decoder_activations)
        
        self.norm = tdist.Normal(torch.tensor(0.0), torch.tensor(1.0))
                        
    def forward(self, data, mode):
    
        #batch_size = data["inp_pose"].shape[0]
    
        inp_pose = data["inp_pose"].view(self.batch_size,-1) 
        key_pose = data["key_pose"].view(self.batch_size,-1)
        key_object = data["key_object"].view(self.batch_size,-1)
               
        """
        compute pose
        """
    
        # feed x and y
        inp_pose_features = torch.cat((inp_pose, key_object), dim=1)
        inp_pose_features = self.inp_pose_encoder(inp_pose_features)
        key_pose_features = self.key_pose_encoder(key_pose)
        
        # get gaussian parameters
        pose_posterior = torch.cat((inp_pose_features,key_pose_features),dim=1)
        pose_posterior_mu = self.pose_mu(pose_posterior)
        pose_posterior_log_var = self.pose_log_var(pose_posterior)
        
        # sample
        pose_posterior_std = torch.exp(0.5*pose_posterior_log_var)
        pose_posterior_eps = self.norm.sample([self.batch_size, pose_posterior_mu.shape[1]]).cuda()
        pose_posterior_z   = pose_posterior_mu + pose_posterior_eps*pose_posterior_std
        
        z_p = pose_posterior_z if mode == "tr" else self.norm.sample([self.batch_size, self.pose_mu_var_units[-1]]).cuda()
        
        # forecast
        pred_key_pose = torch.cat((z_p,inp_pose_features),dim=1)
        pred_key_pose = self.key_pose_decoder(pred_key_pose)
        
        """
        compute time
        """
        
        # compute delta_pose
        delta_pose = key_pose - inp_pose
        
        # feed x and y
        delta_pose_features = self.delta_pose_encoder(delta_pose)
        time_features = self.time_encoder(data["time"].unsqueeze(1))
        
        # get gaussian parameters
        time_posterior = torch.cat((delta_pose_features,time_features),dim=1)
        time_posterior_mu = self.time_mu(time_posterior)
        time_posterior_log_var = self.time_log_var(time_posterior)
        
        # sample
        time_posterior_std = torch.exp(0.5*time_posterior_log_var)
        time_posterior_eps = self.norm.sample([self.batch_size, time_posterior_mu.shape[1]]).cuda()
        time_posterior_z   = time_posterior_mu + time_posterior_eps*time_posterior_std
        
        z_t = time_posterior_z if mode == "tr" else self.norm.sample([self.batch_size, self.time_mu_var_units[-1]]).cuda()
              
        # compute time
        time = torch.cat((z_t,delta_pose_features),dim=1)
        time = torch.squeeze(self.time_decoder(time))   
        
        return {"key_pose":pred_key_pose.view(self.batch_size,21,3), "time":time,
                "pose_posterior":{"mu":pose_posterior_mu, "log_var":pose_posterior_log_var}, 
                "time_posterior":{"mu":time_posterior_mu, "log_var":time_posterior_log_var}}

And the specific layer sizes:

model(
  (inp_pose_encoder): Sequential(
    (0): Linear(in_features=66, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ReLU()
  )
  (key_pose_encoder): Sequential(
    (0): Linear(in_features=63, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ReLU()
  )
  (pose_mu): Sequential(
    (0): Linear(in_features=256, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=8, bias=True)
  )
  (pose_log_var): Sequential(
    (0): Linear(in_features=256, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=8, bias=True)
  )
  (key_pose_decoder): Sequential(
    (0): Linear(in_features=136, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=63, bias=True)
  )
  (delta_pose_encoder): Sequential(
    (0): Linear(in_features=63, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ReLU()
  )
  (time_encoder): Sequential(
    (0): Linear(in_features=1, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=16, bias=True)
    (3): ReLU()
  )
  (time_mu): Sequential(
    (0): Linear(in_features=144, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=8, bias=True)
  )
  (time_log_var): Sequential(
    (0): Linear(in_features=144, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=8, bias=True)
  )
  (time_decoder): Sequential(
    (0): Linear(in_features=136, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=1, bias=True)
    (3): ReLU()
  )
)
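For reference, the Linear sizes printed above can be tallied with plain arithmetic (biases included, no GPU or model code needed). It comes out slightly under 300k parameters, i.e. roughly 1.1 MB in float32, which is consistent with the kilobyte-scale allocated-memory readings earlier in the thread:

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a single nn.Linear layer."""
    return n_in * n_out + (n_out if bias else 0)

# (in_features, out_features) for every Linear in the printed module
layers = [
    (66, 256), (256, 128),    # inp_pose_encoder
    (63, 256), (256, 128),    # key_pose_encoder
    (256, 64), (64, 8),       # pose_mu
    (256, 64), (64, 8),       # pose_log_var
    (136, 256), (256, 63),    # key_pose_decoder
    (63, 256), (256, 128),    # delta_pose_encoder
    (1, 256), (256, 16),      # time_encoder
    (144, 64), (64, 8),       # time_mu
    (144, 64), (64, 8),       # time_log_var
    (136, 256), (256, 1),     # time_decoder
]
total = sum(linear_params(i, o) for i, o in layers)
print(total)                  # 293360 parameters
print(total * 4 / 1024)       # ~1146 KB in float32
```

So the parameters alone account for only about 1.1 MB; the bulk of what nvidia-smi reports for the process is the CUDA context, not the model.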