GPU Memory Leak with Zombie Memory Occupation after Job is Killed

I have been developing the following library: GraSS/_GradComp/attributor/attributor.py at main · TRAIS-Lab/GraSS · GitHub for a while, and I have run into a weird GPU memory leak on large-scale experiments (specifically, an 8B model with a 1B-token dataset). Let me briefly summarize what this library does to give some context:

  1. cache_gradient (in base.py): compute per-sample gradients for the train dataloader. No memory leak in this step.
  2. compute_preconditioner: load all per-sample gradients, compute their (layer-wise) outer products, and sum them. At the end, compute their inverses, i.e., the preconditioners. I see some memory build-up, but not enough to cause OOM.
  3. compute_ifvp: compute the matrix-vector products between the preconditioners (step 2) and the per-sample gradients (step 1). I see some memory build-up, but not enough to cause OOM.
  4. attribute: compute per-sample gradients for the test dataloader first, then compute all pairwise inner products between these test per-sample gradients and the preconditioned train per-sample gradients (step 3). This is the step that causes OOM. (A rough sketch of steps 2–4 follows this list.)
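To make the data flow concrete, here is a rough, self-contained sketch of steps 2–4 for a single layer, with random tensors standing in for the gradients; the names, sizes, and the damping term are my simplifications, not the library's actual API:

    import torch

    proj_dim, batch_size, num_test = 1024, 8, 4          # placeholder sizes
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Dummy stand-ins for per-sample gradients that are normally loaded from disk.
    train_gradient_batches = [torch.randn(batch_size, proj_dim) for _ in range(3)]
    test_gradients = torch.randn(num_test, proj_dim)

    # Step 2: accumulate layer-wise gradient outer products, then invert.
    hessian = torch.zeros(proj_dim, proj_dim, device=device)
    for grads in train_gradient_batches:
        g = grads.to(device)
        hessian += g.T @ g
    # Small damping added only so this toy matrix is invertible.
    preconditioner = torch.linalg.inv(hessian + 0.1 * torch.eye(proj_dim, device=device))

    # Step 3: precondition each train gradient batch (IFVP); in the real code this is offloaded to disk.
    ifvp_batches = [grads.to(device) @ preconditioner for grads in train_gradient_batches]

    # Step 4: pairwise scores between test gradients and each preconditioned train batch.
    for ifvp in ifvp_batches:
        scores = test_gradients.to(device) @ ifvp.T       # (num_test, batch_size)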

If you look into the code, you’ll find that with offload=disk (i.e., I save/load every gradient/IFVP/preconditioner to/from disk), the code should ideally occupy only a fixed amount of GPU memory, since nothing is kept in memory: I move every result to CPU followed by a save to disk, and loading from disk is just prefetching with the dataloader. All actual computation on the GPU is fixed-size (only a small batch of the entire dataset is loaded at a time) and consists of simple operations (matrix multiplications).
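In other words, each block is intended to follow a bounded-memory pattern roughly like the following toy loop (my simplification with random data, not the actual GraSS code):

    import os
    import tempfile
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    device = "cuda" if torch.cuda.is_available() else "cpu"
    loader = DataLoader(TensorDataset(torch.randn(64, 256)), batch_size=8, pin_memory=True)
    out_dir = tempfile.mkdtemp()

    for i, (batch,) in enumerate(loader):
        batch = batch.to(device, non_blocking=True)        # fixed-size chunk on GPU
        result = batch @ batch.T                           # fixed-size GPU work (placeholder matmul)
        torch.save(result.cpu(), os.path.join(out_dir, f"chunk_{i}.pt"))  # offload result to disk
        del batch, result                                  # nothing from this iteration is kept on GPU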

However, what happens in practice is that the memory builds up slowly and eventually causes OOM in some cases. Another weird thing is that when the job ends (either by keyboard interrupt or OOM), there is still some memory occupied on the GPU. For instance, this is what happened when I ran attribute for a while and then killed the job manually:

The memory builds up slowly (the left part), and after I kill the job, there is still some memory occupying the GPU (the right part). This is what leads me to think there is a memory leak. I can’t find related posts about this behavior, and I wonder how it is even possible.

I’m sorry that I can’t provide an MWE, since this only happens at large scale. I’d be happy to provide any further information; I’ve been stuck for quite a while and can’t pinpoint the root cause of this weird phenomenon.

I have also tried to profile the memory usage following https://pytorch.org/blog/understanding-gpu-memory-1/ to debug it myself, but I don’t see any useful information in the snapshots.
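The profiling helpers I use are essentially the ones from that post, roughly as follows (the exact file naming differs slightly in my code):

    import torch
    from datetime import datetime

    MAX_ENTRIES = 100000

    def start_record_memory_history():
        # Begin recording allocator events (with stack traces).
        torch.cuda.memory._record_memory_history(max_entries=MAX_ENTRIES)

    def stop_record_memory_history():
        # Stop recording allocator events.
        torch.cuda.memory._record_memory_history(enabled=None)

    def export_memory_snapshot():
        # Dump the recorded history to a pickle viewable at pytorch.org/memory_viz.
        timestamp = datetime.now().strftime("%b_%d_%H_%M_%S")
        torch.cuda.memory._dump_snapshot(f"memory_snapshot_{timestamp}.pickle")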

I really have no idea what is going on, let alone how to debug it. I want to understand why this happens and how I can fix it.

For reference, the following is the snippet I used to profile the code blocks mentioned above, with start_record_memory_history(), stop_record_memory_history(), and export_memory_snapshot() defined as in the profiling tutorial:

        # Start recording memory snapshot history
        start_record_memory_history()

        # Create dataloader for IFVP with optimal batch size
        train_ifvp_dataloader = self.strategy.create_gradient_dataloader(
            data_type="ifvp",
            batch_size=2,
            pin_memory=True
        )
        torch.cuda.empty_cache()

        # Create the memory snapshot file
        export_memory_snapshot()
        start_record_memory_history()

        logger.info("Starting efficient double-batched attribution computation")

        # Configure test batching for memory efficiency
        test_batch_size = min(32, test_sample_count)  # Process test samples in chunks
        logger.debug(f"Using test batch size: {test_batch_size}")

        iteration = 0
        # Single pass through training IFVP data with nested test batching
        for chunk_tensor, batch_mapping in tqdm(train_ifvp_dataloader, desc="Computing attribution"):
            # Move train chunk to device
            chunk_tensor_device = self.strategy.move_to_device(chunk_tensor).to(dtype=all_test_gradients.dtype)

            # Process test gradients in batches to save memory
            for test_start in range(0, test_sample_count, test_batch_size):
                test_end = min(test_start + test_batch_size, test_sample_count)
                test_batch = all_test_gradients[test_start:test_end]

                # Move test batch to device
                test_batch_device = self.strategy.move_to_device(test_batch)

                # Efficient batched matrix multiplication for this (train_chunk, test_batch) pair
                # Shape: (chunk_samples, proj_dim) @ (proj_dim, test_batch_samples) -> (chunk_samples, test_batch_samples)
                chunk_scores = torch.matmul(chunk_tensor_device, test_batch_device.t())

                # Map chunk results back to global sample indices
                for batch_idx, (start_row, end_row) in batch_mapping.items():
                    if batch_idx not in batch_to_sample_mapping:
                        continue

                    train_start, train_end = batch_to_sample_mapping[batch_idx]
                    batch_scores = chunk_scores[start_row:end_row]
                    IF_score[train_start:train_end, test_start:test_end] = batch_scores.to(IF_score.device)

                # Clean up test batch from device
                # del test_batch_device, chunk_scores
                torch.cuda.empty_cache()

            # Clean up train chunk from device
            # del chunk_tensor_device
            torch.cuda.empty_cache()
            iteration += 1
            if iteration == 100:
                break

        # Create the memory snapshot file
        export_memory_snapshot()

Did you check if the process was really killed? Even if PyTorch were leaking memory, once the process is killed the memory would be released by destroying the CUDA context.
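For example, something along these lines would list which processes still hold memory on the GPU (assuming the pynvml / nvidia-ml-py package is installed; nvidia-smi shows the same information):

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # List every compute process currently holding memory on GPU 0.
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(f"pid={proc.pid} used_memory={proc.usedGpuMemory}")
    pynvml.nvmlShutdown()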

Hi, I rechecked with both nvtop and htop, and here is what I have:

  1. nvtop: I kill the job with ctrl+d when I see the memory building up, and nvtop still shows memory occupied after the job is terminated.
  2. htop: Meanwhile, I also checked htop and didn’t see anything suspicious or any job-like process under my user (pbb):
2965915 pbb         20   0 22912 13632  8640 S   0.0  0.0  0:00.04 /usr/lib/systemd/systemd --user
2965916 pbb         20   0  185M 11712     0 S   0.0  0.0  0:00.00 (sd-pam)
2965924 pbb         20   0  9344  4736  3520 S   0.0  0.0  0:00.00 /bin/bash /var/spool/slurmd/job242578/slurm_script
2965930 pbb         20   0  7552  1408   640 S   0.0  0.0  0:00.00 sleep 9000
2966010 pbb         20   0 31936 10624  2368 S   0.0  0.0  0:00.01 sshd: pbb@pts/0
2966011 pbb         20   0 11072  5632  4032 S   0.0  0.0  0:00.01 -bash
2966095 pbb         20   0 31936 10624  2368 S   0.0  0.0  0:00.00 sshd: pbb@pts/1
2966096 pbb         20   0 11072  5056  3648 S   0.0  0.0  0:00.01 -bash
2966338 pbb         20   0 31936 10624  2368 S   0.0  0.0  0:00.02 sshd: pbb@pts/2
2966339 pbb         20   0 11072  5120  3648 S   0.0  0.0  0:00.01 -bash
2966423 pbb         20   0 31680  7296  3840 R   0.0  0.0  0:05.20 htop
2966759 pbb         20   0 35072 18560  3904 S   0.7  0.0  0:01.38 nvtop

In case it helps with the debugging: I’m using the Grace Hopper subsystem (aarch64) of the TACC Vista cluster (Vista - TACC HPC Documentation).

Does this mean you cannot reproduce any memory usage after you terminated the process?

What do you mean by “reproducing memory usage after terminating the process”? In the following:

what I am trying to show is that, according to nvtop, there is still some memory occupying the GPU after I manually killed the job (verified in both nvtop and htop: every related process seems to be killed). It might not look significant since I killed the job right after the first memory build-up, but if you look closely, the yellow line is non-zero while nothing is running.

Same for the following: