How to debug causes of GPU memory leaks?


(Ben Usman) #1

Hi,

I have been trying to figure out why my code crashes after several batches with a CUDA out-of-memory error. I understand that there is probably some variable (or variables) that is not freed because I keep it in the graph. The question is: how do you debug that kind of thing?

I wrote a simple line profiler that examines the amount of GPU memory on each step, and it seems like on each loss.backward() step there is a massive memory allocation, not all of it is freed, and that piles up over a couple of absolutely identical iterations until the crash. It always crashes on backward(). I understand that PyTorch reuses memory and that is why it might seem like it is not freeing memory, but here it seems like something is indeed leaking.

I tried explicitly del'ing variables (e.g. the data variables) to mark them for reuse by torch, but that did not help. I also set CUDA_LAUNCH_BLOCKING=1 to force kernels to execute synchronously.

[GPU memory trace]

Is there a way to get a memory footprint like “all tensors allocated on GPU”?

Chances are that it is not backward()'s fault. For example, something like this could be happening:

a: +10 Mb (30 Mb still free)
b: +10 Mb (20 Mb still free)
c: +20 Mb (0 Mb still free)
cleanup: b leaks and is never marked as free; the other 30 Mb are freed but not released back to the driver
a: +10 Mb - no new allocation needed (20 Mb of cached memory actually free)
b: +10 Mb - no new allocation needed (10 Mb of cached memory actually free)
c: +20 Mb - needs a fresh large allocation and breaks, even though b is the one leaking

Anyway, how does one track these things down?


(Spandan Madan) #2

Great question, hope someone answers. This will be useful to me too!


#3

In Python, you can use the garbage collector's book-keeping to print out the currently resident tensors. Here's a snippet that lists everything that is currently alive:

# prints currently alive Tensors and Variables
import torch
import gc

for obj in gc.get_objects():
    if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
        print(type(obj), obj.size())

(Ben Usman) #4

Thanks! It seems to work with a try/except block around it (some objects, like shared-library wrappers, throw an exception when you call hasattr on them).
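
For reference, the guarded version is just the snippet above with a try/except around the check, e.g.:

import gc
import torch

# same listing as above, but tolerant of objects that raise on attribute access
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass  # e.g. some shared-library wrappers raise when probed with hasattr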

I extended my code that tracked memory usage to also track where allocations appeared, by comparing the set of alive tensors before and after each operation. The results are somewhat surprising. I stopped execution after the first batch (it breaks on a GPU memory allocation in the second batch), and memory consumption was higher in the case where fewer tensors were allocated O_o
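
Roughly, the comparison looks something like this (a minimal sketch, not my exact tracking code; the function name and the way tensors are keyed are illustrative):

import gc
import torch

def live_cuda_tensors():
    """Return a set describing every CUDA tensor the garbage collector can see."""
    found = set()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                t = obj if torch.is_tensor(obj) else obj.data
                if t.is_cuda:
                    found.add((id(t), type(obj).__name__, tuple(t.size())))
        except Exception:
            pass  # some objects raise on attribute access
    return found

before = live_cuda_tensors()
# ... run the operation of interest, e.g. a forward pass or loss.backward() ...
after = live_cuda_tensors()

for _, name, size in (after - before):
    print('+', name, size)   # tensors that appeared
for _, name, size in (before - after):
    print('-', name, size)   # tensors that disappeared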

Here is the diff between the two sorted listings below, i.e. the run with more allocations ended up consuming less memory. One possible explanation is that in the run that requested more memory, a single model was applied twice and gradients were computed, whereas in the "less consuming" run, the trained model is applied once and then a fixed model is applied once.

< + start freeze_params:91                             (128, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (128, 64, 3, 3)      <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (128,)               <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (21, 21, 32, 32)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (21, 21, 4, 4)       <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (21, 4096, 1, 1)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (21, 512, 1, 1)      <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (21,)                <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (256, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (256, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (256,)               <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (4096, 4096, 1, 1)   <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (4096, 512, 7, 7)    <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (4096,)              <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (512, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (512, 512, 3, 3)     <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (512,)               <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (64, 3, 3, 3)        <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (64, 64, 3, 3)       <class 'torch.cuda.FloatTensor'>
< + start freeze_params:91                             (64,)                <class 'torch.cuda.FloatTensor'>

Here are the original logs:

:7938.9 Mb

+ __main__ match_source_target:174                   (1, 3, 1052, 1914)   <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:178                   (1, 3, 1052, 1914)   <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (128, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (128, 64, 3, 3)      <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (128,)               <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (21, 4096, 1, 1)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (21,)                <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (256, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (256, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (256,)               <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (4096, 4096, 1, 1)   <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (4096, 512, 7, 7)    <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (4096,)              <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (512, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (512, 512, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (512,)               <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (64, 3, 3, 3)        <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (64, 64, 3, 3)       <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (64,)                <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (128, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (128, 64, 3, 3)      <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (128,)               <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 21, 32, 32)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 21, 4, 4)       <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 4096, 1, 1)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 512, 1, 1)      <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21,)                <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (256, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (256, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (256,)               <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (4096, 4096, 1, 1)   <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (4096, 512, 7, 7)    <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (4096,)              <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (512, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (512, 512, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (512,)               <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (64, 3, 3, 3)        <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (64, 64, 3, 3)       <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (64,)                <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (1, 500)             <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (1,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (500, 21)            <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (500, 500)           <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (500,)               <class 'torch.cuda.FloatTensor'>
+ distances.mlp objective:26                         (2,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp objective:33                         (1,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (1, 500)             <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (1,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (500, 21)            <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (500, 500)           <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (500,)               <class 'torch.cuda.FloatTensor'>
+ model.fcn16 __init__:13                            (21,)                <class 'torch.cuda.FloatTensor'>
+ model.fcn16 features_at:45                         (1, 4096, 34, 60)    <class 'torch.cuda.FloatTensor'>
+ model.fcn16 features_at:48                         (1, 4096, 34, 60)    <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (128, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (128, 64, 3, 3)      <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (128,)               <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (21, 21, 32, 32)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (21, 21, 4, 4)       <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (21, 4096, 1, 1)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (21, 512, 1, 1)      <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (21,)                <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (256, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (256, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (256,)               <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (4096, 4096, 1, 1)   <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (4096, 512, 7, 7)    <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (4096,)              <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (512, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (512, 512, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (512,)               <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (64, 3, 3, 3)        <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (64, 64, 3, 3)       <class 'torch.cuda.FloatTensor'>
+ start freeze_params:91                             (64,)                <class 'torch.cuda.FloatTensor'>

and in the second case it has fewer tensors (i.e. identical to the above, except ~10 tensors fewer), but higher memory consumption, and it breaks on the second epoch. Any ideas?

:11820.9 Mb

+ __main__ match_source_target:174                   (1, 3, 1052, 1914)   <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:178                   (1, 3, 1052, 1914)   <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (128, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (128, 64, 3, 3)      <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (128,)               <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (21, 4096, 1, 1)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (21,)                <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (256, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (256, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (256,)               <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (4096, 4096, 1, 1)   <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (4096, 512, 7, 7)    <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (4096,)              <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (512, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (512, 512, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (512,)               <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (64, 3, 3, 3)        <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (64, 64, 3, 3)       <class 'torch.cuda.FloatTensor'>
+ __main__ match_source_target:190                   (64,)                <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (128, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (128, 64, 3, 3)      <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (128,)               <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 21, 32, 32)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 21, 4, 4)       <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 4096, 1, 1)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21, 512, 1, 1)      <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (21,)                <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (256, 128, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (256, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (256,)               <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (4096, 4096, 1, 1)   <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (4096, 512, 7, 7)    <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (4096,)              <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (512, 256, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (512, 512, 3, 3)     <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (512,)               <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (64, 3, 3, 3)        <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (64, 64, 3, 3)       <class 'torch.cuda.FloatTensor'>
+ __main__ run_adaptation:355                        (64,)                <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (1, 500)             <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (1,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (500, 21)            <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (500, 500)           <class 'torch.cuda.FloatTensor'>
+ distances.mlp _init_net:55                         (500,)               <class 'torch.cuda.FloatTensor'>
+ distances.mlp objective:26                         (2,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp objective:33                         (1,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (1, 500)             <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (1,)                 <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (500, 21)            <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (500, 500)           <class 'torch.cuda.FloatTensor'>
+ distances.mlp_base attempt_update_d:77             (500,)               <class 'torch.cuda.FloatTensor'>
+ model.fcn16 __init__:13                            (21,)                <class 'torch.cuda.FloatTensor'>
+ model.fcn16 features_at:45                         (1, 4096, 34, 60)    <class 'torch.cuda.FloatTensor'>
+ model.fcn16 features_at:48                         (1, 4096, 34, 60)    <class 'torch.cuda.FloatTensor'>

#5

@Ben_Usman May I ask what you used to generate your GPU memory trace? Perhaps Python's trace facilities with added py3nvml calls?


#6

To comment on your question: do you use variable-sized batches as input? In that case, the growth might be caused by memory fragmentation (storages need to be re-allocated)…


(Ben Usman) #7

Thank you for the reply! Yes, it was sys.settrace with the following trace function.
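
The original trace function isn't reproduced here, but a minimal sketch of that approach (per-line tracing via sys.settrace, reading used GPU memory through py3nvml) could look like this; treat it as an illustration rather than the exact script:

import os
import sys
from py3nvml import py3nvml

py3nvml.nvmlInit()
_handle = py3nvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0; adjust as needed

def gpu_profile(frame, event, arg):
    # called for every traced event; print used GPU memory on each executed line
    if event == 'line':
        mem = py3nvml.nvmlDeviceGetMemoryInfo(_handle)
        print('{}:{} {:7.1f} Mb'.format(os.path.basename(frame.f_code.co_filename),
                                        frame.f_lineno, mem.used / 1024.0 ** 2))
    return gpu_profile

sys.settrace(gpu_profile)
# ... code to be profiled ...
sys.settrace(None)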

I think I have figured out the source of the leak, by the way. A significant portion of the code, such as variable allocation and intermediate computations, was located within a single Python function scope, so I suspect those intermediate variables were not marked as free even though they were not used anywhere further. Putting in a lot of del's kind of helped, but just isolating each individual step of the computation into a separate function call, so that all intermediate variables are automatically freed at the end of the scope, seems to be a better solution. Does that sound reasonable in the context of PyTorch?
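
Schematically, the refactoring looks like this (a sketch with placeholder names; the tiny model, loader and optimizer stand in for my actual code):

import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()          # stand-in model, not the real one
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def compute_loss(batch):
    # every intermediate activation created here goes out of scope when the
    # function returns, so its GPU memory can be reused right away
    inputs, targets = batch
    return criterion(model(inputs), targets)

for _ in range(3):
    batch = (torch.randn(4, 10).cuda(), torch.randint(0, 2, (4,)).cuda())
    loss = compute_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    del loss  # drop the last reference to the graph before the next iteration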

I wonder if there could be some sort of "semantic garbage collection" that could detect that a variable is not used anywhere further and thus could be freed.

Thanks!


#8

Thanks for showing your code!

I'm happy you've resolved your memory issue - it's a very useful observation you've made and it's good that it's now here in public. Indeed, Python's lack of block scoping can sometimes delay object destruction for unnecessarily long. Actually, I've used del's myself recently for releasing buffers at the end of each iteration in a loop processing variable-sized data. I'm afraid that "semantic garbage collection" is not technically possible, because the language is dynamic.
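
Schematically, that pattern is just (hypothetical shapes, only to illustrate):

import torch

for size in (512, 300, 700, 450):            # variable-sized "batches"
    batch = torch.randn(size, 1024).cuda()   # large per-iteration buffer
    features = batch.mean(dim=1)             # stand-in for real work
    print(features.sum().item())
    del batch, features                      # release the buffers before the
                                             # next, differently-sized batch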


(Carson McNeil) #9

Hello! I am trying to use this technique to debug, but the amount of GPU memory used seems to be an order of magnitude larger than the tensors being allocated.

After running a forward pass on my network I use the code above:

import gc, operator as op, torch
from functools import reduce

for obj in gc.get_objects():
    if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
        print(reduce(op.mul, obj.size()) if len(obj.size()) > 0 else 0, type(obj), obj.size())

GPU Mem used is around 10GB after a couple of forward/backward passes.

(161280, <class 'torch.autograd.variable.Variable'>, (5, 14, 3, 24, 32))
(451584, <class 'torch.autograd.variable.Variable'>, (14, 14, 3, 24, 32))
(612864, <class 'torch.autograd.variable.Variable'>, (19, 14, 3, 24, 32))
(612864, <class 'torch.autograd.variable.Variable'>, (19, 14, 3, 24, 32))
(2, <class 'torch.autograd.variable.Variable'>, (2,))
(420, <class 'torch.autograd.variable.Variable'>, (30, 1, 14))
(1026000, <class 'torch.autograd.variable.Variable'>, (19, 15, 450, 8))
(202, <class 'torch.autograd.variable.Variable'>, (2, 101))
(0, <class 'torch.autograd.variable.Variable'>, ())
(3, <class 'torch.autograd.variable.Variable'>, (3,))
(70, <class 'torch.autograd.variable.Variable'>, (5, 14))
(45, <class 'torch.autograd.variable.Variable'>, (45,))
(13230, <class 'torch.autograd.variable.Variable'>, (90, 3, 7, 7))
(90, <class 'torch.autograd.variable.Variable'>, (90,))
(10, <class 'torch.autograd.variable.Variable'>, (10,))
(735, <class 'torch.autograd.variable.Variable'>, (15, 1, 7, 7))
(15, <class 'torch.autograd.variable.Variable'>, (15,))
(8, <class 'torch.autograd.variable.Variable'>, (1, 1, 1, 8, 1))
(808, <class 'torch.autograd.variable.Variable'>, (101, 8))
(101, <class 'torch.autograd.variable.Variable'>, (101,))
(3, <class 'torch.autograd.variable.Variable'>, (3,))
(70, <class 'torch.autograd.variable.Variable'>, (5, 14))
(45, <class 'torch.autograd.variable.Variable'>, (45,))
(13230, <class 'torch.autograd.variable.Variable'>, (90, 3, 7, 7))
(90, <class 'torch.autograd.variable.Variable'>, (90,))
(10, <class 'torch.autograd.variable.Variable'>, (10,))
(735, <class 'torch.autograd.variable.Variable'>, (15, 1, 7, 7))
(15, <class 'torch.autograd.variable.Variable'>, (15,))
(8, <class 'torch.autograd.variable.Variable'>, (1, 1, 1, 8, 1))
(808, <class 'torch.autograd.variable.Variable'>, (101, 8))
(101, <class 'torch.autograd.variable.Variable'>, (101,))

You can see the biggest variable here should only total around 10 MB, and altogether they shouldn't need much more space than that. Where is the hidden memory usage? My batch size IS variable, as referenced above, but I OOM after only a few batches. Am I confused about something, or is it using 10-100x more memory than it should?

Thanks for any insight!


(XDWang) #10

@Ben_Usman Can you share how to use gpu_profile.py? Thank you!


(Even Oldridge) #11

Something to consider with variable-sized batches is that PyTorch allocates memory for each batch as needed and doesn't free it inline, because the cost of calling garbage collection during the training loop is too high. With variable batch sizes this can lead to multiple instances of the same batch buffer sitting in memory.

If you make sure that your variably sized batches start with the largest batch, then the initial allocation will be large enough to hold all subsequent batches and you won't see crazy memory growth. The natural instinct of most programmers who order their batches is to do the opposite, which means that the same buffer gets allocated multiple times over the course of training and never gets freed. Even if the order is random, there's still a lot of unnecessary allocation going on.

I ran into this with a language model that used a random backprop-through-time window in its batching, and I was able to reduce the memory requirements by an order of magnitude by forcing the first batch to be the largest.
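
A minimal illustration of the idea (hypothetical sizes; the point is only that the largest batch is processed first):

import torch

batches = [torch.randn(n, 128) for n in (32, 57, 128, 64, 96)]  # variable sizes

# move the largest batch to the front so the biggest GPU buffer is allocated
# on the first iteration and can then be reused for the smaller ones
largest = max(range(len(batches)), key=lambda i: batches[i].size(0))
batches[0], batches[largest] = batches[largest], batches[0]

for batch in batches:
    out = batch.cuda() * 2  # stand-in for the real forward/backward pass
    del out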