torch.jit.trace consumes more than 40x the GPU memory

I’m using torch.jit.trace to export a vov99 model to a TorchScript module on an A100 with 80 GB of GPU memory.

When I run inference with the original torch.nn.Module, it works well and only consumes around 2 GB of GPU memory. However, when I trace the model with the same example input tensor, it runs out of the full 80 GB of GPU memory.

Here is my code:

    import logging
    import time

    import torch

    # Net, load_checkpoint, cfgs and args come from my own project code.
    backbone = Net(cfgs)
    checkpoint = load_checkpoint(
        backbone, args.weights, map_location='cuda', strict=False,
        logger=logging.Logger(__name__, logging.ERROR)
    )
    backbone.cuda()
    backbone.eval()

    # nn.Module inference takes around 2 GB of GPU memory.
    example_forward_input = torch.rand(1, 3, 640, 1600).cuda()
    original_feats = backbone(example_forward_input)
    for r in original_feats:
        print(r.shape)

    time.sleep(5)

    # Trace the model and construct a `ScriptModule`.
    # Here an OOM error is raised:
    # RuntimeError: CUDA out of memory. Tried to allocate 142.00 MiB (GPU 0; 79.33 GiB
    # total capacity; 78.28 GiB already allocated; 7.81 MiB free; 78.81 GiB reserved in
    # total by PyTorch) If reserved memory is >> allocated memory try setting
    # max_split_size_mb to avoid fragmentation.
    # See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    backbone_module = torch.jit.trace(backbone, example_forward_input)
    backbone_module.save('outputs/traced_backbone.pt')
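
For reference, here is a minimal sketch of the variant I want to try next. It assumes the extra allocations come from autograd state recorded while the trace runs and from the check pass that torch.jit.trace performs by default, so it wraps the call in torch.no_grad() and disables that pass (torch.no_grad() and check_trace=False are standard PyTorch options; everything else is unchanged from the code above):

    import torch

    # Release the eager-mode outputs (and the autograd graph they hold)
    # before tracing, then return the cached blocks to the allocator.
    del original_feats
    torch.cuda.empty_cache()

    # Trace without building an autograd graph and without the extra
    # check pass that re-runs the model to compare outputs.
    with torch.no_grad():
        backbone_module = torch.jit.trace(
            backbone, example_forward_input, check_trace=False
        )

    backbone_module.save('outputs/traced_backbone.pt')
    print(torch.cuda.memory_allocated() / 1024 ** 2, 'MiB allocated after tracing')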

Hi @Junxuan_Chen,
did you find a solution to this?
I am facing the same issue with my PyTorch model, which is around 100 MB, when converting it to TorchScript. When I use a small dummy input (1, 3, 240, 328) the conversion works fine, but I need the input size (1, 3, 720, 1280), which may be the cause of running out of memory. I have tried on a GPU with 8 GB of memory and also on a CPU with 30 GB of RAM.

I have also tried different conversion methods and PyTorch versions, but the issue still persists. However, I have noticed the same issue with PyTorch-to-ONNX conversion.
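
For what it’s worth, this is a sketch of the ONNX route I have in mind, assuming the model is fully convolutional so that exporting with the small dummy input and marking the spatial axes as dynamic is valid (the model variable, file name, input/output names and axis labels are placeholders for my own setup):

    import torch

    model.eval()
    dummy = torch.rand(1, 3, 240, 328)  # small input used only for the export trace

    # Export without autograd bookkeeping; dynamic_axes lets the exported
    # graph accept the full (1, 3, 720, 1280) input at inference time.
    with torch.no_grad():
        torch.onnx.export(
            model, dummy, 'model.onnx',
            input_names=['input'], output_names=['output'],
            dynamic_axes={'input': {2: 'height', 3: 'width'}},
        )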