Hi, PyTorch Community!
I’m currently working on a computer vision project that uses deep CNNs for inference. Our setup is a multi-GPU environment with 8 GPUs. We’ve reached the steady-state performance we were aiming for, but there’s a notable challenge we’re facing: the initial “warm-up” time.
Problem Description: Our program takes approximately 50 seconds to warm up every time it starts. This warm-up phase significantly hurts our overall efficiency, especially in scenarios that require quick restarts. Upon investigation, we found that a substantial portion of this time is likely spent in the JIT (just-in-time) compilation process, particularly the fusion and optimization of CUDA kernels.
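For context, here is roughly how we measure the warm-up (a minimal sketch; `resnet50` and the input shape are stand-ins for our actual model and data, but the pattern is the same in our real pipeline):

```python
import time
import torch
from torchvision.models import resnet50

# resnet50 stands in for our actual CNN; the warm-up pattern is the same.
model = torch.compile(resnet50().eval().cuda())
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    for i in range(3):
        t0 = time.perf_counter()
        model(x)
        torch.cuda.synchronize()
        print(f"iteration {i}: {time.perf_counter() - t0:.2f} s")
# Iteration 0 dominates: graph capture and CUDA kernel compilation happen there.
```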
Environment Details:
- PyTorch version: 2.2.2
- CUDA version: 12.1
- Number of GPUs: 8
- GPU model: NVIDIA A10G
- Nature of the workload: Inference using deep CNNs for computer vision tasks
Main Question: Is there a way to serialize the state of the JIT-fused CUDA kernels and persist it to disk after the first compilation? Our goal is to pay the warm-up cost once and then reuse the optimized state in subsequent runs, eliminating or significantly reducing the warm-up time.
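Conceptually, the workflow we’re hoping for looks something like this (every helper in this sketch is hypothetical; we’re asking whether an equivalent actually exists):

```python
import torch

# Hypothetical workflow sketch -- none of these helpers exist in PyTorch
# as far as we know; this is just the shape of the API we are looking for.

# Run 1: warm up once, then persist whatever the JIT produced.
model = build_model().eval().cuda()                   # build_model() is a placeholder
warm_up(model)                                        # the ~50 s compilation happens here
save_compiled_kernels(model, "/shared/kernel_cache")  # hypothetical

# Runs 2..N: restore the compiled state and skip the warm-up.
model = build_model().eval().cuda()
load_compiled_kernels(model, "/shared/kernel_cache")  # hypothetical
```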
Sub-questions:
- If such serialization and persistence is possible, what are the steps to achieve it?
- How can we reload the saved state from disk before running our inference workload, so that the JIT compilation step is bypassed?
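For reference, this is what we have already tried (a minimal sketch, with resnet50 again standing in for our model). `torch.jit.save` / `torch.jit.load` persist the scripted graph, but as far as we can tell the fused CUDA kernels are still recompiled on the first forward pass in each new process:

```python
import torch
from torchvision.models import resnet50

# One-time export: persist the TorchScript graph to disk.
scripted = torch.jit.script(resnet50().eval())
torch.jit.save(scripted, "model_scripted.pt")

# Fresh process: reload the graph...
model = torch.jit.load("model_scripted.pt").cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")
with torch.no_grad():
    model(x)  # ...but the first call is still slow: fused kernels are rebuilt
```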
Why This Matters: Reducing or eliminating the warm-up time would improve our system’s responsiveness and overall throughput. At the scale we operate, even small efficiency gains translate into significant benefits.
I’d welcome any guidance or suggestions. If anyone has tackled a similar challenge or knows of a potential solution, your input would be much appreciated.
Thank you in advance!