Hello, and thank you for PyTorch, such a powerful tool!
Background Information: Our team is dedicated to extending PyTorch to lower bit widths. To that end, we established the BitTorch Engine project, which includes components and GPU kernel implementations for 4-, 2-, and 1-bit scenarios. For low-bit components such as the QLinear layer, we have implemented the forward and backward passes based on torch.autograd.Function (code example: [link]). Additionally, we have developed an optimizer called DiodeMix (bitorch-engine/blob/main/bitorch_engine/optim/diode_beta.py#L37) that can train these low-bit components (our paper “Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent” has been accepted at ECCV 2024).
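For context, here is a minimal sketch of the kind of torch.autograd.Function-based low-bit layer we mean; the class name QLinearFunction and the straight-through backward pass are illustrative only, not the actual bitorch-engine implementation:

```python
import torch

class QLinearFunction(torch.autograd.Function):
    """Illustrative 1-bit linear layer: binarize the weight in forward and
    use a straight-through estimator for the weight gradient in backward."""

    @staticmethod
    def forward(ctx, input, weight):
        binary_weight = torch.sign(weight)              # quantize weights to {-1, 0, +1}
        ctx.save_for_backward(input, binary_weight)
        return input.matmul(binary_weight.t())

    @staticmethod
    def backward(ctx, grad_output):
        input, binary_weight = ctx.saved_tensors
        grad_input = grad_output.matmul(binary_weight)  # gradient w.r.t. the input
        grad_weight = grad_output.t().matmul(input)     # straight-through gradient w.r.t. the weight
        return grad_input, grad_weight

# usage: out = QLinearFunction.apply(x, w); out.sum().backward()
```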
The Problem: We are currently facing a limitation where PyTorch restricts gradient computation to floating-point and complex dtypes and does not support integer dtypes. In BitTorch Engine we have made some adaptations at the Python level that allow us to compute gradients for Int tensors on a single machine with a single GPU. However, when it comes to multi-GPU and distributed training, we are still constrained by this limitation.
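To make the limitation concrete, here is the check we run into as soon as an integer tensor is marked as requiring gradients (minimal reproduction, no bitorch-engine code involved):

```python
import torch

w = torch.randint(-8, 8, (64, 64), dtype=torch.int8)
w.requires_grad_(True)
# RuntimeError: Only Tensors of floating point and complex dtype can require gradients
```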
We are hesitant to modify the underlying C++ code directly, but we plan to extend our current work to multi-GPU and distributed scenarios. We would greatly appreciate your help and advice on how to relax the constraint on gradient computation for Int tensors without affecting other functionality. Thank you very much!