Dear fellow PyTorch users, I would like to share a project that may make your training and inference runs up to 10x faster.
A tremendous academic effort has gone into the design and implementation of efficient neural networks in recent years to cope with the ever-increasing amount of data on ever-smaller and more efficient devices. Yet, as of the time of writing, most researchers are unaware of even the most basic acceleration techniques for deep learning on GPUs.
Especially in academia, many do not even use Automatic Mixed Precision (AMP), which can reduce memory requirements to as little as 1/4 and increase speeds by 4-5x. This is the case even though AMP can be enabled with little hassle using the PyTorch Lightning or HuggingFace Accelerate libraries.
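For readers who have never seen AMP in code, here is a minimal sketch of one training step using PyTorch's native torch.autocast and gradient scaler rather than Lightning or Accelerate. It is not taken from the repository: the function names train_step_amp and demo, the toy linear model, and the random data are all hypothetical and exist only for illustration. It assumes a reasonably recent PyTorch release and falls back to bfloat16 on CPU when no GPU is available.

```python
import torch
from torch import nn


def train_step_amp(model, optimizer, scaler, inputs, targets, device_type):
    """Run one optimization step under automatic mixed precision."""
    optimizer.zero_grad(set_to_none=True)
    # float16 is the usual choice on CUDA; CPU autocast uses bfloat16.
    dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=dtype):
        outputs = model(inputs)
        # Cast to float32 so the loss is computed in full precision.
        loss = nn.functional.mse_loss(outputs.float(), targets)
    # The scaler multiplies the loss to avoid float16 gradient underflow;
    # when disabled (CPU case) these calls fall through to plain backward/step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


def demo(steps=3):
    """Hypothetical toy example: a linear model trained on random data."""
    torch.manual_seed(0)
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(8, 1).to(device_type)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Loss scaling is only needed for float16 on CUDA, so it is disabled on CPU.
    # Newer PyTorch releases also offer torch.amp.GradScaler for the same job.
    scaler = torch.cuda.amp.GradScaler(enabled=(device_type == "cuda"))
    inputs = torch.randn(16, 8, device=device_type)
    targets = torch.randn(16, 1, device=device_type)
    return [
        train_step_amp(model, optimizer, scaler, inputs, targets, device_type)
        for _ in range(steps)
    ]
```

Calling demo() returns the loss of each step as a Python float. In a real project the same pattern wraps your own model and data loader, and the actual speedup depends heavily on the GPU and the network.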
Even the novice who has only just dipped their toes into the murky depths of deep learning knows that more compute is a key ingredient for success. No matter how brilliant the scientist, outperforming a rival with 10x more compute is no mean feat.
I have created a repository with the aim of enabling researchers and engineers without much knowledge of GPUs, CUDA, Docker, etc. to squeeze every last drop of performance from their GPUs using the same hardware and neural networks.
If you are among those who could previously only long for a quicker end to the hours and days spent staring at TensorBoard as your models inched past the epochs, this project may be just the thing for you. With a source build of PyTorch against the latest version of CUDA, combined with AMP, one may achieve training and inference up to 10x faster than in a naive PyTorch environment.
I sincerely hope that my project will be of service to practitioners in both academia and industry.
Please show your appreciation by starring my repository.