Optimizers do not support foreach & fused in LibTorch

Hi, I am using LibTorch 2.3.1 + cu118 to train my model. I have a massive number of parameters, which I divide into groups for fine-grained optimization. For example, when using Adam, each parameter group may have a different beta or learning rate, and the groups may not share the same step count either. After reading the LibTorch source code, I found that the Adam implementation is a single for-loop over parameters, with no parallel acceleration (neither a foreach multi-tensor path nor a fused CUDA kernel). Is this not planned for development? Would it be feasible for a single person to write the CUDA acceleration code?
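For context, the Python API does expose both code paths directly as constructor flags, which is what I was hoping to find on the C++ side. A minimal sketch of the per-group setup I described, using the Python `torch.optim.Adam` (assuming `torch` is installed; the group hyperparameters here are just illustrative values):

```python
import torch

# Two parameter groups with different hyperparameters, as in my use case.
p1 = torch.nn.Parameter(torch.randn(4))
p2 = torch.nn.Parameter(torch.randn(4))

opt = torch.optim.Adam(
    [
        {"params": [p1], "lr": 1e-3, "betas": (0.9, 0.999)},
        {"params": [p2], "lr": 1e-4, "betas": (0.8, 0.99)},
    ],
    # foreach=True selects the multi-tensor path; fused=True would select
    # the fused CUDA kernel (CUDA tensors only). LibTorch's C++
    # torch::optim::Adam has no equivalent of either flag.
    foreach=True,
)

loss = (p1.sum() + p2.sum()) ** 2
loss.backward()
opt.step()
```

On the C++ side, per-group options can be set via `torch::optim::AdamOptions`, but as far as I can tell `step()` always falls through to the single-tensor loop regardless.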