How can I pre-allocate a tensor and its gradient?

I need to gather data from other processes, so for better performance I must register the memory with the network card before issuing RDMA send/recv. I only use the CPU, and implementing a custom CPU allocator seems like a good option.
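To make the idea concrete, here is a minimal sketch of the kind of pre-registered pool I have in mind. It is not PyTorch code: `RegisteredArena` and its methods are hypothetical names of mine. The actual NIC registration is only marked by a comment, assuming something like `ibv_reg_mr` from libibverbs would be called there. The sketch obtains one aligned block up front and hands out sub-buffers from it, so everything a tensor uses would already lie inside the registered region.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical arena: one aligned block allocated up front, carved into
// sub-buffers with a simple bump pointer. Not thread-safe.
class RegisteredArena {
public:
    explicit RegisteredArena(std::size_t capacity, std::size_t alignment = 4096)
        : capacity_(capacity) {
        void* p = nullptr;
        // alignment must be a power of two and a multiple of sizeof(void*).
        if (posix_memalign(&p, alignment, capacity) != 0) throw std::bad_alloc();
        base_ = static_cast<std::uint8_t*>(p);
        // Assumption: the one-time NIC registration of
        // [base_, base_ + capacity_) would happen here (e.g. ibv_reg_mr).
    }
    ~RegisteredArena() { std::free(base_); }
    RegisteredArena(const RegisteredArena&) = delete;
    RegisteredArena& operator=(const RegisteredArena&) = delete;

    // Returns an aligned sub-buffer, or nullptr when the arena is exhausted.
    // align must be a power of two.
    void* allocate(std::size_t bytes, std::size_t align = 64) {
        std::size_t start = (offset_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        offset_ = start + bytes;
        return base_ + start;
    }

    // Makes the whole arena available again; old pointers become invalid.
    void reset() { offset_ = 0; }

    // True if p points into the block owned by this arena.
    bool owns(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        auto lo = reinterpret_cast<std::uintptr_t>(base_);
        return addr >= lo && addr < lo + capacity_;
    }

    std::size_t used() const { return offset_; }

private:
    std::uint8_t* base_ = nullptr;
    std::size_t capacity_ = 0;
    std::size_t offset_ = 0;
};
```

A tensor could then wrap such a buffer with `torch::from_blob`, which does not take ownership of the memory, so the arena would have to outlive every tensor built on it. Whether the same can be done for the gradient buffer is exactly what I am unsure about.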

I noticed that this has been discussed before (Control Tensor Memory Allocation?), but I wonder whether the CPU allocator has a memory pool, and if so, where the code is.

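#include <iostream>
#include <torch/all.h>

int main() {
    // Float tensors that track gradients.
    auto option = torch::TensorOptions().requires_grad(true).dtype(torch::kFloat32);
    torch::Tensor a = torch::tensor({1, 2, 3}, option);
    torch::Tensor b = torch::tensor({4, 5, 6}, option);
    torch::Tensor c = a * b;  // element-wise product; allocates the result

    // dc/da with an all-ones upstream gradient, which should equal b.
    torch::Tensor grad = torch::autograd::grad({c}, {a}, {torch::ones_like(c)})[0];
    std::cout << grad << std::endl;

    return 0;
}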

I traced the demo above down to the posix_memalign call and didn't find any memory pool or cached malloc along the way.

My next question is about the gradient of a tensor: is it possible to pre-allocate it, and if so, how?