I need to exchange data with other processes, so for better performance I want to pre-register the memory with the network card before doing RDMA send/recv. I only use the CPU, so implementing a custom CPU allocator seems like a good choice.
I noticed that this has been discussed before (Control Tensor Memory Allocation?), but I wonder whether the default CPU allocator has a memory pool, and if so, where the code is.
```cpp
#include <iostream>
#include <torch/all.h>

int main() {
  auto option = torch::TensorOptions().requires_grad(true).dtype(torch::kFloat32);
  torch::Tensor a = torch::tensor({1, 2, 3}, option);
  torch::Tensor b = torch::tensor({4, 5, 6}, option);
  torch::Tensor c = a * b;
  torch::Tensor grad = torch::autograd::grad({c}, {a}, {torch::ones_like(c)})[0];
  return 0;
}
```
I traced the demo above until `posix_memalign` was called, and I didn't find any memory pool or caching allocator along the way.
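To make clear what I mean by a memory pool, here is a minimal sketch of the kind of allocator I have in mind. It is not PyTorch code and is not wired into PyTorch: `RegisteredPool` and `RegisterFn` are hypothetical names, and the callback only stands in for whatever call registers a buffer with the NIC. It assumes fixed-size blocks, a POSIX system, and single-threaded use.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)

#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical hook: stands in for the call that registers a buffer with the
// network card. It is an assumption of this sketch, not a PyTorch API.
using RegisterFn = std::function<void(void* base, std::size_t bytes)>;

// Fixed-size block pool carved out of one slab that is allocated and
// registered exactly once. Not thread-safe.
class RegisteredPool {
 public:
  // `alignment` must be a power of two and a multiple of sizeof(void*).
  RegisteredPool(std::size_t block_bytes, std::size_t block_count,
                 const RegisterFn& register_fn, std::size_t alignment = 64)
      : block_bytes_(round_up(block_bytes, alignment)),
        slab_bytes_(block_bytes_ * block_count) {
    if (block_bytes == 0 || block_count == 0) {
      throw std::invalid_argument("pool must contain at least one block");
    }
    free_.reserve(block_count);  // later push_back calls cannot reallocate
    if (posix_memalign(&slab_, alignment, slab_bytes_) != 0) {
      throw std::bad_alloc();
    }
    try {
      register_fn(slab_, slab_bytes_);  // one registration for the whole slab
    } catch (...) {
      free(slab_);
      throw;
    }
    for (std::size_t i = 0; i < block_count; ++i) {
      free_.push_back(static_cast<char*>(slab_) + i * block_bytes_);
    }
  }

  ~RegisteredPool() { free(slab_); }
  RegisteredPool(const RegisteredPool&) = delete;
  RegisteredPool& operator=(const RegisteredPool&) = delete;

  // Returns a block from the pre-registered slab, or nullptr when exhausted.
  void* allocate() {
    if (free_.empty()) return nullptr;
    void* p = free_.back();
    free_.pop_back();
    return p;
  }

  // Returns a block to the pool; the memory itself is never released here.
  void deallocate(void* p) {
    assert(owns(p));
    free_.push_back(p);
  }

  std::size_t available() const { return free_.size(); }
  std::size_t block_bytes() const { return block_bytes_; }

 private:
  static std::size_t round_up(std::size_t n, std::size_t a) {
    return (n + a - 1) / a * a;
  }

  bool owns(const void* p) const {
    const char* c = static_cast<const char*>(p);
    const char* base = static_cast<const char*>(slab_);
    return c >= base && c < base + slab_bytes_;
  }

  std::size_t block_bytes_;
  std::size_t slab_bytes_;
  void* slab_ = nullptr;
  std::vector<void*> free_;
};
```

The point is that `posix_memalign` and the registration each happen once, and every later `allocate()`/`deallocate()` only touches the free list. This is the behavior I was looking for in the CPU allocator and could not find.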
My next question is about the gradient of a tensor: is it possible to pre-allocate it, and if so, how?