Hello all.
I have written the C++ (CPU) extension below to fill nodes into fixed-size buckets in FIFO order. Each node goes into a single bucket in each of the L tables, and I parallelize over L to avoid write collisions.
Dimensions used: num_nodes = 524288, num_buckets = 4096, bucket_size = 128, num_threads = 48
(The machine has about 1 TB of RAM and supports up to 96 hardware threads.)
When L < num_threads, I expect latency to be roughly constant with respect to L, since each table gets its own thread. Instead, latency grows significantly with L. What could be the reason for this? Does it have to do with caching? Thanks in advance.
L = 12 => 5.3450 ms
L = 24 => 12.8900 ms
L = 36 => 22.2119 ms
L = 48 => 29.6163 ms
void fill_buckets_FIFO(
    const torch::Tensor& indices,
    torch::Tensor& buckets,
    torch::Tensor& bucket_counts) {
  int32_t num_nodes = indices.size(0);
  int32_t L = buckets.size(0);
  // int32_t num_buckets = buckets.size(1);
  int32_t bucket_size = buckets.size(2);
  auto buckets_0 = buckets.accessor<int32_t, 3>();             // L x num_buckets x bucket_size
  auto bucket_counts_0 = bucket_counts.accessor<int32_t, 2>(); // L x num_buckets
  auto indices_0 = indices.accessor<int32_t, 2>();             // num_nodes x L
  // One task per table; each thread owns table l exclusively,
  // so no synchronization is needed on buckets or bucket_counts.
  at::parallel_for(0, L, 0, [&](int64_t start, int64_t end) {
    for (int64_t l = start; l < end; l++) {
      auto buckets_1 = buckets_0[l];
      auto bucket_counts_1 = bucket_counts_0[l];
      for (int32_t i = 0; i < num_nodes; i++) {
        int32_t bucket_index = indices_0[i][l];
        int32_t& bucket_count = bucket_counts_1[bucket_index];
        // Overwrite the oldest slot once the bucket is full (FIFO ring).
        buckets_1[bucket_index][bucket_count % bucket_size] = i;
        bucket_count++;
      }
    }
  });
}