Help Explaining Latency of Multi-threaded Implementation

Hello all :slightly_smiling_face: .
I have written the C++ extension below (CPU-only) to fill nodes into fixed-size buckets in a FIFO manner. Each node goes into exactly one bucket in each of the L tables. I parallelize over L so that no two threads ever write to the same table, avoiding collisions.

Dimensions used: num_nodes=524288, num_buckets=4096, bucket_size=128, num_threads=48.
(The machine has roughly 1 TB of RAM and supports up to 96 hardware threads.)

For L < num_threads, I would expect latency to be roughly constant with respect to L, since each of the L tables gets its own thread. Instead, latency grows significantly as I increase L. What could be the reason for this? Does it have to do with caching? Thanks in advance.

L = 12 => 5.3450 ms
L = 24 => 12.8900 ms
L = 36 => 22.2119 ms
L = 48 => 29.6163 ms

#include <torch/extension.h>
#include <ATen/Parallel.h>

void fill_buckets_FIFO(
        const torch::Tensor& indices,
        torch::Tensor& buckets,
        torch::Tensor& bucket_counts) {
    
    int32_t num_nodes = indices.size(0);
    int32_t L = buckets.size(0);
    // int32_t num_buckets = buckets.size(1);
    int32_t bucket_size = buckets.size(2);

    auto buckets_0 = buckets.accessor<int32_t, 3>(); // L x num_buckets x bucket_size
    auto bucket_counts_0 = bucket_counts.accessor<int32_t, 2>(); // L x num_buckets
    auto indices_0 = indices.accessor<int32_t, 2>(); // num_nodes x L

    // at::parallel_for hands each worker a [start, end) chunk of tables;
    // its callback receives int64_t bounds, so avoid narrowing to int32_t.
    at::parallel_for(0, L, 0, [&](int64_t start, int64_t end) {
        for (int64_t l = start; l < end; l++) {
            auto buckets_1 = buckets_0[l];
            auto bucket_counts_1 = bucket_counts_0[l];

            for (int32_t i = 0; i < num_nodes; i++) {
                int32_t bucket_index = indices_0[i][l];
                int32_t &bucket_count = bucket_counts_1[bucket_index];
                // bucket_count grows without bound; % bucket_size turns each
                // bucket into a ring buffer that overwrites the oldest entry (FIFO).
                buckets_1[bucket_index][bucket_count % bucket_size] = i;
                bucket_count++;
            }
        }
    });
}