Help using CUDA streams in multiple threads to speed up computation

I need some help using CUDA streams in multiple threads to speed up my computation. Could you help me and give me some advice?

my problem:
The time taken with multiple threads and multiple streams seems to be the same as with a serial loop. There is no speedup.

my codes:
I use a thread pool with 10 threads and add my tasks to it. In the thread task function, I get a free CUDA stream from the stream pool and do some work on that stream.

the thread task is as follows:
#include <iostream>
#include <tuple>
#include <torch/torch.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>

void* task_thread(void** arg)
{
    auto options = torch::TensorOptions().device(torch::kCUDA, 0);
    torch::Device device(torch::kCUDA, 0);

    // Take a stream from the pool and make it the current stream for this thread.
    at::cuda::CUDAStream mystream = at::cuda::getStreamFromPool();
    at::cuda::CUDAStreamGuard guard(mystream);
    std::cout << "Stream ID: " << mystream.id() << std::endl;

    torch::Tensor* pt_base_feature_cpu = (torch::Tensor*) arg[0];
    torch::Tensor* pt_match_feature_cpu = (torch::Tensor*) arg[1];

    for(int i = 0; i < 10; i++)
    {
        // Copy one 50000-row slice of the base features, and the match feature, to the GPU.
        torch::Tensor base_feature = (pt_base_feature_cpu->slice(0, i*50000, (i+1)*50000, 1)).to(device);
        torch::Tensor match_feature = (*pt_match_feature_cpu).to(device);

        torch::Tensor tensor_tmp;
        torch::Tensor tensor_sum;
        std::tuple<torch::Tensor, torch::Tensor> sort_ret;

        // Squared difference per element, summed over dim 1, then top-1 over the row sums.
        tensor_tmp = torch::sub(base_feature, match_feature);
        tensor_tmp = torch::pow(tensor_tmp, 2);
        tensor_sum = torch::sum(tensor_tmp, 1);
        sort_ret = torch::topk(tensor_sum, 1);
    }
    return nullptr;
}
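
To make clear what each loop iteration computes, here is a CPU-only sketch in plain C++ of the same arithmetic. It is only an illustration, not the code I run: the function name `farthest_row` and the nested-vector layout are hypothetical, and it assumes the match feature is a single row that is broadcast against every base row. Note that `torch::topk` returns the largest values by default, so the sketch picks the row with the largest squared distance.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU reference for one loop iteration of task_thread.
// base:  N rows of D floats (one slice of the base features)
// match: D floats (assumed to be a single row broadcast over all base rows)
// Returns {largest squared distance, index of that row}.
// For an empty base it returns {-1, 0}.
std::pair<float, std::size_t> farthest_row(
    const std::vector<std::vector<float>>& base,
    const std::vector<float>& match)
{
    float best = -1.0f;
    std::size_t best_idx = 0;
    for (std::size_t r = 0; r < base.size(); ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < match.size(); ++c) {
            float d = base[r][c] - match[c];  // torch::sub
            sum += d * d;                     // torch::pow(.., 2) and torch::sum(.., 1)
        }
        if (sum > best) {                     // torch::topk(.., 1), largest by default
            best = sum;
            best_idx = r;
        }
    }
    return {best, best_idx};
}
```

For example, with base rows (0,0), (3,4), (1,1) and match (0,0), the row sums are 0, 25 and 2, so the result is the value 25 at index 1.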


my test:
I measured the time taken by the serial version and by the parallel version.
in serial way:
The task function uses the default stream and does the same work. I use a for loop to run the tasks one after another.
for(i=0; i<taskcnt; ++i)

in parallel way:
I use a for loop to add the tasks to the thread pool. Each thread uses a new stream.
for(i=0; i<taskcnt; ++i)
    tp_threadpool_add_task(&pool, task_thread, arg);
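
A minimal sketch of how such a serial versus parallel comparison could be timed, written in standard C++ so it runs without a GPU. The names `fake_task`, `time_ms`, `time_serial` and `time_parallel` are hypothetical, plain `std::thread` stands in for the thread pool, and the task just sleeps. The point it illustrates is that the clock for the parallel case should stop only after every worker has finished.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real task: sleeps for 20 ms.
void fake_task() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}

// Wall-clock time of fn in milliseconds.
double time_ms(const std::function<void()>& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Run the tasks one after another on the calling thread.
double time_serial(int taskcnt) {
    return time_ms([taskcnt] {
        for (int i = 0; i < taskcnt; ++i) fake_task();
    });
}

// Run each task on its own thread and wait for all of them.
double time_parallel(int taskcnt) {
    return time_ms([taskcnt] {
        std::vector<std::thread> workers;
        for (int i = 0; i < taskcnt; ++i) workers.emplace_back(fake_task);
        // Join before the timer stops; otherwise only the submission is timed.
        // Assumption for the real GPU program: each task would also need to
        // wait for its stream (for example mystream.synchronize()) before
        // returning, because CUDA kernels are launched asynchronously.
        for (auto& w : workers) w.join();
    });
}
```

Calling `time_serial(taskcnt)` and `time_parallel(taskcnt)` with the same `taskcnt` gives the two numbers to compare. With the sleeping placeholder the parallel time would normally come out well below the serial one.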

However, I found that the serial version runs faster than the parallel version.

Could you give me some advice on how to improve the parallel version?