PyTorch, threading, multiple GPUs

I have a very compute-intensive task involving matrices. I want to pass a tensor to a GPU in a separate thread and get back the result of the computation.
I created a class, Worker, with a compute method that does all the work and returns the result. Now I want to pass 4 class instances, along with tensors, to separate threads so the computation runs on all 4 of my GPUs.
The code:

workers = [
    Worker(64, device=torch.device('cuda:0')),
    Worker(64, device=torch.device('cuda:1')),
    Worker(64, device=torch.device('cuda:2')),
    Worker(64, device=torch.device('cuda:3')),
]
matrices = [tensor1, tensor2, tensor3, tensor4]
output = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for worker, matr in zip(workers, matrices):
        output.append(executor.submit(worker.compute, matr))

But output[0].result() throws the following error:

“CUDA error: an illegal memory access was encountered”

I think the code inside the class is fine, because everything works on every GPU when run without threads.
I am new to PyTorch, so please help.

This morning I found out that the problem may be caused by a custom CUDA kernel I use; without it, everything works. Anyway, I would appreciate any suggestions and good practices on using threading with PyTorch, because I am not sure my code uses it the right way.
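For clarity, here is a minimal stand-in version of the pattern I am using, with a dummy Worker that just squares its input instead of touching a GPU (no PyTorch involved, purely to show the submit-and-collect logic):

```python
import concurrent.futures

class Worker:
    """Stand-in for the real GPU worker; squares a nested-list 'matrix'."""
    def __init__(self, size, device=None):
        self.size = size
        self.device = device

    def compute(self, matr):
        # placeholder for the real tensor computation
        return [[x * x for x in row] for row in matr]

workers = [Worker(64, device=f'cuda:{i}') for i in range(4)]
matrices = [[[i, i + 1]] for i in range(4)]
output = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for worker, matr in zip(workers, matrices):
        # keep the Future so the result can be retrieved later
        output.append(executor.submit(worker.compute, matr))

# leaving the `with` block waits for all submitted tasks to finish
results = [f.result() for f in output]
```

With the dummy worker this runs without errors, so the threading scaffolding itself seems fine.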

Why do you need multithreading?
You can do everything in the main thread, whether you want to run multiple instances on the same GPU or one instance per GPU.

I think in that case the tasks will run sequentially, not simultaneously.
What I want is simultaneous computation on all of my GPUs.
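If you do stay with threads, one pattern that is often suggested (a sketch; the sizes are illustrative and it falls back to CPU when no GPU is present) is to set the current CUDA device inside each thread before launching any kernel. Custom CUDA kernels frequently assume the current device matches their input tensors, and a mismatch is a common cause of illegal memory access errors:

```python
import concurrent.futures
import torch

def compute_on(device, matr):
    # make `device` this thread's current CUDA device first: a custom
    # kernel launched while another device is current can crash with
    # "an illegal memory access was encountered"
    if device.type == 'cuda':
        torch.cuda.set_device(device)
    x = matr.to(device)
    return (x @ x).cpu()

devices = ([torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
           or [torch.device('cpu')] * 4)
matrices = [torch.randn(64, 64) for _ in devices]
with concurrent.futures.ThreadPoolExecutor(max_workers=len(devices)) as ex:
    futures = [ex.submit(compute_on, d, m) for d, m in zip(devices, matrices)]
    results = [f.result() for f in futures]
```

Whether this fixes the crash depends on what the custom kernel does; checking that it respects the device of its input tensors is the first thing to verify.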