Is evaluating the network thread-safe?

m2q · February 21, 2019, 1:24am

I’m sorry if the answer to this question is obvious, but I’m not sure: If I have multiple parallel threads, and each thread has its own input tensor; will evaluating the net->forward() from each thread happen in parallel?

(Btw you people have done insanely good work with LibTorch!!! Keep it up)

albanD · February 21, 2019, 10:17am

Hi,

I think forward ops are thread-safe, but backward ops are not.
@goldsborough should be able to give you a more decisive answer for libtorch.

yf225 · February 21, 2019, 9:22pm

I synced with @goldsborough and here is the answer on thread safety:

net->forward() is just an interface, and users can put whatever they want into that method. So technically there is no guarantee that it is thread-safe, unless we know for sure that it doesn’t mutate any field of the net in place.

For example, if net contains a weight_ tensor, and one thread calls mul_() on it while another thread reads from it at the same time, there’s a race. We could make access to this mul_() call thread-safe with a lock, but that is something users have to add on their own.
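
A minimal sketch of that locking pattern, in plain C++ rather than LibTorch: a std::vector<float> stands in for the weight_ tensor, and ScaledWeights, mul_in_place and run_concurrent_updates are hypothetical names made up for illustration. It only shows the idea that the writer and every reader take the same mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a module that owns a mutable weight_ buffer.
class ScaledWeights {
public:
    explicit ScaledWeights(std::vector<float> w) : weight_(std::move(w)) {}

    // In-place update, analogous to weight_.mul_(factor); takes the lock.
    void mul_in_place(float factor) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (float& w : weight_) w *= factor;
    }

    // Readers take the same lock, so they never see a half-updated buffer.
    float sum() const {
        std::lock_guard<std::mutex> lock(mutex_);
        float total = 0.0f;
        for (float w : weight_) total += w;
        return total;
    }

private:
    mutable std::mutex mutex_;
    std::vector<float> weight_;
};

// Starts `writers` threads that each double the weights once, plus one
// concurrent reader, and returns the final sum of the weights.
float run_concurrent_updates(std::size_t size, int writers) {
    assert(writers >= 0);
    ScaledWeights net(std::vector<float>(size, 1.0f));
    std::vector<std::thread> threads;
    for (int i = 0; i < writers; ++i)
        threads.emplace_back([&net] { net.mul_in_place(2.0f); });
    threads.emplace_back([&net] { (void)net.sum(); });  // concurrent read
    for (auto& t : threads) t.join();
    return net.sum();
}
```

Calling run_concurrent_updates(4, 3) doubles four weights three times, whatever order the threads happen to run in. Without the lock, the same program would have a data race between the writers and the reader.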

m2q · February 21, 2019, 10:04pm

Thanks for the answer, I should’ve been more specific. I’m still not quite sure what you meant by weight_. Consider this super basic case:

torch::nn::Linear fc1 = {nullptr}; // assume this is initialized
torch::Tensor forward(torch::Tensor input) override
{
        // softmax needs an explicit dimension; -1 selects the last one
        torch::Tensor image = torch::softmax(fc1->forward(input), /*dim=*/-1);
        return image;
}

Now the thing I do in each thread would be something like

auto input_tensor = torch::tensor({1.0f, 2.0f, 3.0f, 4.0f, 5.0f}); // float input, to match the layer's float weights
auto output_tensor = net->forward(input_tensor);

// read data
float f = output_tensor.data_ptr<float>()[0]; // or something like that

If we assume the weights are fixed and no fancy training is going on, can I read from output_tensor safely? Just want to make sure

yf225 · February 21, 2019, 10:11pm

In that case, since fc1 is not changed in forward(), it should be safe to run forward() on multiple threads.
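
A minimal sketch of why the read-only case is safe, again in plain C++ as a stand-in for LibTorch: TinyLinear and forward_in_threads are hypothetical names, and the layer is just a small matrix-vector product. The point is that forward is const and only reads the parameters, while each thread owns its input and its output slot.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a fixed linear layer: y = W x + b.
struct TinyLinear {
    std::vector<std::vector<float>> weight;  // [out][in]
    std::vector<float> bias;                 // [out]

    // const: forward only reads the parameters, so concurrent calls do not race.
    std::vector<float> forward(const std::vector<float>& input) const {
        std::vector<float> out(bias);
        for (std::size_t o = 0; o < weight.size(); ++o) {
            assert(weight[o].size() == input.size());
            for (std::size_t i = 0; i < input.size(); ++i)
                out[o] += weight[o][i] * input[i];
        }
        return out;
    }
};

// One thread per input; each thread writes only to its own output slot.
std::vector<std::vector<float>> forward_in_threads(
        const TinyLinear& net, const std::vector<std::vector<float>>& inputs) {
    std::vector<std::vector<float>> outputs(inputs.size());
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < inputs.size(); ++t)
        threads.emplace_back([&net, &inputs, &outputs, t] {
            outputs[t] = net.forward(inputs[t]);
        });
    for (auto& th : threads) th.join();
    return outputs;
}
```

The threaded results match what a serial loop over the same inputs would produce, because nothing shared is ever written.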

m2q · February 21, 2019, 10:17pm

AWESOME! Thanks so much for the help, I love pytorch.

Willem · November 22, 2019, 10:17pm

So just to be sure here: If I call

torch::NoGradGuard guard;
torch::Tensor prediction = model->forward(features);

from multiple parallel threads. Are these inferences going to interfere with each other? The forward is defined as in the above code…

yf225 · November 22, 2019, 10:34pm

Are these inferences going to interfere with each other?

The short answer is no, unless your forward code explicitly writes to the same parameter / buffer without any mutex mechanism.

Willem · November 23, 2019, 6:47am

Thanks. As mentioned, my forward mechanism is just running through a bunch of linear layers and relu. So I should be good… Thanks!

Willem · November 25, 2019, 9:41am

Sorry, one more related question after running some more tests:

Multicore computations aren’t giving the factor speedup that I would expect. This could be caused by torch::NoGradGuard containing a Mutex mechanism. Does it?

Does the forward code on standard components (e.g. linear layers) contain mutexes?

Any other reason why this doesn’t seem to parallelize completely?

albanD · November 25, 2019, 3:46pm

Hi,

torch::NoGradGuard does not contain any mutex, only a thread local variable.
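
To illustrate what "only a thread local variable" means, here is a minimal sketch of an RAII guard built on a thread_local flag. It is not the LibTorch implementation; grad_enabled, NoGradGuardSketch and flag_seen_by_other_thread are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread flag, standing in for the grad-mode state.
thread_local bool grad_enabled = true;

// RAII guard: clears the flag on construction and restores the previous
// value on destruction. No mutex is involved, because every thread has
// its own copy of the flag.
class NoGradGuardSketch {
public:
    NoGradGuardSketch() : previous_(grad_enabled) { grad_enabled = false; }
    ~NoGradGuardSketch() { grad_enabled = previous_; }
    NoGradGuardSketch(const NoGradGuardSketch&) = delete;
    NoGradGuardSketch& operator=(const NoGradGuardSketch&) = delete;

private:
    bool previous_;
};

// Returns the flag value that a freshly started thread sees while the
// calling thread is holding a guard.
bool flag_seen_by_other_thread() {
    NoGradGuardSketch guard;  // affects the calling thread only
    assert(!grad_enabled);
    bool seen = false;
    std::thread other([&seen] { seen = grad_enabled; });
    other.join();
    return seen;
}
```

One consequence worth checking in real code: with a thread-local flag like this, a guard created in one thread is not visible to threads started with std::thread or std::async, so each worker would need its own guard.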

There are no mutexes that I know of in any of the regular layers.

You might want to control the number of threads used by OpenMP (corresponding to this Python API). If your ops are large enough, they will parallelize internally and might try to use too many resources.
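
The "too many resources" concern can be put in numbers. The sketch below assumes, for illustration only, that every worker thread may start its own set of intra-op threads; intra_op_threads_per_worker and total_runnable_threads are hypothetical helpers, not LibTorch functions. In LibTorch itself the intra-op thread count can be set with at::set_num_threads.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: how many intra-op threads each worker can use
// without the total exceeding the number of cores.
int intra_op_threads_per_worker(int cores, int workers) {
    assert(cores > 0 && workers > 0);
    return std::max(1, cores / workers);
}

// Total threads that may be runnable at once for a given per-worker setting.
int total_runnable_threads(int workers, int intra_op_threads) {
    return workers * intra_op_threads;
}
```

With 16 cores and 16 workers this gives a budget of one intra-op thread per worker. If each of those 16 workers instead used 16 intra-op threads, up to 256 threads would compete for 16 cores.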

Willem · November 29, 2019, 12:20pm

Hi Alban:

Thanks. I have a fairly small network, 3 layers (20 x 100, 100 x 50, 50 x 30). Each time, I forward a single vector. It seems unlikely that it will parallelize this, right? (This is an RL setting, CPU-only.)

Still, my profiling shows that computing over 1 thread takes 10 seconds, while spawning 4 threads (with std::async) takes something like 8 seconds.

I have a 16 cpu-core system, so I cannot understand this.

Any ideas?

albanD · December 1, 2019, 5:11pm

If you run your program with OMP_NUM_THREADS=1 ./your_code, does it help?

Willem · December 2, 2019, 10:14am

Not really; it slows down both the parallel and the nonparallel version of the code, but doesn’t result in the expected speedup of parallel vs nonparallel.

I have done some profiling using gprof. The program spends a lot of time in the following three functions:

35% (of total):
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
29%:
c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset()
11%:
c10::intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> >::reset()

Following code uses the torch model:

auto features = torch::empty({1, static_cast<int64_t>(numFeats)});
for (size_t i = 0; i < numFeats; i++)
{
	features[0][i] = mdp.GetFeature(state, i);
}
// Note that NoGradGuard was called elsewhere
torch::Tensor prediction = model->forward(features);
std::vector<float> logprobs(prediction.data_ptr<float>(), prediction.data_ptr<float>() + prediction.numel());

Any idea what could be causing this? Maybe the memory allocations/deallocations? Is it possible to reuse the same memory for this?
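
On the memory-reuse question, one possible direction is sketched below in plain C++. It assumes the features can be written into a per-thread float buffer first; get_feature is a hypothetical stand-in for mdp.GetFeature, and fill_features and features_for are made-up names. Element-wise indexing such as features[0][i] creates temporary tensor handles on every iteration, which may contribute to the reference-counting functions seen in the profile; a plain buffer avoids that part. In LibTorch, such a buffer could then be wrapped once per call with torch::from_blob, which does not copy, but whether that removes the bottleneck here is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical feature source, standing in for mdp.GetFeature(state, i).
float get_feature(int state, std::size_t i) {
    return static_cast<float>(state) + 0.5f * static_cast<float>(i);
}

// Fills a caller-owned buffer. Once the buffer has reached the requested
// size, resize() no longer reallocates, so repeated calls reuse the memory.
void fill_features(int state, std::size_t num_feats, std::vector<float>& buffer) {
    buffer.resize(num_feats);
    for (std::size_t i = 0; i < num_feats; ++i)
        buffer[i] = get_feature(state, i);
}

// One buffer per thread, so threads never share or contend on it.
const std::vector<float>& features_for(int state, std::size_t num_feats) {
    assert(num_feats > 0);
    thread_local std::vector<float> buffer;
    fill_features(state, num_feats, buffer);
    return buffer;
}
```

The returned reference is only valid until the same thread calls features_for again, so the data has to be consumed (for example by the forward call) before the next fill.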

Definition of model:

torch::Tensor forward(torch::Tensor input)
{
	return torch::relu(linear3(torch::relu(linear2(torch::relu(linear1(input))))));
}
torch::nn::Linear linear1, linear2, linear3;

albanD · December 2, 2019, 4:01pm

Hi,

After asking other people: this should have greatly improved multithreaded performance. You might want to try again with a build from master, as it was merged about a week ago.

Also, since PyTorch used to be used only from Python, multithreading was not a thing (mostly because of the GIL). So this has been fairly recent work in the core of PyTorch, but it should get better very soon.

Also, if you have a model that behaves particularly badly and you can make a simple reproducible example, we can take a look at it.

Willem · December 3, 2019, 8:46am

Installing the newest PyTorch indeed did help, so that is great! Still, there seems to be significant (factor 2) slowdown if I run work in parallel, compared to running it serially.

I attach a reproducible example, where a lot of inference is done in parallel. (Note that the inference model is random here, to keep things simple. In production, this would be a pretrained model…) When the workers run in parallel, they take about 100 ms to complete. In series, they take about 45 ms. So, accounting for the fact that I run 16 cores, I get a 7x speedup from parallelization, where I would expect 14x to 16x, since this is (or should be) completely CPU-bound. I believe this model is too small (especially with a batch size of 1) for PyTorch to do meaningful parallelization under the hood here.
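
For reference, the arithmetic behind the roughly 7x figure can be written out as follows. This is a sketch under the assumption that all 16 workers do the same amount of work and that the quoted times are per worker; parallel_speedup and parallel_efficiency are hypothetical helper names.

```cpp
#include <cassert>

// Throughput speedup when `workers` tasks run concurrently, each taking
// `parallel_ms`, compared with running them one after another at
// `serial_ms` each.
double parallel_speedup(int workers, double serial_ms, double parallel_ms) {
    assert(workers > 0 && serial_ms > 0.0 && parallel_ms > 0.0);
    return workers * serial_ms / parallel_ms;
}

// Fraction of the ideal speedup (one per worker) that was achieved.
double parallel_efficiency(int workers, double serial_ms, double parallel_ms) {
    return parallel_speedup(workers, serial_ms, parallel_ms) / workers;
}
```

With 16 workers, 45 ms in series and 100 ms in parallel, the speedup is 16 * 45 / 100 = 7.2 and the efficiency is 0.45, that is, each worker runs at a bit less than half its serial speed.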

P.S. I hope I didn’t do anything stupid in the implementation here. Any improvement to the implementation is of course also very much appreciated!

albanD · December 3, 2019, 3:46pm

Thanks for the code sample. This is indeed a very small model!

Can you try replacing torch::NoGradGuard guard; by at::AutoNonVariableTypeMode non_var_type_mode(true);? This is not supposed to be public api but might help for now.

Willem · December 3, 2019, 7:59pm

Welcome. Note that this continues to be a problem when the model is a bit bigger (say 10 times bigger), though the problem becomes less pronounced.

I tried the replacement along the lines you suggest, but it does not make a substantial difference. Maybe there is some benefit, but it is hard to measure. Parallel is still about a factor of 2 slower than serial.

albanD · December 3, 2019, 8:14pm

One last thing: the BLAS libraries (like MKL) have a tendency to use compute cores aggressively, and I think that in a multithreaded environment they can actually slow each other down.
You definitely want to do batching if possible to avoid this.
Will keep you updated if I get further ideas.
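
One common reading of the batching advice, offered here as an assumption rather than as the reasoning given in this thread: stacking N input vectors into one N x in matrix turns N small forward calls into a single larger one, so fixed per-call costs (output allocation, and in a real framework also dispatch and bookkeeping) are paid once per batch instead of once per sample. A minimal plain C++ sketch with hypothetical names forward_single and forward_batch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;  // [rows][cols]

// One forward call per sample: y = W x.
std::vector<float> forward_single(const Matrix& weight, const std::vector<float>& x) {
    std::vector<float> y(weight.size(), 0.0f);
    for (std::size_t o = 0; o < weight.size(); ++o) {
        assert(weight[o].size() == x.size());
        for (std::size_t i = 0; i < x.size(); ++i) y[o] += weight[o][i] * x[i];
    }
    return y;
}

// One forward call for the whole batch; each row of `batch` is one sample.
// The output is a single flat allocation of size samples * outputs,
// laid out as out[sample * outputs + o].
std::vector<float> forward_batch(const Matrix& weight, const Matrix& batch) {
    const std::size_t outputs = weight.size();
    std::vector<float> out(batch.size() * outputs, 0.0f);
    for (std::size_t n = 0; n < batch.size(); ++n) {
        for (std::size_t o = 0; o < outputs; ++o) {
            assert(weight[o].size() == batch[n].size());
            for (std::size_t i = 0; i < batch[n].size(); ++i)
                out[n * outputs + o] += weight[o][i] * batch[n][i];
        }
    }
    return out;
}
```

Both functions compute the same numbers; the difference is only in how many calls and allocations are needed for N samples. Whether this helps in a given RL loop depends on whether several states are available to evaluate at the same time.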

Willem · December 3, 2019, 8:17pm

Thanks. Your idea that MKL might be behind this sounds plausible. I now compile with -skylake (not sure about exact relation to MKL). Will remove to see if it helps.

Also: do you have any idea if the compiler can be told to use one and only one core for each thread?

You mention that batching will help. Can you elaborate on how this will actually help? It seems that batching on each thread will only encourage the use of multiple cores, which will slow things down.
__
Compiling without --skylake support slows down both serial and parallel execution. The gap narrows a bit, but not much. Your hypothesis that in parallel mode there are fewer options to use multiple cores sounds very plausible. All in all, the parallel code is 3 times faster than when I started this thread, so thanks a lot for your help and responsiveness! (I would still love to hear how exactly you think batching would help…)