Is evaluating the network thread-safe?

Batching will help in the sense that instead of calling a small op X times, you call a single op that is X times larger. Even though in theory these do the same work, the second one reduces the framework overhead, since you pay that overhead once instead of X times (and PyTorch's per-call overhead is known to be expensive).
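Something like this, schematically (the tiny net and the sizes here are made up just to show the two call patterns):

```cpp
// Schematic only: a made-up tiny net, inference mode, no timing code.
#include <torch/torch.h>

int main() {
  torch::NoGradGuard no_grad;  // inference only
  auto net = torch::nn::Sequential(
      torch::nn::Linear(8, 32), torch::nn::ReLU(), torch::nn::Linear(32, 1));
  net->eval();

  const int64_t X = 1000;

  // X small calls: the framework overhead (dispatch, allocations, ...) is paid X times.
  for (int64_t i = 0; i < X; ++i) {
    auto out = net->forward(torch::randn({1, 8}));
  }

  // One X-times-larger call: same arithmetic, overhead paid once.
  auto out = net->forward(torch::randn({X, 8}));
  return 0;
}
```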


So I ran some more tests based on your remark that batching should help. I can confirm that the small batch size (of 1), together with the small model, is the culprit here. When I switch to a decent batch size (e.g. 100 or so), the parallelization speedup is close to 16 times (= the number of hardware threads). So C++ PyTorch allows for parallelization as advertised; great!
Apparently the “overhead” part of PyTorch does not parallelize very well, and I am still at a loss as to why. It might have to do with very frequent memory allocations and deallocations, but honestly, my guess is as good as anybody’s. In any case, the overhead only really matters for very small models with too-small batch sizes.
@albanD: Thanks a lot!
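For reference, roughly the kind of setup I tested (a sketch only; the model, thread count, and batch size here are placeholders, not my actual benchmark):

```cpp
// Placeholder model and numbers; the point is one shared module evaluated
// from several threads, each forward call using a decent batch size.
#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <thread>
#include <vector>

int main() {
  // Common precaution: keep intra-op threading at 1 when parallelizing
  // at this outer level, to avoid thread oversubscription.
  at::set_num_threads(1);

  auto net = torch::nn::Sequential(
      torch::nn::Linear(8, 32), torch::nn::ReLU(), torch::nn::Linear(32, 1));
  net->eval();

  const int num_workers = 16;  // e.g. number of hardware threads
  const int64_t batch = 100;   // "decent" batch size per forward call

  std::vector<std::thread> workers;
  for (int t = 0; t < num_workers; ++t) {
    workers.emplace_back([&net, batch] {
      torch::NoGradGuard no_grad;  // grad mode is thread-local
      for (int i = 0; i < 1000; ++i) {
        auto out = net->forward(torch::randn({batch, 8}));
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```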


Hi,
I hadn’t found this topic before because of its slightly misleading title, but it seems I ran into the same issue (only in my case the network is even smaller than the one used by @Willem). Batching does not help me much, because each worker calls forward and grad independently and in an unsynchronized manner (roughly the pattern sketched below). Does this “overhead” problem have any resolution? Maybe thread-local allocation?
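To be concrete, something like this (illustrative only; my real network, loss, and gradient targets differ, and here the gradient is taken w.r.t. the input just to keep the example self-contained):

```cpp
// Illustrative sketch: made-up net; each worker runs completely
// independently with batch size 1 and computes its own gradient.
#include <torch/torch.h>
#include <thread>
#include <vector>

int main() {
  auto net = torch::nn::Sequential(
      torch::nn::Linear(8, 32), torch::nn::ReLU(), torch::nn::Linear(32, 1));

  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t) {
    workers.emplace_back([&net] {
      for (int i = 0; i < 1000; ++i) {
        // Batch size 1, unsynchronized: each worker builds its own graph.
        auto x = torch::randn({1, 8}, torch::requires_grad());
        auto out = net->forward(x);
        // Gradient of the scalar output w.r.t. this worker's input.
        auto grads = torch::autograd::grad({out.sum()}, {x});
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```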