Multi-threaded inference on one GPU cannot fully utilize the GPU

I’m doing reinforcement learning with the C++ frontend on a single RTX 4090D GPU. I open multiple threads, assign a CUDA stream to each, and run inference with a single shared module. This works correctly, but it has performance issues:
1. With a very high thread count, throughput and GPU utilization decrease. The best thread count in my environment is 4, but even that utilizes only about 50% of the GPU.
2. If two processes with 4 threads each are launched, GPU utilization reaches 100% and throughput doubles.

It seems the multi-threaded path is blocked somewhere.

I finally found that it blocks in the LSTM forward pass, in get_dropout_state, on a static variable:

static std::vector<DropoutState> dropout_state_cache{
    static_cast<size_t>(cuda::getNumGPUs())};

And the related comment says:

// Every time we use a dropout state, we need to synchronize with its event,
// to make sure all previous uses finish running before this one starts. Once
// we’re done, we record the event to allow others to synchronize with this
// kernel. Those events are really needed only for inter-stream sync on a
// single GPU. I doubt anyone will want to run cuDNN RNNs in parallel on a
// single GPU, so they should end up being complete no-ops.

In fact the LSTM forward pass doesn’t need a DropoutState here (inference, with no dropout), but it still takes a mutex lock on it.
It turns out that multi-threaded C++ inference on a single device is not well supported here.