Parallelism on a single GPU (streams?)

I have a single GPU that my model does not fully utilize. For this reason, I’d like to run multiple forward calls in parallel on the same GPU.

I’d prefer to avoid multiprocessing if possible. As I understand it, CUDA streams should help with this, but I have not been able to get any performance improvement; it is as if the streams were executed sequentially anyway.
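
To make the question concrete, here is a minimal sketch of the pattern in question, assuming PyTorch and its `torch.cuda.Stream` / `torch.cuda.stream` API. The helper `run_on_streams`, the small `Linear` model, and the batch sizes are hypothetical placeholders for illustration only; the sketch falls back to plain sequential calls when no GPU is available.

```python
import contextlib
import torch


def run_on_streams(model, batches):
    """Issue one forward pass per batch, each on its own CUDA stream.

    Hypothetical helper for illustration. On CPU tensors it simply
    runs the forward passes one after another.
    """
    use_cuda = batches[0].is_cuda
    outputs, streams = [], []
    with torch.no_grad():
        for batch in batches:
            if use_cuda:
                stream = torch.cuda.Stream()
                # The side stream must wait for work already queued on the
                # current stream (e.g. creation of the input batch).
                stream.wait_stream(torch.cuda.current_stream())
                streams.append(stream)
                ctx = torch.cuda.stream(stream)
            else:
                ctx = contextlib.nullcontext()
            with ctx:
                # Kernels launched here are queued on `stream` (if any).
                outputs.append(model(batch))
    # Make the current stream wait for every side stream before the
    # outputs are used.
    for stream in streams:
        torch.cuda.current_stream().wait_stream(stream)
    return outputs


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(64, 8).to(device).eval()  # placeholder model
batches = [torch.randn(16, 64, device=device) for _ in range(4)]

outputs = run_on_streams(model, batches)
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for all queued GPU work to finish
```

Streams only permit kernels to overlap; they do not guarantee it, so a pattern like this can still end up running one forward pass after another.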

Can I obtain parallelism on a single GPU by using CUDA streams, without multiprocessing?