CUDA streams not running in parallel for element-wise/matrix extraction

This post might be helpful: it explains the compute resources involved and links to a great GTC talk that covers them in more detail.