When I profile a simple model such as resnet50 with nvprof or the PyTorch profiler, I can see that it uses multiple streams. However, the number of streams differs depending on whether I feed the data in fp32 or in fp16:
    for i in trange(len(dataloader)):
        data, *_ = next(data_iter)
        # for data, *_ in dataloader:
        if use_half:
            data = data.type(torch.float16).to(args.device)
        else:
            data = data.to(args.device)
        log_probs = model(data)
So I am curious: if I build a simple model, such as a sequential (Conv * 5 + MaxPool + Linear) model or a resnet50, without creating any torch.cuda.Stream() myself, how does PyTorch assign the kernels to different CUDA streams?
I checked the C++ code related to CUDA streams in torch/csrc/cuda/Stream.cpp, c10/cuda/CUDAStream.cpp, and c10/core/Stream.h. These files provide the interface to cudaStream_t and show how streams are fetched from the stream pools. What I am missing is the linking mechanism: how PyTorch uses CUDA streams by default. Could you shed some light on this? Thanks in advance.
[Edit]: I just checked. I think that many of the CUDA operators that do not call cuDNN or cuBLAS, that is, those implemented directly in PyTorch under aten/src/ATen/native/cuda, pass the current CUDA stream when launching their kernels, as in:
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
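The same "current stream" can be inspected from Python. The following is a minimal sketch, not a statement about PyTorch internals: with no explicit torch.cuda.Stream(), torch.cuda.current_stream() is expected to report the device's default stream, and it only changes inside a torch.cuda.stream(...) context. The helper name current_stream_report is made up for illustration, and it returns None when PyTorch or a CUDA device is not available.

```python
import importlib.util


def current_stream_report():
    """Compare the current CUDA stream with the default one.

    Hypothetical helper for illustration; returns None without PyTorch/CUDA.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None

    report = {}
    # No stream was selected explicitly, so the thread-local current
    # stream is expected to be the default stream of the device.
    report["default_is_current"] = (
        torch.cuda.current_stream() == torch.cuda.default_stream()
    )

    side = torch.cuda.Stream()  # a stream taken from PyTorch's pool
    with torch.cuda.stream(side):
        # Inside the context, ops launched from this thread use `side`.
        report["side_is_current"] = torch.cuda.current_stream() == side

    # After leaving the context, the previous current stream is restored.
    report["restored"] = (
        torch.cuda.current_stream() == torch.cuda.default_stream()
    )
    return report


if __name__ == "__main__":
    print(current_stream_report())
```

If all three entries come back True, kernels launched through getCurrentCUDAStream() in a plain forward pass would all be going to the default stream, so any extra streams in the profile would have to come from somewhere else.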
For the other operators, which call into cuDNN, cuBLAS, etc., I cannot find a CUDA stream argument being passed in the call, e.g. for convolution in aten/src/ATen/native/cudnn/Conv_v7.cpp:
    AT_CUDNN_CHECK_WITH_SHAPES(cudnnConvolutionForward(
        args.handle,
        &one, args.idesc.desc(), input.data_ptr(),
        args.wdesc.desc(), weight.data_ptr(),
        args.cdesc.desc(), fwdAlgPerf.algo, workspace.data_ptr(), fwdAlgPerf.memory,
        &zero, args.odesc.desc(), output.data_ptr()),
      args, "Forward algorithm: ", static_cast<int>(fwdAlgPerf.algo), "\n");
    }
So I guess the cuDNN library manages the CUDA streams from there on (unfortunately, we cannot see its source code). Am I right?
If so, it might be that libcudnn.so implements fp16 with a different algorithm than fp32, possibly one that uses Tensor Cores, and that this algorithm uses a different CUDA stream mechanism.
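One way to check this guess would be to count the streams per precision directly from a profiler trace. The sketch below uses torch.profiler and export_chrome_trace; the helper names stream_ids_in_trace and profile_streams are hypothetical, and reading the stream id from args["stream"] is an assumption about how the exported trace is laid out, not a documented guarantee.

```python
import json


def stream_ids_in_trace(trace):
    """Collect the distinct stream ids found in a Chrome-trace dict.

    Assumes GPU events store the id under args["stream"]; that layout is
    an assumption about the exported trace, not a documented guarantee.
    """
    ids = set()
    for event in trace.get("traceEvents", []):
        args = event.get("args")
        if isinstance(args, dict) and "stream" in args:
            ids.add(args["stream"])
    return ids


def profile_streams(model, data, path):
    """Profile one forward pass, export the trace, return its stream ids.

    Needs PyTorch and a CUDA device; `model` and `data` must already be
    on the GPU and share the same dtype.
    """
    import torch
    from torch.profiler import ProfilerActivity, profile

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(data)
        # Wait for the GPU so that all kernels are recorded.
        torch.cuda.synchronize()
    prof.export_chrome_trace(path)
    with open(path) as f:
        return stream_ids_in_trace(json.load(f))
```

On a CUDA machine, comparing profile_streams(model, data, "fp32.json") with profile_streams(model.half(), data.half(), "fp16.json") would show whether the two precisions really touch different sets of streams. Note that model.half() converts the model in place, so the fp32 run has to come first.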