I try to do some benchmarking of the SDDMM operator in DGL on GPU and CPU. However, I observe some inconsistency between the data transfer time measurement done in pytorch (DGL use pytorch as backend) and NvProf. (data transfer time between host/cpu and devices/gpu)
Here is the problem: Based on the python measurement, the Host to Device transfer time should be 123.78ms. (transfer graph)+672.80ms(transfer feature vector). However, the time reported by NvProf suggest the Host to Device data transfer time is 672.59ms, which is the total time for Host to Device data transfer time even it consists of three calls. Obviously, the graph transfer time is not accounted. I did some followup study. It seems the graph transfer time is about several ms, which is shorter than the time reported in python measurement.
On the other hand, the Device to Host data seems align with each other. The measurement done in python suggest the time is 1784.44ms + 7.0600ms. The NvProf report it as 1.78915s, which is the total time for Host to Device data transfer time even it consists of three calls. It seems both methods agree with each other.
The kernel execution time more or less agree with each other. 2.42782s/13 = 186.77ms. The NvProf agree with the measurement in python.
As such, I would like to know which one is the most accurate one that report the data transfer time? Is there any bug in the benchmarking code that makes the data transfer time inaccurate? How should I measure data transfer from/to GPU correctly?
Any help is appreciated!
I attach the benchmarking code, time reported by python, and measurement reported by NvProf.
Here is the benchmarking code.
import dgl import torch import dgl.ops as F import time import csv cpu_dry_run = gpu_dry_run=3 cpu_benchmark_run =gpu_benchmark_run = 10 f = open('./result_dgl_test.csv', 'w') writer = csv.writer(f) feature_size_list =  nodes_idx_list =  node_num_factor_list =  # choose node ID from nodes_num writer.writerow(["feature size","total nodes num", "num edges","CPU Time","GPU Time"]) for node_num_factor in node_num_factor_list: for nodes_idx in nodes_idx_list: for feature_size in feature_size_list: # nodes ID range from 0 to nodes_num -1 # the actual number of nodes is the nodes_num*edge_num_factor src = torch.randint(nodes_idx,(nodes_idx*node_num_factor,)) dst = torch.randint(nodes_idx,(nodes_idx*node_num_factor,)) # construct the graph g = dgl.graph((src, dst)) print(g.device) y=torch.randn(g.num_nodes(), feature_size) # transfer the graph torch.cuda.synchronize() start_graph_transfer = time.time() g_cuda = g.to('cuda:0') torch.cuda.synchronize() end_graph_transfer = time.time() print("graph HtoD time: ",(end_graph_transfer - start_graph_transfer)*1000,"ms") # transfer the feature vector start_feature_transfer = time.time() y_cuda = y.to('cuda:0') torch.cuda.synchronize() end_feature_transfer = time.time() print("feature HtoD time: ",(end_feature_transfer - start_feature_transfer)*1000,"ms") # do some dry run for i in range(gpu_dry_run): test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v") # synchronize torch.cuda.synchronize() start = time.time() for i in range(gpu_benchmark_run): test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v") torch.cuda.synchronize() end = time.time() gpu_time_per_iteration = (end - start)/float(gpu_benchmark_run) print("gpu",gpu_time_per_iteration*1000," ms") start_result_transfer = time.time() y_host=y_cuda.to(torch.device('cpu')) torch.cuda.synchronize() end_result_transfer = time.time() print("feature DtoH time: ",(end_result_transfer - start_result_transfer)*1000,"ms") start_graph_post_transfer = time.time() g_host=g_cuda.to(torch.device('cpu')) torch.cuda.synchronize() end_graph_post_transfer = time.time() print("graph DtoH time: ",(end_graph_post_transfer - start_graph_post_transfer)*1000,"ms") for i in range(cpu_dry_run): test=F.gsddmm(g,"dot",y,y,"u","v") start = time.time() for i in range(cpu_benchmark_run): test=F.gsddmm(g,"dot",y,y,"u","v") end = time.time() cpu_time_per_iteration = (end - start)/float(cpu_benchmark_run) print("cpu: ",cpu_time_per_iteration*1000," ms") writer.writerow([feature_size,g.num_nodes(),g.num_edges(), cpu_time_per_iteration*1000,gpu_time_per_iteration*1000]) f.close()
Output of python measurement
graph HtoD time: 123.78549575805664 ms feature HtoD time: 672.8084087371826 ms gpu 186.7778778076172 ms feature DtoH time: 1784.4436168670654 ms graph DtoH time: 7.060050964355469 ms cpu: 2687.334632873535 ms
output from NvProf:
$nvprof --openacc-profiling off python test.py
==20523== Profiling application: python test.py ==20523== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 49.65% 2.42782s 13 186.76ms 186.73ms 186.80ms void dgl::aten::cuda::SDDMMCooTreeReduceKernel<long, float, bool=0, bool=0, int=0, int=2>(float const *, float const , dgl::aten::cuda::SDDMMCooTreeReduceKernel<long, float, bool=0, bool=0, int=0, int=2>*, long const *, long const , long const , long, long, long, long, long const *, long const , long, long, long) 36.59% 1.78915s 3 596.38ms 2.4019ms 1.78375s [CUDA memcpy DtoH] 13.76% 672.59ms 3 224.20ms 1.3035ms 669.99ms [CUDA memcpy HtoD] 0.00% 100.23us 13 7.7100us 7.3920us 8.0640us void at::native::vectorized_elementwise_kernel<int=4, at::native::FillFunctor<float>, at::detail::Array<char*, int=1>>(int, float, at::native::FillFunctor<float>) API calls: 60.48% 3.96078s 7 565.83ms 3.5270us 1.86660s cudaDeviceSynchronize 39.45% 2.58347s 6 430.58ms 1.2656ms 1.78431s cudaMemcpyAsync 0.04% 2.8964ms 3 965.46us 171.75us 2.5392ms cudaMalloc 0.01% 381.72us 2 190.86us 184.74us 196.97us cuDeviceTotalMem 0.00% 196.90us 199 989ns 92ns 44.063us cuDeviceGetAttribute 0.00% 192.80us 26 7.4150us 4.5920us 29.041us cudaLaunchKernel 0.00% 191.61us 4 47.902us 4.2050us 174.83us cudaStreamSynchronize 0.00% 161.05us 2 80.524us 79.216us 81.832us cudaGetDeviceProperties 0.00% 65.128us 365 178ns 91ns 3.5310us cuGetProcAddress 0.00% 60.072us 171 351ns 193ns 2.4380us cudaGetDevice 0.00% 34.584us 2 17.292us 16.233us 18.351us cuDeviceGetName 0.00% 12.732us 3 4.2440us 1.9840us 8.1440us cudaStreamIsCapturing 0.00% 11.261us 4 2.8150us 1.5570us 5.4280us cudaSetDevice 0.00% 7.3280us 29 252ns 162ns 684ns cudaGetLastError 0.00% 5.8970us 1 5.8970us 5.8970us 5.8970us cuDeviceGetPCIBusId 0.00% 2.1270us 1 2.1270us 2.1270us 2.1270us cuInit 0.00% 1.6290us 4 407ns 114ns 1.0420us cuDeviceGetCount 0.00% 1.0850us 3 361ns 154ns 704ns cudaGetDeviceCount 0.00% 996ns 3 332ns 94ns 561ns cuDeviceGet 0.00% 589ns 3 196ns 104ns 369ns cuDevicePrimaryCtxGetState 0.00% 454ns 2 227ns 187ns 267ns cuDeviceGetUuid 0.00% 146ns 1 146ns 146ns 146ns cuDriverGetVersion