```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'

import torch
import torch.nn as nn


def main():
    # 3x3 conv, 64 -> 64 channels, on a 56x56 feature map; timed on the GPU with CUDA events
    test_data64 = torch.randn(1, 64, 56, 56).cuda()
    m_baseline64 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False).cuda()
    time1_list = []
    for i in range(10000):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        m_baseline64(test_data64)
        end.record()
        torch.cuda.synchronize()  # wait for the kernel to finish before reading the timer
        time1_list.append(start.elapsed_time(end))  # milliseconds
    # discard the first 1000 iterations as warm-up
    print('m_baseline64: %.8f' % (sum(time1_list[1000:]) / len(time1_list[1000:])))

    # same layer shape, but with 36 input/output channels
    test_data36 = torch.randn(1, 36, 56, 56).cuda()
    m_baseline36 = nn.Conv2d(36, 36, kernel_size=3, stride=1, padding=1, bias=False).cuda()
    time2_list = []
    for i in range(10000):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        m_baseline36(test_data36)
        end.record()
        torch.cuda.synchronize()
        time2_list.append(start.elapsed_time(end))
    print('m_baseline36: %.8f' % (sum(time2_list[1000:]) / len(time2_list[1000:])))


if __name__ == '__main__':
    main()
```
The theoretical FLOPs for `m_baseline64` should be about 3.16 times that of `m_baseline36`.
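For reference, here is the back-of-the-envelope count behind that 3.16x figure (the `conv2d_flops` helper is just for illustration; it counts one multiply and one add per MAC and ignores bias, since `bias=False`):

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out):
    # one k*k*c_in dot product per output element, c_out output channels,
    # two FLOPs (multiply + add) per MAC
    return 2 * c_in * c_out * k * k * h_out * w_out

flops64 = conv2d_flops(64, 64, 3, 56, 56)  # ~231.2 MFLOPs
flops36 = conv2d_flops(36, 36, 3, 56, 56)  # ~73.2 MFLOPs
print(flops64 / flops36)                   # 4096 / 1296 ≈ 3.16
```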
However, when I run the code in GPU mode, the measured times for m_baseline64 and m_baseline36 are 0.100s and 0.089s, respectively, which is very strange. When I run the code in CPU mode, the measured times are 4.536s and 1.684s, which roughly matches the theoretical ratio.
Can anyone help me?
I think it happens because, no matter whether you use 36 or 64 filters, the GPU computes them in parallel: each filter is independent of the others.
GPUs are massively multi-core machines. If you check the specifications, a GPU has thousands of cores rather than 4-10 like a CPU, so it can run many operations in parallel. That's why GPUs speed up machine learning: ML workloads are basically thousands of independent operations, and each filter in a convolution is totally independent of the others.
It's easier if you google why GPUs speed up ML, because the full explanation is long.
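A quick way to see this is to sweep the channel count with the same 56x56 input and watch how little the measured time changes while the GPU stays under-utilised (a rough sketch; the channel counts and iteration numbers are arbitrary):

```python
import torch
import torch.nn as nn

def time_conv(channels, iters=1000, warmup=100):
    x = torch.randn(1, channels, 56, 56, device='cuda')
    conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False).cuda()
    times = []
    for i in range(warmup + iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        conv(x)
        end.record()
        torch.cuda.synchronize()
        if i >= warmup:
            times.append(start.elapsed_time(end))  # milliseconds
    return sum(times) / len(times)

# on such a small input the time barely moves with the channel count,
# because the extra filters mostly occupy cores that were otherwise idle
for c in (16, 36, 64, 128, 256):
    print(c, '%.4f ms' % time_conv(c))
```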
Although the GPU specialises in parallel computation, it only performs vectorized math (here, on tensors), not general-purpose work that has to run sequentially; even callback mechanics can misbehave on GPUs. Also, each individual GPU core is significantly weaker than a CPU core.
In GPU programming we also tend to prefer high-level APIs and avoid dealing with threads, processes, and so on directly.
@JuanFMontesinos I have found that the real-time speedup on CPU also does not match the theoretical FLOPs reduction rate. What else could add extra time cost?
Well, the CPU is a complex world. It depends a lot on which library you use, how well optimized it is, which CPU model you have… Note that not all CPUs can execute the same basic operations.
Toy example: suppose you have a CPU that can add but not multiply. It will need a lot of cycles to perform a multiplication, whereas a CPU with a dedicated multiply unit will be much faster.
In fact there is one specific operation that is very important for deep learning whose name I forgot; it's a multiplication plus a sum.
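(That sounds like the fused multiply-add, or FMA / multiply-accumulate.) As a purely elementwise illustration in PyTorch (whether an actual FMA instruction gets emitted depends on the backend):

```python
import torch

a = torch.randn(1000)
b = torch.randn(1000)
acc = torch.zeros(1000)

# multiply-accumulate in a single call: acc + a * b
acc = torch.addcmul(acc, a, b)
```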
I have tested the real-time reduction rate on the convolution operation, and the theoretical FLOPs reduction rate is always higher than the measured time reduction. For example, when the theoretical FLOPs reduction is 26.04%, the measured time reduction is about 17%; when the theoretical FLOPs reduction is 43.75%, the measured time reduction is about 35%. I don't know what causes the additional time cost.
I have no definitive answer for that, as it's hardware-dependent. Are you timing only the convolution itself, without counting the time required to load the tensors onto the GPU and so on?
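If not, something like the sketch below (warm-up and iteration counts are arbitrary) keeps the host-to-device copies and cuDNN's algorithm selection out of the measurement:

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # let cuDNN pick its fastest algorithm

x = torch.randn(1, 64, 56, 56, device='cuda')  # already on the GPU before timing starts
conv = nn.Conv2d(64, 64, 3, padding=1, bias=False).cuda()

with torch.no_grad():
    for _ in range(100):            # warm-up: algorithm selection, kernel caching
        conv(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(1000):           # time only the convolution itself
        conv(x)
    end.record()
    torch.cuda.synchronize()
    print('%.6f ms per call' % (start.elapsed_time(end) / 1000))
```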