I have been training a multi-layer ResNet with the following structure:
DataParallel(
  (module): resnet(
    (phi): Tanh()
    (stack): ModuleList(
      (0): Linear(in_features=5000, out_features=3000, bias=True)
      (1): Block(
        (L1): Linear(in_features=3000, out_features=3000, bias=True)
        (L2): Linear(in_features=3000, out_features=3000, bias=True)
        (phi): Tanh()
      )
      (2): Block(
        (L1): Linear(in_features=3000, out_features=3000, bias=True)
        (L2): Linear(in_features=3000, out_features=3000, bias=True)
        (phi): Tanh()
      )
      (3): Block(
        (L1): Linear(in_features=3000, out_features=3000, bias=True)
        (L2): Linear(in_features=3000, out_features=3000, bias=True)
        (phi): Tanh()
      )
      (4): Linear(in_features=3000, out_features=1460, bias=True)
    )
  )
)
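For completeness, here is a minimal sketch of module definitions consistent with this printout. The names (resnet, Block, stack, phi, L1, L2) and the layer shapes are read off the printout above; the residual wiring inside Block's forward is my assumption:

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, width=3000):
        super().__init__()
        self.L1 = nn.Linear(width, width)
        self.L2 = nn.Linear(width, width)
        self.phi = nn.Tanh()

    def forward(self, x):
        # assumed residual connection: x + L2(phi(L1(x)))
        return x + self.L2(self.phi(self.L1(x)))

class resnet(nn.Module):
    def __init__(self, in_dim=5000, width=3000, out_dim=1460):
        super().__init__()
        self.phi = nn.Tanh()
        self.stack = nn.ModuleList(
            [nn.Linear(in_dim, width)]
            + [Block(width) for _ in range(3)]
            + [nn.Linear(width, out_dim)]
        )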
I have noticed that the input layer
Linear(in_features=5000, out_features=3000, bias=True)
takes far more time than the rest:
1.201575756072998 layer:0
while in the same pass the other layers only take:
0.0006377696990966797 layer:1
0.0002562999725341797 layer:2
0.00022459030151367188 layer:3
7.271766662597656e-05 layer:4
Why does this happen?
...
x_input = torch.rand(1, 5000).to(device)
pre = model(x_input)
...
def forward(self, x):
    # apply each layer in the stack and time it
    for i in range(len(self.stack)):
        t1 = time.time()
        x = self.stack[i](x)
        t2 = time.time()
        print(t2 - t1, 'layer:{}'.format(i))
    return x
# running on 4x NVIDIA RTX 3090
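In case it matters: CUDA kernels launch asynchronously, so I wonder whether time.time() is charging queued work and one-time startup cost to layer 0. A synchronized variant of the timing loop (the torch.cuda.synchronize() calls are the only addition) would be a sketch like:

import time
import torch

def forward(self, x):
    for i in range(len(self.stack)):
        torch.cuda.synchronize()  # wait for any pending GPU work before starting the clock
        t1 = time.time()
        x = self.stack[i](x)
        torch.cuda.synchronize()  # wait until this layer's kernels actually finish
        t2 = time.time()
        print(t2 - t1, 'layer:{}'.format(i))
    return x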