Timing a multi-layer ResNet

I have been training a multi-layer ResNet with the following structure:


DataParallel(
  (module): resnet(
    (phi): Tanh()
    (stack): ModuleList(
      (0): Linear(in_features=5000, out_features=3000, bias=True)
      (1): Block(
        (L1): Linear(in_features=3000, out_features=3000, bias=True)
        (L2): Linear(in_features=3000, out_features=3000, bias=True)
        (phi): Tanh()
      )
      (2): Block(
        (L1): Linear(in_features=3000, out_features=3000, bias=True)
        (L2): Linear(in_features=3000, out_features=3000, bias=True)
        (phi): Tanh()
      )
      (3): Block(
        (L1): Linear(in_features=3000, out_features=3000, bias=True)
        (L2): Linear(in_features=3000, out_features=3000, bias=True)
        (phi): Tanh()
      )
      (4): Linear(in_features=3000, out_features=1460, bias=True)
    )
  )
)
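
For reference, a minimal sketch of module definitions that would produce this printed structure. Only the submodule names and layer shapes come from the printout; the residual formula inside Block and the plain forward of resnet are assumptions (the timed forward appears further below):

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, width=3000):
        super().__init__()
        self.L1 = nn.Linear(width, width)
        self.L2 = nn.Linear(width, width)
        self.phi = nn.Tanh()

    def forward(self, x):
        # assumed residual connection: x + L2(phi(L1(x)))
        return x + self.L2(self.phi(self.L1(x)))

class resnet(nn.Module):
    def __init__(self, in_dim=5000, width=3000, out_dim=1460, n_blocks=3):
        super().__init__()
        self.phi = nn.Tanh()
        self.stack = nn.ModuleList(
            [nn.Linear(in_dim, width)]
            + [Block(width) for _ in range(n_blocks)]
            + [nn.Linear(width, out_dim)]
        )

    def forward(self, x):
        for layer in self.stack:
            x = layer(x)
        return x

device = 'cuda'
model = nn.DataParallel(resnet()).to(device)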

I have noticed that the input layer

Linear(in_features=5000, out_features=3000, bias=True)

takes

1.201575756072998 layer:0

while at the same time the remaining layers take

0.0006377696990966797 layer:1
0.0002562999725341797 layer:2
0.00022459030151367188 layer:3
7.271766662597656e-05 layer:4

Why does this happen?

...
x_input = torch.rand(1, 5000).to(device)
pre = model(x_input)
...
    def forward(self, x):
        # time each layer in the stack
        for i in range(len(self.stack)):
            t1 = time.time()
            x = self.stack[i](x)
            t2 = time.time()
            print(t2 - t1, 'layer:{}'.format(i))
        return x

# with 4x NVIDIA RTX 3090

CUDA operations are executed asynchronously, so you would have to synchronize the code via torch.cuda.synchronize() before starting and stopping the timer; otherwise you would profile e.g. the CUDA context creation, the kernel launches, etc.
Also, you should add warm-up iterations and measure the average time over multiple iterations.
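
A minimal sketch of such a measurement, assuming model and x_input already live on the GPU (the iteration counts here are arbitrary):

import time
import torch

# warm-up: the first iterations pay for one-time costs such as CUDA context
# creation and memory allocation, so exclude them from the timing
for _ in range(10):
    _ = model(x_input)
torch.cuda.synchronize()

# timed iterations: synchronize before reading the clock on both ends
n_iters = 100
t1 = time.time()
for _ in range(n_iters):
    _ = model(x_input)
torch.cuda.synchronize()
t2 = time.time()
print('avg forward time: {:.6f} s'.format((t2 - t1) / n_iters))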


I have changed the code:

    def forward(self, x):
        for i in range(len(self.stack)):
            torch.cuda.synchronize()
            t1 = time.time()
            x = self.stack[i](x)
            torch.cuda.synchronize()
            t2 = time.time()
            print(t2 - t1, 'layer:{}'.format(i))
        return x

and ran more iterations, which gives the following running times:

iteration: 0
1.0876386165618896 layer:0
0.0024077892303466797 layer:1
0.002411365509033203 layer:2
0.002465963363647461 layer:3
0.002351045608520508 layer:4
1.097477674484253 total
iteration: 1
0.002420186996459961 layer:0
0.0024178028106689453 layer:1
0.002460479736328125 layer:2
0.0024292469024658203 layer:3
0.0023517608642578125 layer:4
0.01389312744140625 total
iteration: 2
0.0024149417877197266 layer:0
0.002396106719970703 layer:1
0.0003743171691894531 layer:2
0.00017976760864257812 layer:3
7.557868957519531e-05 layer:4
0.0074841976165771484 total
iteration: 3
0.0002980232238769531 layer:0
0.0004563331604003906 layer:1
0.0024309158325195312 layer:2
0.0023963451385498047 layer:3
0.0023696422576904297 layer:4
0.008256196975708008 total
...
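
A sketch of the kind of driver loop that would produce the output above; the loop itself is not shown in the post, so the names and the iteration count are assumptions:

n_iters = 10  # assumed
for it in range(n_iters):
    print('iteration:', it)
    torch.cuda.synchronize()
    t_start = time.time()
    pre = model(x_input)   # forward() prints the per-layer times
    torch.cuda.synchronize()
    t_end = time.time()
    print(t_end - t_start, 'total')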

I am wondering: what is different about the so-called warm-up iteration, i.e. why is the first iteration so much slower?