How to profile different components in the forward pass?

Hi, I have the following model:

import time

import torch
import torch.nn as nn

class ToyNetwork(nn.Module):
	def __init__(self, embedding_dim, hidden_dim):
		super(ToyNetwork, self).__init__()
		self.lstm1 = nn.LSTM(embedding_dim, hidden_dim, 70)
		self.lstm2 = nn.LSTM(embedding_dim, hidden_dim, 70)
	
	def forward(self, inputs, hidden1, hidden2):
		start = time.monotonic()
		self.lstm1(inputs, hidden1)
		torch.cuda.synchronize()

		mid = time.monotonic()
		self.lstm2(inputs, hidden2)
		torch.cuda.synchronize()

		print(f"second network time: {time.monotonic() - mid}, first network time: {mid - start}")

second network time: 0.010582592338323593, first network time: 0.02081933245062828

I'm wondering why there is such a large difference between the time taken by the first LSTM and the time taken by the second one.

Thank you!

You need to synchronize before starting and before stopping every timer. Your code does not synchronize before the first timer starts, so the first measurement can include GPU work that was still pending when lstm1 was called.
I would also recommend timing multiple iterations and averaging the results, and running a few warmup iterations before the timed ones.
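
A minimal sketch of the manual approach described above: synchronize before each timer read, run some warmup iterations, then average over several timed iterations. The helper names (sync, time_module), the tensor sizes, and the iteration counts are assumptions chosen for illustration, not taken from the original post; the hidden state is left at its default (zeros), and the sketch falls back to the CPU when no GPU is available.

```python
import time

import torch
import torch.nn as nn


def sync(device):
    # CUDA kernels run asynchronously; wait for them before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()


def time_module(module, inputs, device, warmup=3, iters=10):
    # Warmup iterations absorb one-time costs (library initialization, allocations).
    for _ in range(warmup):
        module(inputs)
    sync(device)  # make sure nothing is pending before the timer starts
    start = time.perf_counter()
    for _ in range(iters):
        module(inputs)
    sync(device)  # make sure all timed work has finished before the timer stops
    return (time.perf_counter() - start) / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small sizes so the sketch runs quickly.
seq_len, batch, embedding_dim, hidden_dim, num_layers = 8, 4, 16, 16, 2
lstm1 = nn.LSTM(embedding_dim, hidden_dim, num_layers).to(device)
lstm2 = nn.LSTM(embedding_dim, hidden_dim, num_layers).to(device)
inputs = torch.randn(seq_len, batch, embedding_dim, device=device)

t1 = time_module(lstm1, inputs, device)
t2 = time_module(lstm2, inputs, device)
print(f"first network time: {t1:.6f} s, second network time: {t2:.6f} s")
```

With warmup and synchronization in place, two identically shaped LSTMs would be expected to report similar average times, although some run-to-run variation remains.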
The torch.utils.benchmark utility handles the synchronization and warmup automatically for you and is the recommended way to profile workloads.
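
As a rough sketch of how torch.utils.benchmark could be applied here, each LSTM can be wrapped in its own Timer. The tensor sizes, the labels, and the choice of 20 runs are assumptions for illustration only; the hidden state is left at its default, and the code falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small sizes so the sketch runs quickly.
seq_len, batch, embedding_dim, hidden_dim, num_layers = 8, 4, 16, 16, 2
lstm1 = nn.LSTM(embedding_dim, hidden_dim, num_layers).to(device)
lstm2 = nn.LSTM(embedding_dim, hidden_dim, num_layers).to(device)
inputs = torch.randn(seq_len, batch, embedding_dim, device=device)

# One Timer per component; `globals` exposes the objects named in `stmt`.
timer1 = benchmark.Timer(
    stmt="lstm(inputs)",
    globals={"lstm": lstm1, "inputs": inputs},
    label="LSTM forward",
    description="lstm1",
)
timer2 = benchmark.Timer(
    stmt="lstm(inputs)",
    globals={"lstm": lstm2, "inputs": inputs},
    label="LSTM forward",
    description="lstm2",
)

# timeit(n) runs the statement n times and returns a Measurement object.
m1 = timer1.timeit(20)
m2 = timer2.timeit(20)
print(m1)
print(m2)
print(f"mean per call: lstm1 {m1.mean:.6f} s, lstm2 {m2.mean:.6f} s")
```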