PyTorch Lightning with GPU

I am trying to follow the official doc Accelerator: GPU training — PyTorch Lightning 1.7.0dev documentation to use the GPU for training. There are basic, intermediate and advanced level tutorials in the doc. I am only following the basic one.

There are only two changes to be made according to the tutorial:
1st change: from

trainer = pl.Trainer(max_epochs=20)

to

trainer = pl.Trainer(max_epochs=20, accelerator='gpu', devices=1)

2nd change: call .type_as on every newly created tensor, e.g. from

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class GetContextVector(pl.LightningModule):
    def __init__(self):
        super(GetContextVector, self).__init__()

    def forward(self, memory, attentions):
        attentions = F.softmax(attentions, dim=-1)
        output = torch.tensor([])  # new tensor is created on the CPU by default
        .......

to

class GetContextVector(pl.LightningModule):
    def __init__(self):
        super(GetContextVector, self).__init__()

    def forward(self, memory, attentions):
        attentions = F.softmax(attentions, dim=-1)
        output = torch.tensor([]).type_as(attentions)  # match the device/dtype of attentions
        .......

These two are the only changes I made.
Then I decided to log the time taken to perform one training step and the time taken to perform one full epoch. I did the same for each validation step.

    def training_step(self, batch, batch_idx):  # batch contains both x and the y label
        tic = time.perf_counter()

        # at the start of each epoch, log how long the previous epoch took
        # (self.epochStartTime is initialised to None in __init__, not shown here)
        if batch_idx == 0:
            if self.epochStartTime is None:
                self.epochStartTime = time.perf_counter()
            else:
                a = time.perf_counter()
                wandb.log({"epochTime": a - self.epochStartTime})
                self.epochStartTime = time.perf_counter()

        ..... compute loss, backpropagation, etc.

        toc = time.perf_counter()
        wandb.log({"oneStepTime": toc - tic})
        return {'idx': batch_idx}
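
For reference, self.epochStartTime starts out as None in the module's __init__. A minimal sketch of just that part (the class name MyModule is a placeholder and the rest of my setup is omitted):

import time
import wandb
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # epoch timer used by training_step above; None means the first epoch has not started yet
        self.epochStartTime = None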

The results are:

Not using GPU:
one training step: 0.14 seconds
one validation step: 0.05 seconds
one epoch: 60 seconds

Using GPU:
one training step: 0.2 seconds
one validation step: 0.15 seconds
one epoch: 120 seconds

So using the GPU actually makes things a lot slower in this case.
Each epoch has 100 training steps and 100 validation steps, so the per-step slowdown only accounts for:
(0.2 - 0.14) x 100 + (0.15 - 0.05) x 100 = 16 seconds.
It seems that the training and validation steps are not the only computation being slowed down, because the epoch time increased by 60 seconds instead of 16.
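
One thing I am not sure about: CUDA kernels launch asynchronously, so plain time.perf_counter() readings might not line up with the actual GPU work. If that matters here, I assume the timed region would need a torch.cuda.synchronize() on either side, roughly like this sketch (the timed helper is just an illustration, not from the tutorial):

import time
import torch

def timed(fn, *args, **kwargs):
    # synchronize so the clock readings include all GPU work queued so far
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    tic = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    toc = time.perf_counter()
    return out, toc - tic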

What have I done wrong?