Conv1d: on cuda, increase batch size, forward slower

    import torch
    import torch.nn as nn
    import datetime
    m = nn.Conv1d(16, 512, 3, stride=2).cuda()
    input = torch.randn(20, 16, 24000).cuda()
    for _ in range(40):
        t1 = datetime.datetime.now()
        output = m(input)
        t2 = datetime.datetime.now()
        # total_seconds() gives the full elapsed time; .microseconds alone
        # would drop any whole seconds from the delta
        print((t2 - t1).total_seconds() * 1000, 'ms')

When the batch size is 20, the forward pass takes 12 ms.
When I increase the batch size to 60, the forward pass takes 26 ms.
Is this normal? I thought that on a GPU, as long as there is enough CUDA memory, increasing the batch size would not make the forward pass slower.

That’s not the case: a larger batch means moving more data around and doing more compute. Why would you expect the runtime to stay constant?

My understanding may be naive. I thought GPUs process the samples in a batch in parallel, so increasing the batch size would not increase the time per iteration, as long as it fits in GPU memory.

That assumption is wrong: CUDA kernels are written to utilize all available compute resources even at small batch sizes. Once the GPU is saturated, additional samples are extra work that has to be serialized, so the forward pass takes longer.
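One more caveat about the timing code itself: CUDA kernels launch asynchronously, so wrapping `m(input)` in host-side timestamps can measure only the launch overhead unless you call `torch.cuda.synchronize()` first. Below is a minimal benchmarking sketch along those lines; the `time_forward` helper is hypothetical, and it falls back to CPU (where the effect of batch size is the same, just slower overall) when no GPU is available:

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
m = nn.Conv1d(16, 512, 3, stride=2).to(device)

def time_forward(batch_size, iters=5):
    # hypothetical helper: average forward time in ms at a given batch size
    x = torch.randn(batch_size, 16, 24000, device=device)
    for _ in range(2):  # warm-up so one-time setup cost is excluded
        m(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for pending async kernels
    t0 = time.perf_counter()
    for _ in range(iters):
        m(x)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure the timed kernels finished
    return (time.perf_counter() - t0) / iters * 1000

results = {bs: time_forward(bs) for bs in (20, 60)}
for bs, ms in results.items():
    print(f"batch {bs}: {ms:.2f} ms")
```

With synchronization in place, the measured time should grow roughly with the batch size once the device is saturated, which matches the 12 ms vs. 26 ms numbers above.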

Got it, thank you ~!