torch.jit.trace receives a fixed-size input to record operations. How do traced modules handle a different batch size?

We can convert PyTorch modules to TorchScript with torch.jit.trace(), and this trace function needs a fixed-size example input to record the operations of the module.
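For concreteness, here is a minimal sketch of tracing a toy CNN with a fixed example batch (the SmallCNN module and the input shape are hypothetical, just for illustration):

    import torch
    import torch.nn as nn

    # Hypothetical toy CNN, only for illustration; the real model will differ.
    class SmallCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(8, 10)

        def forward(self, x):
            x = self.pool(torch.relu(self.conv(x)))
            return self.fc(x.flatten(1))

    model = SmallCNN().eval()
    example_input = torch.randn(32, 3, 64, 64)      # fixed example batch of 32
    traced = torch.jit.trace(model, example_input)  # records the ops executed on this input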

However, the last batch can be smaller than the batch size if the number of samples is not a multiple of the batch size, e.g. 118 samples with batch size 32 => 32 / 32 / 32 / 22.
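A quick way to see this split is with a DataLoader over dummy data (a sketch; the sample shape is arbitrary):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # 118 dummy samples with batch size 32 -> batches of 32, 32, 32 and a last batch of 22
    dataset = TensorDataset(torch.randn(118, 3, 64, 64))
    loader = DataLoader(dataset, batch_size=32, drop_last=False)
    print([batch[0].shape[0] for batch in loader])  # [32, 32, 32, 22]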

So how does the traced module handle a different batch size? I thought it would raise an error, but the traced module could run inference on inputs with a different batch size.
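From what I can tell, the trace records the operations themselves, and ops like convolution, pooling, and linear layers accept any batch dimension, which would explain why the call below does not error. Reusing the hypothetical traced module from the sketch above:

    # Reusing the hypothetical traced module from the sketch above
    with torch.no_grad():
        out_full = traced(torch.randn(32, 3, 64, 64))  # same batch size as the trace
        out_last = traced(torch.randn(22, 3, 64, 64))  # smaller last batch also runs
    print(out_full.shape, out_last.shape)              # torch.Size([32, 10]) torch.Size([22, 10])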

Also, when I use a larger batch size, inference becomes slower. I am using a CNN model and usually run it on 201 samples.
I measured the model inference time with the code below and found that batch size 16 (6 seconds on average in my example) is faster than batch size 128 (8 seconds on average), while batch size 1 takes 15 seconds.

    import time
    import torch

    torch.cuda.synchronize()   # make sure pending GPU work is done before starting the timer
    start = time.perf_counter()
    with torch.no_grad():
        output = torch_model(sample_batch_tensor)
    torch.cuda.synchronize()   # wait for the forward pass to actually finish on the GPU
    stop = time.perf_counter()
    curr_time = stop - start

I am using CUDA parallel computation, so I expected a larger batch size to accelerate inference. I did not include model or data loading time in the measurement. Is this also related to tracing the module with a fixed-size input?
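In case it helps, here is a minimal sketch of how the comparison could be set up, with warm-up runs so CUDA initialization does not skew the numbers. The time_batch helper, input shape, and batch sizes below are hypothetical placeholders, and torch_model is assumed to already be on the GPU:

    import time
    import torch

    def time_batch(model, batch, n_warmup=5, n_runs=20):
        # Hypothetical helper: average GPU time per forward pass for one batch
        with torch.no_grad():
            for _ in range(n_warmup):   # warm-up so CUDA init / cuDNN autotuning don't skew timing
                model(batch)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(n_runs):
                model(batch)
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_runs

    # Assumed input shape and batch sizes; torch_model is assumed to be on the GPU already
    for bs in (1, 16, 128):
        batch = torch.randn(bs, 3, 64, 64, device="cuda")
        t = time_batch(torch_model, batch)
        print(f"batch size {bs:3d}: {t * 1e3:.2f} ms/batch, {t / bs * 1e3:.3f} ms/sample")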