How to use collate_fn()

juhyung · October 13, 2018, 1:16pm

Hi,
I am not sure what collate_fn does.
Is there an example that helps explain it?

ptrblck · October 13, 2018, 3:13pm

You can use your own collate_fn to process the list of samples to form a batch.
The batch argument is a list with all your samples. E.g. if you would like to return variable-sized data, have a look at this thread.

Brando_Miranda · July 17, 2019, 5:25pm

As ptrblck said, collate_fn is a callable that turns the list of samples into the batch your DataLoader returns. E.g.:

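    def collate_fn(batch):
        # batch is the list of samples drawn for this batch
        print(type(batch))
        print(len(batch))
        return batch  # return something, otherwise the loader yields None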

In my case, with batch_size=4, batch is a list of four samples. Let's check:

<class 'list'>
4

Paulo_Mann · July 18, 2019, 1:06am

I recently answered a similar question in another post. Basically, collate_fn receives a list of tuples if the __getitem__ method of your Dataset subclass returns a tuple, or a plain list of elements if it returns a single element. Its main purpose is to let you build the batch without implementing the batching logic by hand. Think of it as glue: you specify how examples stick together in a batch. If you don’t provide one, PyTorch simply puts batch_size examples together, much as torch.stack would (not exactly, but it is about that simple).
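To make the description above concrete, here is a minimal sketch under stated assumptions: `PairDataset` and `inspect_collate` are made-up names for illustration, not part of PyTorch. It shows a custom collate_fn receiving a list of `(x, y)` tuples, and compares the result with what the DataLoader produces when no collate_fn is given.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PairDataset(Dataset):
    """Hypothetical toy dataset: each sample is a (features, label) tuple."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.full((3,), float(idx)), idx


def inspect_collate(batch):
    # batch is a plain Python list with one __getitem__ result per sample
    assert isinstance(batch, list) and isinstance(batch[0], tuple)
    xs, ys = zip(*batch)  # regroup into (all features, all labels)
    return torch.stack(xs), torch.tensor(ys)


dataset = PairDataset()

# Custom collate_fn: we build the batch ourselves
custom_x, custom_y = next(iter(DataLoader(dataset, batch_size=4, collate_fn=inspect_collate)))

# No collate_fn: the default collation stacks the samples for us
default_x, default_y = next(iter(DataLoader(dataset, batch_size=4)))

print(custom_x.shape, custom_y)
print(default_x.shape, default_y)
```

For fixed-size samples like these, the two loaders should produce the same batch, so a custom collate_fn is only needed when the default stacking does not fit your data.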

The code I wrote in this post should help you understand it. It pads sequences with 0 up to the maximum sequence length in the batch. That is why I need a collate_fn: standard batching (simply using torch.stack) won’t work in my case, so I have to pad the variable-length sequences to the same size manually before creating the batch.
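The linked code is not reproduced in this thread. As a rough sketch of the idea only (not the original author's implementation), a manual zero-padding collate_fn could look like the following. The `sequences` data and the `pad_collate` name are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical data: four 1-D sequences of different lengths
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]),
             torch.tensor([5, 6]), torch.tensor([7, 8, 9, 10])]


def pad_collate(batch):
    # Find the longest sequence in this batch only
    max_len = max(seq.size(0) for seq in batch)
    # Start from zeros so every unused position is padding
    padded = torch.zeros(len(batch), max_len, dtype=batch[0].dtype)
    for i, seq in enumerate(batch):
        padded[i, :seq.size(0)] = seq
    # Keep the true lengths so the padding can be ignored later
    lengths = torch.tensor([seq.size(0) for seq in batch])
    return padded, lengths


# A plain list works as a map-style dataset (it has __getitem__ and __len__)
loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate)
batches = list(loader)
for padded, lengths in batches:
    print(padded, lengths)
```

Note that the padded width is computed per batch, so different batches can have different widths.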

Ximing_Dong11104 · July 24, 2021, 8:45pm

Where does the parameter ‘batch’ come from?

victorvargass · August 3, 2021, 2:43pm

The batch argument is supplied implicitly by the DataLoader.
You just have to pass your collate function to the DataLoader’s collate_fn parameter.

Here’s an example:

    from torch.utils.data import DataLoader

    loader_collate = DataLoader(
        dataset, shuffle=True, batch_size=5, collate_fn=collate_fn)

RachelShalom · August 10, 2021, 1:58pm

Very cool example, @Paulo_Mann. It is also possible to do it with PyTorch’s pad_sequence function (which I am not sure was there in 2018 :)).
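A minimal sketch of that alternative, assuming the function meant here is `torch.nn.utils.rnn.pad_sequence`. The `sequences` data and the `pad_sequence_collate` name are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical data: variable-length 1-D sequences
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]


def pad_sequence_collate(batch):
    # batch_first=True gives shape (batch, max_len); shorter rows are filled with 0
    return pad_sequence(batch, batch_first=True, padding_value=0)


loader = DataLoader(sequences, batch_size=3, collate_fn=pad_sequence_collate)
padded = next(iter(loader))
print(padded)
```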

knosing · November 24, 2022, 7:04am

Thanks, this helped me. However, I’m still wondering what gets passed in the batch parameter of collate_fn?

JuyiLin · October 19, 2023, 1:17pm

I am wondering when collate_fn is called.
With for batch in loader:, is collate_fn called once or n times?

chenqiyuan1012 · April 23, 2024, 1:55am

collate_fn is called once for every batch.
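One way to check this yourself is to record each call, as in the small sketch below. The `calls` list and `counting_collate` name are hypothetical, and the sketch assumes the default num_workers=0, since with worker processes the calls would happen outside the main process and this list would not see them.

```python
import torch
from torch.utils.data import DataLoader

calls = []  # records the size of each batch collate_fn receives


def counting_collate(batch):
    calls.append(len(batch))
    return torch.tensor(batch)


# Hypothetical dataset: ten integers, so batch_size=4 gives batches of 4, 4 and 2
loader = DataLoader(list(range(10)), batch_size=4, collate_fn=counting_collate)

for batch in loader:
    pass

print(calls)  # one entry per batch
```

The number of recorded calls should match len(loader), i.e. the number of batches in one pass over the data.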