Collate function is not called in DataLoader

I have a collate function as below:


import torch

def collate_mydataset(samples):
    print("hello collate function")
    print(samples)

    # Each sample is (node_features, edge_indices, edge_features, graph_label, m).
    num_nodes_list = [data[0].size(0) for data in samples]
    max_num_nodes = max(num_nodes_list)
    num_edges_list = [data[2].size(0) for data in samples]
    max_num_edges = max(num_edges_list)

    features_list = [data[0] for data in samples]       # node features, [num_nodes, feat_dim]
    edge_indices_list = [data[1] for data in samples]   # edge indices, [2, num_edges]
    edge_features_list = [data[2] for data in samples]  # edge features, [num_edges, feat_dim]
    graph_labels_list = [data[3] for data in samples]
    m_list = [data[4] for data in samples]

    # Zero-pad node features up to the largest graph in the batch, then stack.
    features_padded = []
    for feature in features_list:
        num_nodes = feature.shape[0]
        if num_nodes < max_num_nodes:
            padding = torch.zeros((max_num_nodes - num_nodes, feature.shape[1]),
                                  dtype=feature.dtype)
            features_padded.append(torch.cat([feature, padding], 0))
        else:
            features_padded.append(feature)
    features = torch.stack(features_padded, dim=0)  # [batch, max_num_nodes, feat_dim]

    # Zero-pad edge indices along the edge dimension. Matching the dtype matters:
    # edge indices are typically torch.long, and concatenating them with float
    # zeros would silently promote the whole tensor to float.
    edge_indices_padded = []
    for edge_indices in edge_indices_list:
        num_edges = edge_indices.shape[1]
        if num_edges < max_num_edges:
            padding = torch.zeros((2, max_num_edges - num_edges),
                                  dtype=edge_indices.dtype)
            edge_indices_padded.append(torch.cat([edge_indices, padding], 1))
        else:
            edge_indices_padded.append(edge_indices)
    edge_indices = torch.stack(edge_indices_padded, dim=1)  # note dim=1: [2, batch, max_num_edges]

    # Zero-pad edge features up to the largest edge count in the batch, then stack.
    edge_features_padded = []
    for e_feature in edge_features_list:
        num_edges = e_feature.shape[0]
        if num_edges < max_num_edges:
            padding = torch.zeros((max_num_edges - num_edges, e_feature.shape[1]),
                                  dtype=e_feature.dtype)
            edge_features_padded.append(torch.cat([e_feature, padding], 0))
        else:
            edge_features_padded.append(e_feature)
    edge_features = torch.stack(edge_features_padded, dim=0)  # [batch, max_num_edges, feat_dim]

    graph_labels = torch.stack(graph_labels_list, dim=0)

    # Zero-pad m the same way as the node features.
    m_padded = []
    for m in m_list:
        num_nodes = m.shape[0]
        if num_nodes < max_num_nodes:
            padding = torch.zeros((max_num_nodes - num_nodes, m.shape[1]), dtype=m.dtype)
            m_padded.append(torch.cat([m, padding], 0))
        else:
            m_padded.append(m)
    m = torch.stack(m_padded, dim=0)

    return [features, edge_indices, edge_features, graph_labels, m]

When I pass two samples of my graph dataset, as in:

out = collate_mydataset(tu_dataset[0:2])

it prints the output and works fine.

But when I pass the function to the DataLoader:

from torch_geometric import loader

torch.manual_seed(42)
batch_size = 10
div_threshold = int(len(tu_dataset) * 0.8)
train_dataset = tu_dataset[:div_threshold]
test_dataset = tu_dataset[div_threshold:]

train_loader = loader.DataLoader(train_dataset, batch_size=batch_size,
                                 shuffle=False, collate_fn=collate_mydataset)

it doesn't even call the function and fails with:

stack expects each tensor to be equal size, but got [32, 9] at entry 0 and [15, 9] at entry 1

which in this case means it cannot stack the node features of two graphs (but if my function were actually called, this should not happen, because it pads the tensors before stacking them).

What could be the issue?

This seems to be expected, since the PyTorch Geometric DataLoader implementation will delete your custom collate_fn and replace it with its own Collater class, as seen here.
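
You can confirm this on the loader object itself: collate_fn is an ordinary attribute of torch.utils.data.DataLoader, so inspecting the train_loader built above should show PyG's Collater rather than the custom function (the exact repr below is illustrative):

print(train_loader.collate_fn)
# e.g. <torch_geometric.loader.dataloader.Collater object at 0x7f...>
print(train_loader.collate_fn is collate_mydataset)  # False: it was replaced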

Yes, the error directs me to the same part of the code, but I don't know how the collate function should be structured. The main question is really: what should I do so that my custom collate function still gets used instead of being replaced? (One workaround is sketched below.)
PS: in other resources (on text data, not graphs), they simply pass the custom collate function's name, as I have done here:

train_loader = loader.DataLoader(train_dataset, batch_size=batch_size,
                                 shuffle=False, collate_fn=collate_mydataset)
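
One workaround, as a sketch rather than an official fix: since this padding-based collate function doesn't rely on any PyG batching machinery, you can pass it to the plain torch.utils.data.DataLoader, which does honor collate_fn:

from torch.utils.data import DataLoader as TorchDataLoader

# The vanilla PyTorch DataLoader keeps the custom collate_fn instead of replacing it.
train_loader = TorchDataLoader(train_dataset, batch_size=batch_size,
                               shuffle=False, collate_fn=collate_mydataset)

Each batch is then the padded [features, edge_indices, edge_features, graph_labels, m] list returned by collate_mydataset.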

I don’t know why PyG replaces the custom collate_fn, but @rusty1s would know.

I know this is an old thread, but it is interesting because I stumbled upon the same problem and only found the answer here. At this point, what are the benefits of using the DataLoader from PyG over the DataLoader from torch?
@rusty1s
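
For what it's worth, the benefit is the batching strategy itself: PyG's Collater calls Batch.from_data_list, which merges the graphs of a batch into one large disconnected graph (node features are concatenated, edge_index is offset per graph, and a batch vector records which graph each node belongs to), so no padding is needed at all. A minimal sketch with two made-up toy graphs:

import torch
from torch_geometric.data import Batch, Data

# Two toy graphs of different sizes; no padding is required.
g1 = Data(x=torch.randn(3, 9), edge_index=torch.tensor([[0, 1], [1, 2]]))
g2 = Data(x=torch.randn(2, 9), edge_index=torch.tensor([[0], [1]]))

batch = Batch.from_data_list([g1, g2])  # what PyG's DataLoader does per batch
print(batch.x.shape)     # torch.Size([5, 9]) -- node features simply concatenated
print(batch.edge_index)  # tensor([[0, 1, 3], [1, 2, 4]]) -- g2's indices offset by 3
print(batch.batch)       # tensor([0, 0, 0, 1, 1]) -- node-to-graph assignment

If you really need dense, padded batches, PyG also provides a DenseDataLoader (used together with the ToDense transform), but the plain torch DataLoader with a custom collate_fn, as sketched above, works as well.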
