How can I change the shape of my train_mask?

I’m performing node classification using the WikipediaNetwork dataset. The problem is that the shape of train_mask is [2277, 10], but I want a shape of [2277].
The code I’m using is the following:

squirrel_dataset = WikipediaNetwork(root='data/WikipediaNetwork', name='squirrel', transform=None)
squirrel_data = squirrel_dataset[0]
def train():
    model.train()
    data = model.data
    optimizer.zero_grad()  # Clear gradients.
    out = model(data.x, data.edge_index)  # Perform a single forward pass.
    loss = criterion(out[data.train_mask], data.y[data.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

The error that I obtain is the following one:

The shape of the mask [2277, 10] at index 1 does not match the shape of the indexed tensor [2277, 5] at index 1

How can I solve this issue?


Can you share the definition for train_mask?

The train mask is automatically generated when I call these lines; it is part of the Data object created by the WikipediaNetwork class.

squirrel_dataset = WikipediaNetwork(root='data/WikipediaNetwork', name='squirrel', transform=None)
squirrel_data = squirrel_dataset[0]

Suppose we have some tensor of arbitrary number of dimensions for a dummy dataset:

import torch

data_size = 30000
A = torch.rand((data_size, 200, 5, 20))

Now suppose we create a mask vector of our data_size:

import numpy as np

b = np.arange(data_size)

Randomize it with:

rng = np.random.default_rng()
rng.shuffle(b)

And split it into train and test index vectors with an 85%/15% split:

a_train = b[:round(0.85 * np.shape(b)[0])]
a_test = b[round(0.85 * np.shape(b)[0]):]

You could then get your training set by:

train_data = A[a_train, :, :, :]
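
And, analogously, the held-out test set. As a small follow-up sketch (the printed shapes follow from the data_size = 30000 and 85%/15% split above):

test_data = A[a_test]  # equivalent to A[a_test, :, :, :]

print(train_data.shape)  # torch.Size([25500, 200, 5, 20])
print(test_data.shape)   # torch.Size([4500, 200, 5, 20])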

In other words, for some reason your mask is not only over the first dimension, which is the problem. Maybe they used an older version of PyTorch when developing their scripts.

Without seeing their definition for this object call, it’s not possible to identify the error.

This is the source code that I have: torch_geometric.datasets.webkb — pytorch_geometric documentation
I hope it can be useful for solving the issue.

Your code seems to have loaded the dataset and specified what data you want, but I do not see where you put your data into a DataLoader. Please see here:

https://pytorch-geometric.readthedocs.io/en/latest/modules/loader.html#torch_geometric.loader.DataLoader

Copied from their example:

from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

data = Planetoid(path, name='Cora')[0]

loader = NeighborLoader(
    data,
    # Sample 30 neighbors for each node for 2 iterations
    num_neighbors=[30] * 2,
    # Use a batch size of 128 for sampling training nodes
    batch_size=128,
    input_nodes=data.train_mask,
)

sampled_data = next(iter(loader))
print(sampled_data.batch_size)
>>> 128

Hi Antonio! 👋
I ran into the same problem with the WebKB dataset.

  • So, what’s happening here is that, instead of just one split, the authors of the dataset provide ten random splits into train, validation, and test nodes. If you print data.train_mask, you’ll see it’s a boolean tensor of shape [num_nodes, 10]. Taking data.train_mask[:, 0] gives you the mask for the 0th random split (and likewise for the val and test masks). The multiple splits are just there for randomization.
  • So, you can do data.train_mask = data.train_mask[:, 0]. Similarly for the val and test masks.
  • Or, you can iterate over all the splits, data.train_mask[:, 0], data.train_mask[:, 1], and so on, and average the metrics you get over the splits; see the sketch after this list.
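
For concreteness, here is a minimal sketch of both options. It assumes the squirrel_dataset and model objects from the earlier snippets, and for brevity the averaging loop reuses one trained model; in a full experiment you would retrain the model on each split:

import torch

data = squirrel_dataset[0]

# Option 1: keep a single split (the 0th one), so each mask becomes a 1-D [2277] tensor.
train_mask = data.train_mask[:, 0]
val_mask = data.val_mask[:, 0]
test_mask = data.test_mask[:, 0]

# Option 2: evaluate on every split and average the metric over the ten splits.
model.eval()
accs = []
with torch.no_grad():
    pred = model(data.x, data.edge_index).argmax(dim=-1)
    for split in range(data.train_mask.size(1)):  # 10 splits
        mask = data.test_mask[:, split]
        accs.append((pred[mask] == data.y[mask]).float().mean().item())
mean_acc = sum(accs) / len(accs)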

Hope this helps🙂


This solves my problem, thank you! @Burouj_Armgaan
