Graph neural networks for node classification

Hi everyone,

I am using a GCN model to perform node classification.
The GCN implementation comes from the following notebook: https://colab.research.google.com/github/zaidalyafeai/Notebooks/blob/master/Deep_GCN_Spam.ipynb#scrollTo=Hhabp4QvoP6V
The issue is that the accuracy I obtain is about 0.22 and the model does not seem to learn from the graph data. Indeed, a random classifier would give an accuracy close to 0.2, since there are 5 classes in my data.

The implementation of the GCN model is:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SplineConv

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Six SplineConv layers; the last one maps to the 5 classes.
        self.conv1 = SplineConv(1, 16, dim=1, kernel_size=5)
        self.conv2 = SplineConv(16, 32, dim=1, kernel_size=5)
        self.conv3 = SplineConv(32, 64, dim=1, kernel_size=7)
        self.conv4 = SplineConv(64, 128, dim=1, kernel_size=7)
        self.conv5 = SplineConv(128, 128, dim=1, kernel_size=11)
        self.conv6 = SplineConv(128, 5, dim=1, kernel_size=11)
        self.dropout = 0.25

    def forward(self, batch):
        x, edge_index, edge_attr = batch.x, batch.edge_index, batch.edge_attr
        # batch = batch.batch
        x = F.elu(self.conv1(x, edge_index, edge_attr))
        x = self.conv2(x, edge_index, edge_attr)
        x = F.elu(self.conv3(x, edge_index, edge_attr))
        x = self.conv4(x, edge_index, edge_attr)
        x = F.elu(self.conv5(x, edge_index, edge_attr))
        x = self.conv6(x, edge_index, edge_attr)
        # x = pyt_geom.global_mean_pool(x, batch)
        # Use the intended dropout rate (the default would be p=0.5).
        x = F.dropout(x, p=self.dropout, training=self.training)
        output = F.softmax(x, dim=1)
        return output

The rest of the implementation is the same as the code in the linked notebook.
The graph data has the following format:
"Batch(batch=[43267], edge_attr=[475194, 1], edge_index=[2, 475194], x=[43267, 1], y=[43267])", where both the edge attributes and the node features are set to 1, because the only data I have is the edge list of the network.

Thank you for your help!

The code in the linked notebook uses F.nll_loss with F.log_softmax as the last activation in the model, which is the correct pairing.
Your model uses F.softmax instead, so could you change it to F.log_softmax and rerun the code, please?
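
For reference, here is a minimal sketch of the two equivalent setups (logits and target are made-up placeholder tensors):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)            # fake raw model outputs: 4 nodes, 5 classes
target = torch.tensor([0, 2, 1, 4])   # fake class indices

# What the notebook does: log_softmax in the model + nll_loss
loss_a = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Equivalent alternative: raw logits + cross_entropy (fuses both steps)
loss_b = F.cross_entropy(logits, target)

assert torch.allclose(loss_a, loss_b)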

Yes, thank you very much for pointing that out. However, my code still gives a very low accuracy and the results are hard to interpret.

I believe that the issue is in the way I pre-process the graph data. I have 5 graphs, and for each one I convert the strings in the edge list into unique integer identifiers using a dictionary. In the end, I feed the list into a DataLoader to obtain one large graph that contains the 5 subgraphs.

Also, I am not sure how to split the data into training and testing sets. Taking 80% for training and 20% for testing gives a testing accuracy of 0. Currently, I split the data as follows:

data.train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
data.train_mask[:int(0.8 * data.num_nodes)] = True  # train on the first 80% of nodes
data.test_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
data.test_mask[int(0.8 * data.num_nodes):] = True   # test on the remaining 20% of nodes
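
For comparison, a random split would presumably look something like the sketch below; since my nodes are ordered graph by graph, a sequential split can leave entire classes out of the training set:

# Sketch: random 80/20 node split instead of a sequential one.
perm = torch.randperm(data.num_nodes)
train_size = int(0.8 * data.num_nodes)

data.train_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
data.train_mask[perm[:train_size]] = True

data.test_mask = torch.zeros(data.num_nodes, dtype=torch.bool)
data.test_mask[perm[train_size:]] = True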

The pre-processing implementation for the 5 graphs is the following:

import numpy as np
import torch
from torch_geometric.data import Data, DataLoader

graph_list = []
label_index = 0

for graph in input_list:
    # Stack the two endpoint columns into a (2, num_edges) array.
    edges = np.vstack((np.array(graph["gene1"]), np.array(graph["gene2"])))
    edge_attr = torch.ones((edges.shape[1], 1), dtype=torch.float)

    # Map each unique node name to a consecutive integer index.
    node_names = np.unique(edges)
    number_of_nodes = node_names.shape[0]
    dictionary = dict(zip(node_names, range(number_of_nodes)))

    edge_row_1 = [dictionary[node] for node in edges[0]]
    edge_row_2 = [dictionary[node] for node in edges[1]]
    edge_index = torch.as_tensor(np.vstack((edge_row_1, edge_row_2)), dtype=torch.long)

    # Constant node features, since only the edge list is available.
    x = torch.ones((number_of_nodes, 1), dtype=torch.float)

    # All nodes of one graph share the same label.
    y = torch.full((number_of_nodes,), label_index, dtype=torch.long)
    label_index += 1

    graph_data = Data(x=x, y=y, edge_index=edge_index, edge_attr=edge_attr)
    graph_list.append(graph_data)

loader = DataLoader(graph_list, batch_size=5, shuffle=True)
batch = next(iter(loader))
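
To sanity-check the merge, a loop like this should report exactly one label per subgraph (batch.batch maps every node to its source graph):

# Each of the 5 subgraphs should carry a single label.
for i in range(batch.num_graphs):
    node_mask = batch.batch == i
    print(f"graph {i}: {int(node_mask.sum())} nodes, "
          f"labels {batch.y[node_mask].unique().tolist()}")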

Please let me know if you find anything else that is incorrect.
Thank you.

I’m not deeply familiar with graph networks, but an accuracy of 0% sounds wrong.
In the worst case, your model should at least reach chance-level accuracy. E.g., for a multi-class classification use case with 10 balanced classes, you should get roughly 10% accuracy.

Could you try to create a tiny training dataset and overfit it with your model?
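
Something along these lines (a rough sketch; model, optimizer, and batch stand in for your actual objects, and I’m assuming the model returns log-probabilities after the log_softmax fix):

# Overfitting sanity check: train on a small random subset of nodes and
# verify that the training accuracy approaches 100%.
idx = torch.randperm(batch.num_nodes)[:100]  # a small random node subset

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(batch)
    loss = F.nll_loss(out[idx], batch.y[idx])
    loss.backward()
    optimizer.step()

    acc = (out[idx].argmax(dim=1) == batch.y[idx]).float().mean()
    print(f"epoch {epoch}: loss {loss.item():.4f}, train acc {acc.item():.4f}")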

Okay I will try that. Thank you for your advice!

Hi Ptrblck,

I have tried to train the model on a smaller network dataset (5 classes), but I still get a testing accuracy of 0% (rounded to 0; there are a few correct predictions) and a training accuracy of 51%. In the first epochs both the training loss and the testing loss decrease, but after about 20 epochs the testing loss starts to increase again, leading to a testing accuracy of 0%. When I inspect the predictions, I find that predictions.unique() often returns only 2 or 3 labels instead of 5.
For instance, labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4] while predictions = [4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0], so predictions.unique() = [4, 0].
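
For reference, this is roughly how I inspect them (a sketch; output and data refer to my model output and the graph data):

# Confusion counts: rows are true classes, columns are predicted classes.
preds = output[data.test_mask].argmax(dim=1)
labels = data.y[data.test_mask]
confusion = torch.zeros(5, 5, dtype=torch.long)
for t, p in zip(labels.tolist(), preds.tolist()):
    confusion[t, p] += 1
print(confusion)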

Do you have any other idea of what could be wrong?
Thanks again for your help.

If your validation (or testing) loss increases while the training loss decreases, your model is overfitting and you would have to add more regularization.
Were you able to completely overfit the small dataset, i.e. did your model reach ~100% accuracy on the training set?

Yes, the divergence between the training loss and the testing loss looks like an overfitting scenario. However, the training accuracy was only 51%, so I guess the issue must come from the pre-processing of the data with the PyTorch Geometric DataLoader.
Thanks for your help.

Can someone please explain how the message-passing part of a GCN changes if I use it for edge classification instead of node classification? Apologies if that’s a basic question; I am new to PyTorch and graph neural networks.
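
From what I have read so far, the message passing itself stays the same and each edge is classified from the embeddings of its two endpoint nodes, but I am not sure. A rough sketch of my understanding (GCNConv, the layer sizes, and the linear head are just placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class EdgeClassifier(nn.Module):
    # Message passing is unchanged; only the classification head differs:
    # it operates on pairs of node embeddings, one pair per edge.
    def __init__(self, in_channels, hidden_channels, num_edge_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.edge_head = nn.Linear(2 * hidden_channels, num_edge_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)                   # node embeddings
        src, dst = edge_index                           # endpoints of every edge
        edge_feat = torch.cat([x[src], x[dst]], dim=1)  # one vector per edge
        return F.log_softmax(self.edge_head(edge_feat), dim=1)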