Vision Transformer for image classification

Hello! I’m implementing a ViT for (76, 50, 50, 116) images, where the possible classes are [0, 1, 2, 3]. When I run the script with a reduced dataset (e.g. 5 images) it works, but when I use the entire dataset I get "IndexError: too many indices for tensor of dimension 1" in the following part:

import torch
import torch.nn as nn


class MyMSA(nn.Module):
    def __init__(self, dim, n_heads=2):
        super(MyMSA, self).__init__()
        self.dim = dim
        self.n_heads = n_heads
        assert dim % n_heads == 0, f"Can't divide dimension {dim} into {n_heads} heads"
        d_heads = dim // n_heads
        # one independent linear projection per head for q, k and v
        self.q_map = nn.ModuleList([nn.Linear(d_heads, d_heads) for _ in range(self.n_heads)])
        self.k_map = nn.ModuleList([nn.Linear(d_heads, d_heads) for _ in range(self.n_heads)])
        self.v_map = nn.ModuleList([nn.Linear(d_heads, d_heads) for _ in range(self.n_heads)])
        self.d_heads = d_heads
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, sequences):
        # sequences is expected to be (batch, tokens, dim)
        result = []
        for sequence in sequences:                 # sequence: (tokens, dim)
            seq_result = []
            for head in range(self.n_heads):
                q_map = self.q_map[head]
                k_map = self.k_map[head]
                v_map = self.v_map[head]
                # take this head's slice of the embedding dimension
                seq = sequence[:, head * self.d_heads: (head + 1) * self.d_heads]  # Error here
                q, k, v = q_map(seq), k_map(seq), v_map(seq)
                attention = self.softmax(q @ k.T / (self.d_heads ** 0.5))
                seq_result.append(attention @ v)
            result.append(torch.hstack(seq_result))   # concatenate heads: (tokens, dim)
        return torch.cat([torch.unsqueeze(r, dim=0) for r in result])  # (batch, tokens, dim)

I use a batch size of 4. Does anyone have an idea of the cause and how to solve it?
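For context, here is a minimal shape check of the module above; the sizes are only illustrative, using dim=4 and 2501 tokens as in my setup:

import torch

msa = MyMSA(dim=4, n_heads=2)
dummy = torch.randn(4, 2501, 4)   # (batch, tokens, dim)
out = msa(dummy)
print(out.shape)                  # torch.Size([4, 2501, 4])

With a 3D input like this the slicing works; the error message suggests that at some point a sequence is only 1D.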

Could you check which line of code raises the error and post the shapes of all used tensors, please?

The error is at:

seq = sequence[:, head * self.d_heads: (head + 1) * self.d_heads]

The images are (76, 116, 50, 50) and the labels are (76,). During the training loop the images and the labels become (4, 116, 50, 50) and (4,) respectively, while y_predicted is (4, 4). q, k, v and seq are (2501, 2), and sequence is (2501, 4).
As the loss I used nn.CrossEntropyLoss, so I transformed the labels this way:

labelSet = torch.from_numpy(labels['a']).squeeze().long() - 2  # from (2,3,4,5) to (0,1,2,3)
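For clarity, a small self-contained sketch of that shift together with nn.CrossEntropyLoss (the raw tensor below is just a stand-in for labels['a']):

import torch
import torch.nn as nn

raw = torch.tensor([2, 3, 4, 5, 2])     # stand-in for labels['a']
labelSet = raw.long() - 2               # class indices 0..3
criterion = nn.CrossEntropyLoss()

logits = torch.randn(5, 4)              # same layout as y_predicted: (batch, num_classes)
loss = criterion(logits, labelSet)      # targets must be Long class indices in [0, 3]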

Also, I noticed that with, for example, 8 images the error appears, while with 12 images it works.
To work around it I tried:

sequence = sequence.reshape(-1, 4)

But then I get a "RuntimeError: CUDA out of memory".

Thanks for posting the line of code. Based on the error message it seems sequence has a single dimension during training, for a so far unknown reason.

Also, this sounds a bit weird: it seems you are changing the batch size from 76 to 4 at one point. Could you explain if this is expected and how to interpret the change in the number of samples in the batch?

Thank you! Actually, the images and the labels are 76 during data loading (the entire dataset); then I used torch.utils.data.DataLoader with a batch_size of 4, so during training it takes 4 images at a time. I think it should be OK. Another thing is that I used 3 convolutional layers in the model to reduce the number of channels from 116 to 10.
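Roughly like this, where the tensors are random stand-ins for the real data:

import torch
from torch.utils.data import TensorDataset, DataLoader

images = torch.randn(76, 116, 50, 50)    # stand-in for the real images
labelSet = torch.randint(0, 4, (76,))    # stand-in for the shifted labels

dataset = TensorDataset(images, labelSet)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for x, y in loader:
    print(x.shape, y.shape)              # (4, 116, 50, 50) and (4,) for each full batch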

Ah OK, yes, the batch size is expected to be smaller; I misunderstood the post, thinking 76 was already the batch size.
Could you print the shape of sequence during the training and see how it changes?
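E.g. a temporary print at the top of MyMSA.forward (just for debugging, to be removed afterwards) should show when and where the shapes degenerate:

def forward(self, sequences):
    print(sequences.shape)           # expected: (batch, tokens, dim)
    result = []
    for sequence in sequences:
        print(sequence.shape)        # expected: (tokens, dim)
        ...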

Yes, indeed sequence changes from (2501, 4) to (4,) during training. I don’t understand why.

Based on your code you are passing sequences to your model’s forward method and are then iterating over this object, so check where sequences comes from and how its shape is defined.

Actually, I made a really silly mistake: I set a training set size that is not a multiple of the batch_size. You helped me get to the critical problem, so thank you very much!
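As an aside, another common safeguard, as an alternative to resizing the training set (not what I did above), is to pass drop_last=True to the DataLoader; reusing the dataset from the sketch above:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)
# any incomplete final batch is skipped instead of being yielded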
