Import video in the form of a NumPy array in PyTorch

You could split the outputs into a regression problem (mouse coordinates) and a classification problem (click/no-click).

Each output should be passed to its appropriate loss function.
Here is a very simple code example you could use as a starting point:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2)
        # 6 channels, and the 24x24 input is 12x12 after pooling
        self.fc1 = nn.Linear(6*12*12, 20)
        
        self.fc2a = nn.Linear(20, 2)  # Regression head: (x, y) mouse coordinates
        self.fc2b = nn.Linear(20, 2)  # Classification head: no-click / click
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        
        x1 = self.fc2a(x)
        x2 = F.log_softmax(self.fc2b(x), dim=1)
        return x1, x2


model = MyModel()
criterion1 = nn.MSELoss()
criterion2 = nn.NLLLoss()

x = torch.randn(1, 1, 24, 24)
target1 = torch.empty(1, 2).random_(2000)  # Dummy coordinates in [0, 1999]
target2 = torch.empty(1, dtype=torch.long).random_(2)  # Dummy class index: 0 or 1

output1, output2 = model(x)
loss1 = criterion1(output1, target1) / 2000**2  # Scale MSE by the squared coordinate range
loss2 = criterion2(output2, target2)
loss = loss1 + loss2
loss.backward()
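
To feed real video frames instead of the random `x`, the NumPy array has to be converted to a float tensor with a channel dimension. A minimal sketch, assuming a grayscale clip stored as a `uint8` array of shape `(frames, 24, 24)`; the name `video` and its random contents are placeholders, not something from your setup:

```python
import numpy as np
import torch

# Hypothetical stand-in for a decoded grayscale clip: (frames, height, width), uint8
video = np.random.randint(0, 256, size=(8, 24, 24), dtype=np.uint8)

# from_numpy shares memory with the array; .float() creates a float32 copy
frames = torch.from_numpy(video).float() / 255.0  # scale pixels to [0, 1]
frames = frames.unsqueeze(1)  # add channel dim -> (frames, 1, height, width)

print(frames.shape, frames.dtype)
```

Each frame is then one sample in the batch, so `model(frames)` would return two outputs of shape `(8, 2)`.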

There are several ways to deal with your problem, and this is just one possible approach.
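
One such alternative, as a rough sketch only: since click/no-click is binary, the classification head could emit a single logit trained with `nn.BCEWithLogitsLoss` instead of two log-probabilities with `nn.NLLLoss`. The tensors below are random placeholders, and `click_head` is a hypothetical name:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the shared features and the click labels
features = torch.randn(4, 20)  # e.g. the output of fc1 for a batch of 4
click_target = torch.randint(0, 2, (4, 1)).float()  # 1.0 = click, 0.0 = no click

click_head = nn.Linear(20, 1)  # one logit per sample
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

logits = click_head(features)
loss = criterion(logits, click_target)
loss.backward()

# At inference time, the sigmoid turns the logit into a click probability
click_prob = torch.sigmoid(logits)
```
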
Let me know if this works for you.