Can't get model to train

poe · October 11, 2021, 1:52pm

The use case is the following:
Let's say there are 4 objects, each with 31 features including feature “Q”, and each feature of each object has 8760 values (dataset.dim = [4, 8760, 31]). I would like to train a neural network to estimate Q for any given object, as well as for new objects that have only 30 features (“Q” excluded).

That being said, the feature tensor should have 4*8760 records and 30 “columns”, and the target tensor 4*8760 records and 1 “column”.

>>> features.shape
torch.Size([35040, 30])
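A minimal sketch of how a [4, 8760, 31] tensor could be flattened into that feature/target layout. The random `dataset` and the column position `q_index` are assumptions for illustration; the post does not say where “Q” sits.

```python
import torch

# Hypothetical stand-in for the real data: 4 objects, 8760 values, 31 features.
dataset = torch.randn(4, 8760, 31)
q_index = 0  # assumed column position of "Q"; adjust to the real layout

# Collapse the object and time axes into a single record axis.
flat = dataset.reshape(-1, dataset.shape[-1])   # [35040, 31]

# Use Q as the target and every other column as a feature.
keep = [i for i in range(flat.shape[1]) if i != q_index]
features = flat[:, keep]                        # [35040, 30]
target = flat[:, q_index]                       # [35040]

print(features.shape, target.shape)
```

Here the target is kept one-dimensional; `target.unsqueeze(1)` would give the [35040, 1] “column” form instead.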

Splitting the dataset into training and validation sets creates a feature training tensor of [3*8760, 30] and a target training tensor of [3*8760, 1]. Regarding the dimensions, I would assume that either the target tensor is [3*8760] and y_pred is “squeezed” to one dimension, or the target tensor is [3*8760, 1] and no squeezing is required. Is that assumption correct?
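A small sketch of the two shape options, using the same loss class as the training code in this post. The random `y_pred` and the 0/1 `target` values are made up for illustration. Both layouts give the same loss as long as prediction and target shapes agree.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
y_pred = torch.randn(10, 1)                   # model output: one column per sample
target = torch.randint(0, 2, (10,)).float()   # hypothetical 0/1 targets, shape [10]

# Option A: squeeze the prediction to [10] to match a 1-D target.
loss_a = criterion(y_pred.squeeze(1), target)

# Option B: keep the prediction as [10, 1] and give the target a column axis.
loss_b = criterion(y_pred, target.unsqueeze(1))

print(torch.allclose(loss_a, loss_b))

# Mixing the two layouts ([10, 1] against [10]) is rejected by this loss.
try:
    criterion(y_pred, target)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

`squeeze(1)` is used rather than a bare `squeeze()` so that a batch containing a single sample keeps its batch axis.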

The initialisation of the model does not require the batch size; the input layer only needs the number of features (30), and that's it.

For each batch of, e.g., 10, the first 10 records [10, 30] out of [3*8760, 30] are selected from the feature tensor and passed to the input layer.
I wonder how the samples of each batch are processed. The input layer has one neuron for each feature, hence the samples would need to be processed sequentially, but how? There is no loop in the forward function…
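A short sketch of why no loop is needed: `nn.Linear` applies the same weights to every row of its input in one matrix multiplication, so the batch axis is simply carried along. The layer sizes match the first layer of the model in this post; the random batch is made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(30, 60)
batch = torch.randn(10, 30)   # 10 samples, 30 features each

# One call handles all 10 rows at once.
out_batched = layer(batch)                                # [10, 60]

# Equivalent explicit loop over the samples, for comparison only.
out_looped = torch.stack([layer(row) for row in batch])   # [10, 60]

# The same result written as the underlying matrix formula.
out_manual = batch @ layer.weight.T + layer.bias

print(out_batched.shape)
print(torch.allclose(out_batched, out_looped, atol=1e-5))
print(torch.allclose(out_batched, out_manual, atol=1e-5))
```

Elementwise functions such as `F.relu` behave the same way, so the whole forward pass works on [batch, features] tensors without any per-sample loop.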

The current model, which produces the “nan” error on the third training iteration:

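import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):  # class name is a placeholder; the post omits the header
    def __init__(self, numberOfFeatures):
        super().__init__()
        self.l0 = nn.Linear(numberOfFeatures, 60)
        self.l1 = nn.Linear(60, 90)
        self.l2 = nn.Linear(90, 30)
        self.l3 = nn.Linear(30, 1)

    def forward(self, inputs):
        output = self.l0(inputs)  # no activation after the first layer, as posted
        output = F.relu(self.l1(output))
        output = F.relu(self.l2(output))
        return self.l3(output)    # raw output, shape [batch, 1]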

batch_size = 10
optimizer = torch.optim.SGD(model.parameters(), lr = 0.05) # lr=1e-3
criterion = nn.BCEWithLogitsLoss()

model.train()
epochs = 5000
errors = []
for epoch in range(epochs):
    optimizer.zero_grad() # sets gradients to zero
    for feature, target in trainset:
        y_pred = model.forward(feature.float())
        loss = criterion(y_pred.squeeze(), target.float())
        errors.append(loss.item())
        print('Epoch {}: train loss: {}'.format(epoch, loss.item()))
        loss.backward()
        optimizer.step()

>>> feature.shape
torch.Size([10, 30])
>>> target.shape
torch.Size([10])
# within for feature, target in trainset: loop

Epoch 0: train loss: 62.11237716674805
Epoch 0: train loss: -1679589769216.0
Epoch 0: train loss: nan
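One observation about the negative loss in this log: `nn.BCEWithLogitsLoss` is a binary classification loss that expects targets in [0, 1]. The post does not state the range of Q, so the following is only a sketch, under the assumption that the targets are unscaled continuous values, of how this loss can go negative. The logits and targets are made up.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([2.0, 3.0])

# Targets inside [0, 1]: the loss is non-negative.
loss_valid = criterion(logits, torch.tensor([1.0, 0.0]))

# Hypothetical unscaled regression-style targets: the loss goes negative,
# and its magnitude grows with the size of the targets and logits.
loss_unscaled = criterion(logits, torch.tensor([1000.0, 2500.0]))

print(loss_valid.item(), loss_unscaled.item())
```

If Q is indeed a continuous quantity, a regression loss such as `nn.MSELoss` may be a better fit, though that is a guess from the problem description rather than something the post confirms.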

>>> model.forward(feature.float())
tensor([[nan],
        [nan],
        [nan],
        [nan],
        [nan],
        [nan],
        [nan],
        [nan],
        [nan],
        [nan]], grad_fn=<AddmmBackward>)

Both the feature and target tensors contain float32 data.
A few reasons were provided here.

I will try normalization first, since there are large values in the data. (source)
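A minimal sketch of one common form of normalization (per-feature standardization), assuming the split sizes mentioned above. The tensors `train_x` and `val_x` are random stand-ins with deliberately large values, not the real data.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-ins for the real tensors, with deliberately large values.
train_x = torch.randn(26280, 30) * 1e4 + 5e3   # 3*8760 training records
val_x = torch.randn(8760, 30) * 1e4 + 5e3      # 1*8760 validation records

# Statistics come from the training split only, one value per feature column.
mean = train_x.mean(dim=0, keepdim=True)
std = train_x.std(dim=0, keepdim=True).clamp_min(1e-8)  # guard constant columns

train_x_norm = (train_x - mean) / std
val_x_norm = (val_x - mean) / std   # reuse the training statistics

print(train_x_norm.mean().item(), train_x_norm.std().item())
```

The same `mean` and `std` would also have to be applied to any new object before prediction. If the target is large as well, scaling it in the same way and inverting the scaling on the predictions is a possible extension.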