where 32 is the batch size, 64 is the sequence length, and 11 is the number of features. Each of my samples is 64 x 11 and has a label of 0 or 1.

I’d like to predict when a sequence has a label of “1”.

I’m trying to use a simple architecture with

conv1D → ReLU → flatten → linear → sigmoid.

For the Conv1D, since this is multivariate time-series prediction and each row in my data is one second, I think the number of in-channels should be the number of features, so that all of the features are processed concurrently. (My data has no spatial structure: it doesn’t matter whether a column is at index 0 or 9, unlike pixels in an image.)

I can’t decide how to “initialize” the Conv1D parameters. Currently I think the number of channels should be the number of features rather than 1, for the reason I just explained, but I’m unsure.

Secondly, should the loss function be BCELoss or something else? My labels are 0 or 1, and I want the model to output the probability of belonging to the class with label 1.

This is reasonable (for a very simple architecture), but get rid of the sigmoid() (see below).

Yes, this makes sense. Note that Conv1d expects its input to have
shape [nBatch, nChannels, length], so you will want to swap
the last and second-to-last dimensions of your data before feeding
it to Conv1d.
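
As a minimal sketch of that swap, assuming your batch arrives as a tensor of shape [32, 64, 11] as described in the question (the out_channels and kernel_size values here are illustrative placeholders, not recommendations):

```python
import torch
import torch.nn as nn

# Random stand-in for one batch: [nBatch, length, nFeatures]
batch = torch.randn(32, 64, 11)

# Swap the last two dimensions so features become channels:
# [nBatch, nChannels, length]
x = batch.transpose(1, 2)

# in_channels matches the number of features; the other two
# arguments are arbitrary choices for illustration.
conv = nn.Conv1d(in_channels=11, out_channels=16, kernel_size=3)
y = conv(x)

print(x.shape)  # torch.Size([32, 11, 64])
print(y.shape)  # torch.Size([32, 16, 62]), since 64 - 3 + 1 = 62
```

With no padding and a stride of 1, the output length shrinks by kernel_size - 1.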

When you instantiate Conv1d, pytorch initializes its weight and bias
randomly. Start with that, and only do something different if you have
a good reason.

[quote]
currently I think the number of channels should be the number of features and not 1, as the reason I just explained, but unsure of it.
[/quote]

Your in_channels should be your number of input features (11).

Assuming that you are asking about out_channels, that’s more of
a judgment call. Consider using something larger than in_channels
(and then let your final Linear figure out how to combine them back
together).
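
Putting the pieces together, here is one possible sketch of the conv1D → ReLU → flatten → linear stack with the sigmoid() left off. The values out_channels = 32 and kernel_size = 3 are assumptions picked only for illustration; the in_features of the Linear follows from them.

```python
import torch
import torch.nn as nn

n_features, seq_len = 11, 64
out_channels, kernel_size = 32, 3  # illustrative choices only

# Length after the convolution (no padding, stride 1)
conv_len = seq_len - kernel_size + 1

model = nn.Sequential(
    nn.Conv1d(n_features, out_channels, kernel_size),
    nn.ReLU(),
    nn.Flatten(),                            # [nBatch, out_channels * conv_len]
    nn.Linear(out_channels * conv_len, 1),   # one logit per sample, no sigmoid()
)

x = torch.randn(32, n_features, seq_len)     # already in [nBatch, nChannels, length]
logits = model(x)
print(logits.shape)  # torch.Size([32, 1])
```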

You have a binary-classification problem, so binary cross entropy is
appropriate. However, for reasons of numerical stability, you should
use BCEWithLogitsLoss and feed it the raw output of your final Linear,
with no sigmoid() in between (BCEWithLogitsLoss computes log_sigmoid()
internally).
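
To illustrate the numerical-stability point, here is a small sketch comparing the two losses on a single, deliberately extreme logit. The value 50.0 is an arbitrary choice made so that sigmoid() saturates in float32; it is not taken from your data.

```python
import torch
import torch.nn as nn

logit = torch.tensor([50.0])   # a confidently wrong prediction...
target = torch.tensor([0.0])   # ...for a sample whose label is 0

# Works directly on the logit, so nothing saturates.
stable = nn.BCEWithLogitsLoss()(logit, target)

# sigmoid(50.0) rounds to exactly 1.0 in float32, so BCELoss
# can no longer tell how wrong the prediction is.
naive = nn.BCELoss()(torch.sigmoid(logit), target)

print(stable.item())  # approximately 50.0
print(naive.item())   # a different value, because the probability saturated
```

The two results should agree mathematically; the gap between them comes purely from floating-point round-off in the sigmoid().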

Without the final sigmoid(), your model will predict the logit that
corresponds to your desired probability, and logits are the correct
input to BCEWithLogitsLoss. If you need the actual probabilities (and
you probably don’t), convert your logits to probabilities by passing
them through sigmoid(), but do this after you compute your loss
function (and detached from the computation graph).
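
A single training step along these lines might look like the sketch below. The optimizer, learning rate, layer sizes, and random data are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Compact version of the architecture; sizes are illustrative.
model = nn.Sequential(
    nn.Conv1d(11, 16, 3), nn.ReLU(), nn.Flatten(), nn.Linear(16 * 62, 1)
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 11, 64)                  # [nBatch, nChannels, length]
labels = torch.randint(0, 2, (32,)).float()  # targets must be floating point

logits = model(x).squeeze(1)                 # [32], same shape as labels
loss = loss_fn(logits, labels)               # logits go straight into the loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the targets are converted to float and the logits are squeezed to match their shape, since BCEWithLogitsLoss expects input and target of the same shape.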

(As an aside, please don’t post screenshots of textual information. It breaks
accessibility, searchability, and copy-paste.)

Quoting from my previous post:

What I meant by this is

Pass the logits (no sigmoid()) to your loss function. If you then need
to convert the logits to probabilities, do so (by applying sigmoid()) after
calling your loss function.

Since you won’t be backpropagating through the sigmoid() call (because
it comes after your loss function), you can gain a little efficiency by turning
off autograd tracking by using .detach() or by wrapping the sigmoid()
(and any subsequent processing) in a with torch.no_grad(): block.
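
Both options can be sketched as follows. The logits tensor here is a random stand-in for your model's output, and the 0.5 threshold is just a common default, not something your problem requires.

```python
import torch

# Stand-in for the model output; in practice this comes from model(x)
# and has already been passed to the loss function.
logits = torch.randn(32, requires_grad=True)

# Option 1: detach first, then apply sigmoid()
probs_a = torch.sigmoid(logits.detach())

# Option 2: wrap the post-processing in a no_grad block
with torch.no_grad():
    probs_b = torch.sigmoid(logits)
    preds = (probs_b > 0.5).long()   # hard 0/1 predictions, if you want them

print(probs_a.requires_grad, probs_b.requires_grad)  # False False
```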

Logits contain the same information as probabilities (but are numerically
better behaved). What you want – at least for training purposes – is for
your model to give you a logit. You do not need a probability for calling
your loss function. If you need a probability for some other reason – and
ofttimes you don’t – take the logit that is output by your model and pass it
through a sigmoid().
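
To make the "same information" point concrete, here is a short sketch with a handful of made-up logit values. It shows that torch.logit() undoes sigmoid(), and that thresholding the probability at 0.5 gives the same decision as thresholding the logit at 0.

```python
import torch

logits = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])  # made-up values

probs = torch.sigmoid(logits)     # logit -> probability
recovered = torch.logit(probs)    # probability -> logit (up to round-off)

# Classifying with p > 0.5 is the same as classifying with logit > 0
same_decision = torch.equal(probs > 0.5, logits > 0)

print(torch.allclose(recovered, logits, atol=1e-4))  # True
print(same_decision)                                 # True
```

So for a plain 0/1 prediction you can threshold the logit directly and skip the sigmoid() altogether.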