Training model with custom dataset

I have created a custom dataset, and now I am trying to use it to train a model. I’m working from this tutorial: PyTorch - Training a Convent from Scratch - Tutorialspoint, which doesn’t show what X is. I tried passing in my custom dataset, but the model expects a Tensor object, whereas my dataset is a list of (Tensor, str) tuples.

How can I feed this into the model cleanly? Bear in mind I’m very new to PyTorch, having just started using it today.

Based on their class definition, the input X should be batches of 2D vectors, i.e. tensors of size [batch_size, 2]. You should first check whether your dataset entries are 2D vectors.
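For example, a model along the lines of the tutorial looks roughly like this (a minimal sketch, not the tutorial’s exact code; the hidden and output sizes are just an example):

```python
import torch
import torch.nn as nn

# Sketch of a tutorial-style network: it expects X of shape [batch_size, 2]
# because the first layer has in_features = 2 (inputSize in the tutorial).
class Neural_Network(nn.Module):
    def __init__(self, inputSize=2, hiddenSize=3, outputSize=1):
        super().__init__()
        self.fc1 = nn.Linear(inputSize, hiddenSize)
        self.fc2 = nn.Linear(hiddenSize, outputSize)

    def forward(self, X):
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(X))))

X = torch.randn(4, 2)              # a batch of four 2D vectors
print(Neural_Network()(X).shape)   # torch.Size([4, 1])
```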

I think I’m rather confused at the moment.

I have a Dataset object. Each entry in the Dataset is a (Tensor, str) tuple. Should I change that to produce 2D vectors? I thought a Tensor was essentially a multidimensional vector, and that if I convert away from Tensors I’ll lose the benefits of using them?

From that I constructed DataLoader objects. Still figuring out how those work, but I think they’re used to load data in batches in parallel. Each has its own slice of the parent Dataset. How would I use this to train the model though? I seem to be missing that jump.
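From the examples I’ve seen so far, I’m guessing the training loop is supposed to iterate over the DataLoader one batch at a time, something like this (toy stand-ins below, not my actual dataset or model); is that the right idea?

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the snippet runs; these would be my own Dataset and model.
train_dataset = TensorDataset(torch.randn(100, 2), torch.randn(100, 1))
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

for epoch in range(10):
    for inputs, labels in train_loader:   # DataLoader yields one batch per iteration
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
```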

The model input should reflect your dataset. In the tutorial, the class Neural_Network is designed for 2D input vectors, hence inputSize = 2. In your dataset, if entries are (Tensor, str), I guess the input is the tensor and the label is the str, but in the tutorial the model’s output is a scalar.
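If the str is a class label rather than extra input, one common approach (just a guess at your setup, not something from the tutorial) is to map each distinct string to an integer index and give the model one output per class:

```python
import torch
import torch.nn as nn

# Hypothetical example; 'samples' stands in for your list of (Tensor, str) entries.
samples = [(torch.randn(8), "hello"), (torch.randn(8), "world")]

classes = sorted({label for _, label in samples})      # distinct str labels
class_to_idx = {c: i for i, c in enumerate(classes)}   # str -> integer index

# One output score per class, trained with CrossEntropyLoss, which expects
# integer class targets instead of the tutorial's single scalar output.
model = nn.Linear(8, len(classes))
criterion = nn.CrossEntropyLoss()

x = torch.stack([t for t, _ in samples])                  # shape [batch, 8]
y = torch.tensor([class_to_idx[s] for _, s in samples])   # shape [batch]
loss = criterion(model(x), y)
```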

Ideally I want both Tensor and str to be input. It’s possible I’m designing it poorly though.

I have noticed some examples returning (input, label), which I think is what you’re referring to. For now, I’ve changed the Dataset to return (Tensor input, scalar label), and am trying to get it to work. I’d still like to add the str as additional input, though, since it should help a lot with classification.
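Roughly, what I’m attempting now looks like this (a simplified sketch; the class name and the shapes are placeholders, not my real code):

```python
import torch
from torch.utils.data import Dataset

class AudioTextDataset(Dataset):
    """Returns (input tensor, integer label) pairs built from (Tensor, str) entries."""
    def __init__(self, pairs):                     # pairs: list of (Tensor, str)
        self.pairs = pairs
        labels = sorted({s for _, s in pairs})
        self.label_to_idx = {s: i for i, s in enumerate(labels)}

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        tensor, text = self.pairs[idx]
        return tensor, self.label_to_idx[text]     # scalar label instead of the str
```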

To explain the problem a bit, the Tensor input is the STFT of an audio stream and the str is the text of what is spoken in the audio stream. My initial goal is to learn what the spoken version of the text sounds like to be able to classify audio streams without knowing the text. I think that should be better than training on the audio alone, but I’m happy to be shown I’m wrong.