I want to construct a neural network that takes an m-dim vector and outputs an n-dim vector. I have 100 training examples. To prepare my dataset, should I make an array/tensor of dimension 100 by m or m by 100 for PyTorch? In other words, I want to know whether PyTorch treats one data point as a row vector or a column vector. I guess this corresponds to whether each layer computes xW+b or Wx+b.

The input to nn.Linear is expected to be [batch_size, features] (or [batch_size, *, features] for more advanced use cases), so your number of samples should be stored in dim0.
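
To make the shape convention concrete, here is a minimal sketch. The sizes m = 4 and n = 2 are placeholders I picked for illustration, not values from the question. It shows that nn.Linear accepts a [100, m] tensor and that each row is transformed as x @ W.T + b, with the weight stored as [out_features, in_features].

```python
import torch
from torch import nn

# Hypothetical sizes, chosen only for illustration
m, n = 4, 2
num_samples = 100

torch.manual_seed(0)
X = torch.randn(num_samples, m)   # one sample per row: [batch_size, features]
layer = nn.Linear(m, n)

out = layer(X)
print(out.shape)            # torch.Size([100, 2])
print(layer.weight.shape)   # torch.Size([2, 4]), i.e. [out_features, in_features]

# nn.Linear applies x @ W.T + b to every row of the input
manual = X @ layer.weight.T + layer.bias
print(torch.allclose(out, manual, atol=1e-5))
```

So with torch.nn the dataset would be 100 by m, and the "xW+b" form from the question is the closer match, with W stored transposed.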

If you are building your network from scratch (i.e. not using torch.nn), one way to verify how to represent training examples is to compare a vectorized and a non-vectorized calculation of the loss and its gradients. The code below indicates that storing training examples in the columns of the input tensor, in conjunction with a Wx+b activation calculation, is consistent: both calculations should print the same loss and the same gradients.

"""
Feed forward neural network with 2 inputs and 1 output. Single hidden layer with 3 neurons, tanh activation, no bias.
"""
import torch
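
def forward(Weights1, Weights2, X):
    """
    Forward function for output from NN.
    """
    # X holds one training example per column: [features, batch]
    inp = torch.as_tensor(X, dtype=torch.float)
    # Hidden layer: [3, 2] @ [2, batch] -> [3, batch]
    a1 = torch.tanh(torch.mm(Weights1, inp))
    # Output layer: [1, 3] @ [3, batch] -> [1, batch]
    a2 = torch.mm(Weights2, a1)
    return a2
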
# Randomly initialize the weight matrices for the hidden and output layers
W1 = torch.rand(3, 2, dtype=torch.float, requires_grad=True)
W2 = torch.rand(1, 3, dtype=torch.float, requires_grad=True)
# Define the Training Inputs: Features and Labels for m=3 training examples and 2 input features each.
m = 3
X = torch.tensor([[1,2,3],[0.1,0.2,0.3]])
Y = torch.tensor([[3,6,9]])
"""
Vectorized calculation: Pass X as a 2x3 tensor to the forward function. Calculate the loss for the batch, then backprop the loss
to calculate gradients w.r.t. the weights for the batch.
"""
# Compute the loss function
outputs = forward(W1, W2, X)
# Calculate mean square loss
mean_sqr_loss = (outputs - Y).pow(2).sum() / X.shape[1]
print(f"Vectorized Mean Square Error Loss for the batch = {mean_sqr_loss}")
mean_sqr_loss.backward()
print(f"Vectorized Grad of Loss Func w.r.t W1 for the batch = {W1.grad}")
print(f"Vectorized Grad of Loss Func w.r.t W2 for the batch = {W2.grad}")
"""
Non-vectorized calculation: Loop over the training examples, passing one column of X at a time to the forward function.
Calculate loss for each training example, then sum and divide by number of examples to get loss for the batch.
Differentiate the expression for mean square loss to compute the gradient of loss w.r.t weights for each training example and the batch.
"""
total_sqr_loss = 0
w1_grad_list = []
w2_grad_list = []
for i in range(X.shape[1]):
    this_output = forward(W1, W2, X[:, i].reshape(2, 1))
    this_sqr_loss = (this_output - Y[:, i]).pow(2)
    total_sqr_loss += this_sqr_loss
    # Clear the gradients left on the weights by the previous backward() call
    # (W1.grad and W2.grad already exist because of the vectorized pass above)
    W1.grad.zero_()
    W2.grad.zero_()
    # Backprop to get gradients of the output w.r.t. weights for this training example
    this_output.backward()
    # Chain rule for the square loss: d(loss)/dW = 2 * (output - y) * d(output)/dW
    this_w1_grad = 2 * (this_output - Y[:, i]) * W1.grad
    this_w2_grad = 2 * (this_output - Y[:, i]) * W2.grad
    # Add to list of gradients for training examples.
    w1_grad_list.append(this_w1_grad)
    w2_grad_list.append(this_w2_grad)
nonvectorized_mean_sqr_loss = total_sqr_loss / X.shape[1]
# Sum across all training examples and divide by number of examples to get gradient of loss w.r.t weights for the batch.
loss_grad_w1 = sum(w1_grad_list) / X.shape[1]
loss_grad_w2 = sum(w2_grad_list) / X.shape[1]
print(f"Non-vectorized Mean Square Error Loss for the batch = {nonvectorized_mean_sqr_loss.item()}")
print(f"Non-vectorized Grad of Loss Func w.r.t W1 for the batch = {loss_grad_w1}")
print(f"Non-vectorized Grad of Loss Func w.r.t W2 for the batch = {loss_grad_w2}")
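
To tie this back to the [batch_size, features] convention mentioned earlier in the thread, here is a small sketch that is not part of the script above. It reuses the same 2-3-1 tanh network and the same data, and checks that the column layout (Wx) and the row layout (xW^T) give the same loss and the same gradients. The helper names mse_cols and mse_rows are my own, introduced only for this illustration.

```python
import torch

torch.manual_seed(0)
W1 = torch.rand(3, 2, requires_grad=True)   # hidden layer weights
W2 = torch.rand(1, 3, requires_grad=True)   # output layer weights

# Column layout: one example per column, [features, batch]
X_cols = torch.tensor([[1.0, 2.0, 3.0], [0.1, 0.2, 0.3]])
Y_cols = torch.tensor([[3.0, 6.0, 9.0]])
# Row layout: one example per row, [batch, features]
X_rows, Y_rows = X_cols.T, Y_cols.T

def mse_cols():
    out = W2 @ torch.tanh(W1 @ X_cols)        # [1, batch]
    return (out - Y_cols).pow(2).mean()

def mse_rows():
    out = torch.tanh(X_rows @ W1.T) @ W2.T    # [batch, 1]
    return (out - Y_rows).pow(2).mean()

loss_cols, loss_rows = mse_cols(), mse_rows()
grads_cols = torch.autograd.grad(loss_cols, (W1, W2))
grads_rows = torch.autograd.grad(loss_rows, (W1, W2))

print(torch.allclose(loss_cols, loss_rows, atol=1e-5))
print(all(torch.allclose(gc, gr, atol=1e-5)
          for gc, gr in zip(grads_cols, grads_rows)))
```

If both checks print True, the choice between rows and columns is only a matter of convention when you write the network by hand. It matters once you use torch.nn layers, which expect samples along dim0.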