Hi all,

When writing a custom extension in either Python or C++, does PyTorch abstract the notion of a batch?

For example, given an input Tensor **X** of dimension *(N, H, W)*, would one implement an operation like this:

```
class ExampleExtension(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X):
        # An un-vectorisable operation on each Tensor in the batch.
        # Assumes PyTorch has passed the nth Tensor of X as (H, W).
        Xp = torch.zeros_like(X)
        for r in range(X.shape[0]):      # For each row.
            for c in range(X.shape[1]):  # For each column.
                Xp[r, c] = foo(X[r, c])
        # Followed by a vectorisable operation.
        return torch.mm(Xp, X)

    @staticmethod
    def backward(ctx, grad):
        # Same approach.
        ...
```

In the above example, there is an assumption that PyTorch is invoking *ExampleExtension* over each Tensor in the batch.

However, from inspecting the dimensions, I suspect this is not the case. As such, I would do something like this instead:

```
class ExampleExtensionBatched(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X):
        # An un-vectorisable operation on each Tensor in the batch.
        # Assumes PyTorch has passed the full Tensor X as (N, H, W).
        Xp = torch.zeros_like(X)
        for b in range(X.shape[0]):          # For each Tensor in the batch.
            for r in range(X.shape[1]):      # For each row.
                for c in range(X.shape[2]):  # For each column.
                    Xp[b, r, c] = foo(X[b, r, c])
        # Followed by a batch-vectorisable operation.
        return torch.bmm(Xp, X)

    @staticmethod
    def backward(ctx, grad):
        # Same approach.
        ...
```
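For reference, this is roughly how I inspected the dimensions: a throwaway `Function` (`ShapeProbe` and `seen_shapes` are just illustrative names) that records exactly what `forward()` receives when called with an *(N, H, W)* input.

```python
import torch

seen_shapes = []

class ShapeProbe(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X):
        # Record exactly what forward() receives.
        seen_shapes.append(tuple(X.shape))
        return X.clone()

    @staticmethod
    def backward(ctx, grad):
        return grad

X = torch.randn(4, 3, 3)  # (N, H, W)
ShapeProbe.apply(X)
# seen_shapes shows forward() received the full (N, H, W) Tensor,
# not N separate (H, W) slices.
```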

However, the explicit Python loops over the batch seem inefficient.
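Admittedly, if *foo* happened to be expressible in terms of Tensor operations (here I use squaring as a hypothetical stand-in, which is an assumption), the loops would disappear entirely and the two forms should agree:

```python
import torch

def foo(x):
    # Hypothetical stand-in for the real element-wise operation.
    return x * x

X = torch.randn(4, 3, 3)  # (N, H, W)

# Loop version, as in ExampleExtensionBatched.forward.
Xp_loop = torch.zeros_like(X)
for b in range(X.shape[0]):          # For each Tensor in the batch.
    for r in range(X.shape[1]):      # For each row.
        for c in range(X.shape[2]):  # For each column.
            Xp_loop[b, r, c] = foo(X[b, r, c])
out_loop = torch.bmm(Xp_loop, X)

# Vectorised version: foo applied to the whole Tensor at once.
out_vec = torch.bmm(foo(X), X)
```

But in my case *foo* cannot be written that way, hence the question.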

Additionally, when applying the above approach in a C++ and CUDA extension, how would one make use of multiple GPUs? For example, in the case where each *(H, W)* Tensor in the batch is large and would ideally be processed on a different GPU.
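To make that concrete, the manual split I have in mind looks roughly like this (a sketch only: it ignores gradients, uses the same hypothetical `foo` stand-in as above, and falls back to the CPU when no GPUs are available):

```python
import torch

def foo(x):
    # Hypothetical stand-in for the per-element operation.
    return x * x

def batched_op_multi_device(X):
    """Split the batch dimension across available GPUs (or the CPU if none)."""
    n_gpus = torch.cuda.device_count()
    devices = [torch.device(f"cuda:{i}") for i in range(n_gpus)] or [torch.device("cpu")]
    chunks = X.chunk(len(devices), dim=0)  # One slice of the batch per device.
    outs = []
    for chunk, device in zip(chunks, devices):
        chunk = chunk.to(device)
        outs.append(torch.bmm(foo(chunk), chunk))
    # Gather the results back onto the input's device.
    return torch.cat([o.to(X.device) for o in outs], dim=0)

X = torch.randn(4, 3, 3)  # (N, H, W)
Y = batched_op_multi_device(X)
```

Is there an idiomatic way to do this kind of splitting inside a C++/CUDA extension, rather than orchestrating it by hand from Python?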

Best,

Jack