Batched custom extension

Hi all,
When writing a custom extension in either Python or C++, does PyTorch abstract the notion of a batch?

For example, given an input Tensor X of shape (N, H, W), would one implement an operation like this:

import torch

class ExampleExtension(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X):
        # An un-vectorisable operation applied to each element.
        # Assumes PyTorch has passed the nth Tensor of X as (H, W).
        # foo is a placeholder for an arbitrary scalar function.
        Xp = torch.zeros_like(X)
        for r in range(X.shape[0]):  # For each row.
            for c in range(X.shape[1]):  # For each column.
                Xp[r, c] = foo(X[r, c])

        # Followed by a vectorisable operation.
        return torch.mm(Xp, X)

    @staticmethod
    def backward(ctx, grad):
        # Same approach.
        ...

In the above example, there is an assumption that PyTorch invokes ExampleExtension separately on each Tensor in the batch.
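To check this, I added a small throwaway Function that does nothing but report the shape that forward actually receives (ShapeCheck below is just a hypothetical helper for this test, not part of the real extension):

import torch

class ShapeCheck(torch.autograd.Function):
    # Throwaway helper: only reports what forward is handed.
    @staticmethod
    def forward(ctx, X):
        print(X.shape)  # in my runs this appears to show the full (N, H, W) shape
        return X.clone()

    @staticmethod
    def backward(ctx, grad):
        return grad

X = torch.randn(4, 8, 8, requires_grad=True)
ShapeCheck.apply(X)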

However, from inspecting the dimensions I suspect this is not the case, so I would do something like this instead:

class ExampleExtensionBatched(torch.autograd.Function):
    @staticmethod
    def forward(ctx, X):
        # An un-vectorisable operation on each Tensor in the batch.
        # Assumes PyTorch has passed the full Tensor X as (N, H, W).
        Xp = torch.zeros_like(X)
        for b in range(X.shape[0]):  # For each Tensor in the batch.
            for r in range(X.shape[1]):  # For each row.
                for c in range(X.shape[2]):  # For each column.
                    Xp[b, r, c] = foo(X[b, r, c])

        # Followed by a batch-vectorisable operation.
        return torch.bmm(Xp, X)

    @staticmethod
    def backward(ctx, grad):
        # Same approach.
        ...

However, looping over the batch and every element in Python like this seems inefficient.
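To make the comparison concrete: with the unbatched version, the only alternative I can see is looping over the batch at the call site, roughly like this (a sketch using ExampleExtension from the first snippet):

# Manually looping over the batch with the unbatched (H, W) Function,
# then stacking the per-sample results back into a batched Tensor.
out = torch.stack([ExampleExtension.apply(X[b]) for b in range(X.shape[0])])

That keeps the Function itself simple but pushes the batch handling onto every caller.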

Additionally, when implementing the batched approach as a C++ and CUDA extension, how would one make use of multiple GPUs? For example, in the case where each (H, W) Tensor in the batch is large and would ideally be processed on a different GPU.
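The only thing I can come up with from the Python side is manually chunking the batch across devices, along these lines (a rough sketch that assumes two or more visible GPUs and reuses ExampleExtensionBatched from above; the chunking and device placement are purely illustrative):

# Rough sketch: split the batch across the available GPUs and gather the results.
num_gpus = torch.cuda.device_count()
chunks = X.chunk(num_gpus, dim=0)

outputs = []
for i, chunk in enumerate(chunks):
    device = torch.device(f"cuda:{i}")
    # Each chunk is processed by the batched Function on its own device.
    outputs.append(ExampleExtensionBatched.apply(chunk.to(device)))

# Gather everything back onto a single device.
result = torch.cat([o.to("cuda:0") for o in outputs], dim=0)

But this issues the work from a single Python loop, so I am not sure the GPUs would actually be busy at the same time, and I do not know what the idiomatic equivalent would look like inside a C++/CUDA extension.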

Best,
Jack