What's the standard input format for Conv2d?

import torchvision
import numpy as np
from PIL import Image
import torch
model = torchvision.models.resnet50(pretrained=True)
k = Image.open('/p300/data/images/000b7d55b6184b08.png')
k = np.array(k)
k = np.expand_dims(k, axis=0)
k = torch.Tensor(k)
result = model(k)
Here, k is an image whose array shape is [299, 299, 3].

And the error is: RuntimeError: Given groups=1, weight of size 64 3 7 7, expected input[1, 299, 299, 3] to have 3 channels, but got 299 channels instead.

I have already changed k.shape from [299, 299, 3] to [1, 299, 299, 3].
How does Conv2d recognize the input channels?

The input should be (N, C, H, W), not (N, H, W, C). Conv2d reads the channel count from dimension 1, which is why your [1, 299, 299, 3] tensor is interpreted as having 299 channels.
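A minimal sketch of the fix, assuming the image is loaded as an (H, W, C) NumPy array as in the question (a zero array stands in for the actual image file):

```python
import numpy as np
import torch

# Stand-in for an image loaded via Image.open + np.array: shape (H, W, C)
img = np.zeros((299, 299, 3), dtype=np.float32)

x = torch.from_numpy(img)   # (H, W, C)
x = x.permute(2, 0, 1)      # reorder axes -> (C, H, W)
x = x.unsqueeze(0)          # add batch dim -> (N, C, H, W)
print(x.shape)              # torch.Size([1, 3, 299, 299])
```

With the channel axis in position 1, the tensor matches what `nn.Conv2d` (and hence `resnet50`) expects.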

After loading an image with Image.open and converting it with np.array, I get the result in (H, W, C). How can I change it to (N, C, H, W) in a simple way? I find that if I use

trans = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

k = trans(k)

then the code works. Is there a better way to achieve this?