I was just wondering what images really are. I googled a lot but couldn’t find the answers I was looking for, so I would appreciate it if you could either give me detailed insights or links that explain the following questions I have about images. When working with images in deep learning, the images are represented as numbers in a tensor with a certain number of dimensions. Based on that:

What is the basic idea behind changing images to numbers, i.e. how do we change images into tensors of numbers?

What are pixels, and are they fixed in an image? For a certain image, do we have a fixed number of pixels, or can they change based on the resolution we want?

If pixels are the smallest building blocks of an image, what do we mean by a pixel having a certain number of channels, especially for images with more than 3 channels? I know that for images with 3 channels, the channels are R, G, B, but I am not sure what the channels are for, say, an image of 36 channels.

How do we measure the size of an image and the number of bits required for it?

Thanks in advance.

This is a very general question, but I’ll answer it assuming your intention is to do computer vision, in particular with PyTorch.

An image is a 2d matrix of pixels, shaped like height x width, and each pixel is either a single number if the image is black and white, or a set of 3 numbers if the image is in color. In the color image case, you can therefore think of the image as a 3d matrix (you can call it a “tensor”) shaped like 3 x height x width.

The value of one pixel can be represented differently depending on convention, but in PyTorch it goes between 0.0 and 1.0, where 0.0 is “none of that color” and 1.0 is “maximum of that color”. The order of colors is RGB as you said, meaning red, green, blue. So for example a pixel with a value of (1, 0, 0) would represent a fully red pixel, with no green and no blue. (0, 0, 0) is none of any color, meaning black, and (1, 1, 1) is the maximum of every color, meaning white. In other frameworks the value of a pixel can be represented as an integer between 0 and 255, where 0 is the minimum and 255 is the maximum of a certain color.
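To make the convention concrete, here is a tiny hand-built example (the specific pixel positions are just illustrative):

```python
import torch

# A tiny 3 x 2 x 2 color image (channels x height x width), values in [0, 1].
image = torch.zeros(3, 2, 2)
image[:, 0, 0] = torch.tensor([1.0, 0.0, 0.0])  # top-left pixel: pure red
image[:, 0, 1] = torch.tensor([0.0, 0.0, 0.0])  # top-right pixel: black
image[:, 1, 0] = torch.tensor([1.0, 1.0, 1.0])  # bottom-left pixel: white
print(image[:, 1, 0])  # tensor([1., 1., 1.])
```

Indexing `image[:, row, col]` picks out all 3 channel values of one pixel at once.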

There is no single answer to this, as different libraries handle things a bit differently. Common Python libraries for image manipulation are PIL/Pillow, which has a special Image class, and OpenCV, which treats the image like an array with a slightly different convention than what I said above (BGR instead of RGB) but the same underlying idea. PyTorch works with tensors, which are basically arrays with some extra frills needed to train models, and it also has support for handling PIL Images. PyTorch expects its image tensors to have the channel dimension before the spatial dimensions, so their shape is 3 x height x width. You can change from a PIL Image to a tensor via the operator torchvision.transforms.ToTensor(), and you can change from a regular numpy array to a tensor by just wrapping the array like torch.Tensor(array).

So below we generate a random color image of 224 x 224, convert it to a PIL Image, and finally to a tensor:

import numpy as np
import torch
import torchvision
from PIL import Image

numpy_image = np.random.rand(224, 224, 3)  # height x width x channels, floats in [0, 1)
pil_image = Image.fromarray(np.uint8(255 * numpy_image))
# PIL image --> tensor (ToTensor also moves channels first and rescales to [0, 1])
tensor = torchvision.transforms.ToTensor()(pil_image)
# numpy array --> tensor
tensor = torch.Tensor(numpy_image).permute(2, 0, 1)  # permute goes from h, w, c to c, h, w

(nb: the two methods won’t produce exactly the same tensor, since the PIL route quantizes the values to 8-bit integers first, but close enough)

The pixels in an image are fixed in number. You can enlarge or shrink a picture, known as up-sampling or down-sampling, but that means changing the number of pixels, so you have to decide how you want to generate new pixels or “compress” existing ones. Common methods include interpolation and averaging.
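As a sketch of down-sampling with interpolation, PyTorch’s `torch.nn.functional.interpolate` resizes an image tensor (the 224 → 112 sizes here are just an example):

```python
import torch
import torch.nn.functional as F

# Down-sample a 3 x 224 x 224 image to 3 x 112 x 112 by bilinear interpolation.
# interpolate expects a batch dimension, hence the unsqueeze/squeeze.
image = torch.rand(3, 224, 224)
smaller = F.interpolate(image.unsqueeze(0), size=(112, 112),
                        mode="bilinear", align_corners=False).squeeze(0)
print(smaller.shape)  # torch.Size([3, 112, 112])
```

The same call with a larger `size` would up-sample instead, interpolating new pixel values between the existing ones.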

An image of 36 channels is not really a typical image, so I’m not sure what you mean by that. Some images can have a 4th channel called “alpha”, which is basically like a mask (to allow for part of the image to be transparent against the background). A tensor, however, can have an arbitrary shape. But in that case the tensor ceases to be an image per se and becomes a more abstract object, which may be derived from the image in some way but is no longer literally an image. For example, you can get 36 channels by stacking 12 slightly modified copies of the same image along the channel dimension (think stacking 12 3-inch pancakes to make a 36-inch stack).
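The pancake-stacking idea from above, as a (purely illustrative) snippet:

```python
import torch

# A 3-channel image stacked 12 times along the channel dimension yields a
# 36-channel tensor. It's no longer an "image" per se, just a tensor whose
# channel dimension happens to be 36.
image = torch.rand(3, 224, 224)
stacked = torch.cat([image] * 12, dim=0)
print(stacked.shape)  # torch.Size([36, 224, 224])
```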

You count the pixels and then multiply! See here for some details.
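A rough sketch of that arithmetic for the uncompressed case: size = pixels × channels × bits per channel.

```python
# Uncompressed size of a 224 x 224 RGB image stored as 8-bit integers.
height, width, channels = 224, 224, 3
bits_per_channel = 8  # the common uint8 case
total_bits = height * width * channels * bits_per_channel
total_bytes = total_bits // 8
print(total_bytes)  # 150528 bytes, about 147 KiB
```

Note that formats like JPEG and PNG compress the data, so the file on disk is usually much smaller than this raw figure.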

So what I meant by 36 channels is that when working with convolutional neural networks, the output channels in the middle layers usually have more than 3 channels, so I thought they have a certain meaning. Are the intermediate channels in a CNN (which are usually more than 3) just stacks of 3 channels, if I understand your explanation in 3 correctly? Thank you for your concise and detailed explanation. I really appreciate it.

I guess the pancake analogy was confusing: no, they aren’t just stacks of 3 color channels.

In the belly of a CNN, the channels typically encode certain features. If an intermediate layer is a tensor of shape 36 x 224 x 224, then you can sort of think about each of the 36 channels as a 224 x 224 heat map of the likelihood of a particular feature being centered at each pixel. The features might represent things like edges, shapes, and more complicated patterns, which depend on your network and your data. Here’s a fun visualization of what they could be.
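To see where those 36 channels would come from, here is a minimal sketch (the layer sizes are just an example): a single convolutional layer maps 3 input channels to 36 output channels, each produced by its own learned filter.

```python
import torch
import torch.nn as nn

# 3 input channels -> 36 output channels. Each of the 36 output channels is
# computed by its own learned filter that looks at all 3 input channels, so
# the outputs are learned feature maps, not stacked copies of R, G, B.
conv = nn.Conv2d(in_channels=3, out_channels=36, kernel_size=3, padding=1)
image = torch.rand(1, 3, 224, 224)  # a batch containing one image
features = conv(image)
print(features.shape)  # torch.Size([1, 36, 224, 224])
```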

Typically you’ll find that as the feature channel gets bigger, the other two dimensions of the tensor shrink until they become 1 (so what starts off as 3 x 224 x 224 might become 128 x 32 x 32 in some intermediate step, and finally might become 1024 x 1 x 1). So, gradually, details about the positioning of features in the image get lost, replaced by a higher-level representation of what the image contains (e.g. what starts off as “a vertical edge in the top left corner, a horizontal edge in the top right corner, a circle in the middle… etc.” eventually becomes “a cat”).
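That shrinking can be sketched with a toy stack of layers (not a real architecture, just strided convolutions plus a final pooling step):

```python
import torch
import torch.nn as nn

# A toy sketch of channels growing while spatial dimensions shrink.
net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 3 x 224 x 224 -> 32 x 112 x 112
    nn.Conv2d(32, 128, kernel_size=3, stride=2, padding=1),  # -> 128 x 56 x 56
    nn.AdaptiveAvgPool2d(1),                                 # -> 128 x 1 x 1
)
out = net(torch.rand(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 128, 1, 1])
```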