The problem is that I want to cat two tensors, but their shapes don’t match, and I don’t know how to cut the tensor x_skip down to the spatial dimensions of the tensor x.

This is part of building a U-Net architecture that takes an input of any size and outputs a feature map of the same size.

To answer your specific question: resize_() doesn’t do what you would
want it to – it will mess up the two-dimensional structure of your image.
You can just slice into x_skip to trim off the excess of one row and column
of pixels:

x = torch.cat((x, x_skip[:, :, :x.shape[2], :x.shape[3]]), dim=1)

(Of course, the x and trimmed x_skip might not align ideally.)
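Here is a minimal sketch of that trimming, plus a center-cropped variant that may line the two tensors up a little better. The helper names (trim_to_match, center_crop_to_match) and the tensor shapes are made up for illustration, and both helpers assume x_skip is at least as large as x in height and width.

```python
import torch

def trim_to_match(x_skip, x):
    # Keep the top-left region of x_skip that has x's height and width.
    return x_skip[:, :, :x.shape[2], :x.shape[3]]

def center_crop_to_match(x_skip, x):
    # Split the excess rows/columns between both sides instead of
    # taking everything off the bottom and right.
    top = (x_skip.shape[2] - x.shape[2]) // 2
    left = (x_skip.shape[3] - x.shape[3]) // 2
    return x_skip[:, :, top:top + x.shape[2], left:left + x.shape[3]]

# Illustrative shapes only: x_skip has one extra row and column.
x = torch.randn(1, 256, 24, 24)
x_skip = torch.randn(1, 256, 25, 25)

out = torch.cat((x, trim_to_match(x_skip, x)), dim=1)
out_centered = torch.cat((x, center_crop_to_match(x_skip, x)), dim=1)
```

With only one excess row and column, the two variants pick the same region; they differ once the mismatch is two or more pixels.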

Two comments (in the context of the original U-Net paper, rather than that
of the link you cite):

When you perform U-Net’s “up-convolution,” you reduce the number of
“features” (the dimension of your x tensor that is of size 512) by a factor
of two. So your x and x_skip should have the same number of features,
which I assume should be 256. Check where x is coming from and why
it appears to have twice as many “features” as it should.
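To illustrate the feature counts, here is a rough sketch of one up-convolution step followed by the cat(). The layer arguments and tensor sizes are my assumptions for the example (a 2x2 transposed convolution with stride 2, and small 4x4 / 8x8 feature maps), not taken from your code.

```python
import torch
import torch.nn as nn

# Assumed up-convolution: doubles height and width, halves the features.
up_conv = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)

x = torch.randn(1, 512, 4, 4)        # coming up from the lower level
x_skip = torch.randn(1, 256, 8, 8)   # saved from the contracting path

up = up_conv(x)                          # 256 features, 8x8
merged = torch.cat((up, x_skip), dim=1)  # 256 + 256 = 512 features
```

After the cat(), the merged tensor is back at 512 features, which the following convolutions then reduce again.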

In my view, the cleanest way to get the height and width of x and x_skip
to match is to only input into the U-Net images whose height and width are
such that as the “image” progresses through the U-Net the intermediate
heights and widths are always evenly divisible by two when passed through
each 2x2 max-pooling downsampling step. This way, you’ll never have to
pad or trim x or x_skip to get them to match in height and width (as they
need to when you cat() them).
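One way to check whether a given height or width is self-consistent in this sense is to trace it down through the pooling steps. The function below is a sketch under stated assumptions: is_self_consistent is a hypothetical helper, num_pools is the number of 2x2 max-pooling steps, and shrink_per_level is how much the convolutions at each level reduce the size (4 for two unpadded 3x3 convolutions as in the original U-Net paper, 0 for padded convolutions).

```python
def is_self_consistent(size, num_pools=4, shrink_per_level=4):
    """Return True if `size` is even at every 2x2 max-pooling step."""
    for _ in range(num_pools):
        size -= shrink_per_level      # convolutions before the pooling
        if size <= 0 or size % 2 != 0:
            return False              # odd size: pooling would drop a row
        size //= 2                    # 2x2 max-pooling halves the size
    # The convolutions at the bottom level must still leave something.
    return size - shrink_per_level > 0

# The 572-pixel input used in the U-Net paper passes; 570 does not.
paper_ok = is_self_consistent(572)
paper_bad = is_self_consistent(570)

# With padded convolutions and three pooling steps.
padded_ok = is_self_consistent(64, num_pools=3, shrink_per_level=0)
padded_bad = is_self_consistent(36, num_pools=3, shrink_per_level=0)
```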

In order to input images of such “self-consistent” heights and widths, you
can either choose tiles of such sizes, if you’re using a tiling strategy, or
you can pad the input image up to such a self-consistent size. (The U-Net
paper recommends reflection padding, which it calls mirroring.)
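As a sketch of the padding option, the snippet below reflection-pads an image up to the next multiple of some number. It assumes padded convolutions, so that "self-consistent" just means divisible by 2 to the power of the number of pooling steps (16 for four steps); pad_to_multiple and the image size are hypothetical. With unpadded convolutions, as in the original paper, the target size would have to be worked out differently.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(img, multiple=16):
    # img is assumed to have shape (batch, channels, height, width).
    h, w = img.shape[-2:]
    pad_h = (-h) % multiple   # rows needed to reach the next multiple
    pad_w = (-w) % multiple   # columns needed to reach the next multiple
    top, left = pad_h // 2, pad_w // 2
    bottom, right = pad_h - top, pad_w - left
    # F.pad takes the pads for the last dimension first: (left, right, top, bottom).
    padded = F.pad(img, (left, right, top, bottom), mode="reflect")
    return padded, (top, left)

img = torch.randn(1, 3, 37, 50)
padded, (top, left) = pad_to_multiple(img)

# The offsets let you crop the network output back to the original size.
restored = padded[..., top:top + 37, left:left + 50]
```

Note that reflection padding requires each pad amount to be smaller than the corresponding image dimension, so this only works for images that are not tiny compared to the multiple.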

Re the features, I am not highly opinionated at the moment, but it seems to me that concatenating any number of features is not a problem as long as the spatial dimensions match. Why might an equal number of features be better?

The trimming strategy for U-Nets you just suggested looks like a solution for now. As I understand it, self-consistent image sizes are those whose spatial dimensions are divisible by 8. Can you clarify what you mean by “choose tiles of such sizes”?