Reasons for using convolutional layers with 1x1 kernels

I came across a paper called Inertial-based Activity Recognition with Transformers, in which the authors use a CNN + Transformer encoders + fully connected layers to classify human activities from sensor data.

I dug into the code implementation and found that they use convolutional layers with a kernel size of 1x1. Can anyone explain why the authors chose 1x1 convolutional layers? What is their significance or effect in the model?

To my understanding, conv layers with a 1x1 kernel generate new higher-level features by linearly combining data from all input channels at each position, without changing the spatial (or temporal) dimensions of the input.
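For concreteness, my mental model is that a kernel-size-1 convolution over a (channels, time) sensor window is the same as applying one shared linear layer across the channels at every time step. Here is a minimal sanity check of that equivalence (the shapes are made up for illustration, not taken from the paper):

import torch
import torch.nn as nn

# Hypothetical sensor batch: 8 windows, 6 channels (3-axis accel + 3-axis gyro), 128 time steps
x = torch.randn(8, 6, 128)

conv = nn.Conv1d(in_channels=6, out_channels=16, kernel_size=1)
linear = nn.Linear(6, 16)

# Copy the conv weights into the linear layer so both compute the same function
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1))  # (16, 6, 1) -> (16, 6)
    linear.bias.copy_(conv.bias)

out_conv = conv(x)                                       # (8, 16, 128)
out_linear = linear(x.transpose(1, 2)).transpose(1, 2)   # same per-time-step mapping

print(torch.allclose(out_conv, out_linear, atol=1e-5))   # True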

Am I right? Can someone explain the significance or effect of using a 1x1 kernel size in this scenario of human activity recognition from sensor data?

Sounds correct to me. The best way to see it is to visualize the output on a single image.
You can try this in a Jupyter notebook:

from sklearn.datasets import load_sample_image
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn

%matplotlib inline

flower = load_sample_image('flower.jpg')  # (H, W, 3) uint8 image
print(flower.shape)
plt.imshow(flower)
plt.figure()

# Channels-first float tensor with a batch dimension: (1, 3, H, W)
flower_torch = torch.from_numpy(flower.transpose(2, 0, 1).astype(np.float32)).unsqueeze(0)

# 1x1 convolution: recombines the 3 input channels at every pixel independently
conv_layer = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=1, stride=1, padding=0)
output = conv_layer(flower_torch).squeeze(0)
print(output.shape)  # same H and W as the input; only the channels are mixed

# Rescale to [0, 1] for display, since the randomly initialized layer's output is unbounded
out_np = output.detach().numpy().transpose(1, 2, 0)
out_np = (out_np - out_np.min()) / (out_np.max() - out_np.min())
plt.imshow(out_np)
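Note that the printed shape matches the input's height and width exactly: a 1x1 kernel with stride 1 looks at one pixel at a time and only recombines its channels, which is why the displayed image keeps its structure but shifts in color. In the HAR setting this is presumably the point: a kernel-size-1 conv is a cheap, learned per-time-step projection of the sensor channels (for example, up to the embedding size the Transformer encoder expects), adding channel interactions without mixing information across time.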