I came across a paper called "Inertial-based Activity Recognition with Transformers", in which the authors use a CNN + Transformer encoders + fully connected layers to classify human activities from sensor data.
I dug into the code implementation and found that they use convolutional layers with a kernel size of 1x1. Can anyone explain why the authors chose 1x1 convolutions, and what their significance or effect is in this model?
To my understanding, conv layers with a 1x1 kernel generate new higher-level features by combining information from all input channels at each time step, without changing the length of the time axis.
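To make my understanding concrete, here is a minimal sketch (plain PyTorch, not the paper's actual code; the channel count and window length are made-up values, not taken from the paper) of what I think the 1x1 convolution is doing:

```python
import torch
import torch.nn as nn

# Hypothetical input, not from the paper: a batch of 8 windows,
# 6 inertial channels (e.g. 3-axis accelerometer + 3-axis gyroscope),
# each window 128 time steps long.
x = torch.randn(8, 6, 128)

# A 1x1 convolution over the time axis: each output feature at time t
# is a learned linear combination of all 6 input channels at that same t.
pointwise = nn.Conv1d(in_channels=6, out_channels=64, kernel_size=1)

y = pointwise(x)
print(y.shape)  # torch.Size([8, 64, 128]) -- channels mixed, time length unchanged
```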
Am I right here? And what is the specific benefit of 1x1 kernels in this scenario of human activity recognition from sensor data?