Why does AlexNet in torchvision use average pooling?

In the paper, the authors explicitly state that they used only max-pooling layers with kernel_size=3 and stride=2, but I am unsure how they made it work.

I have been trying to recreate the model as described in the paper, and the transition from the convolution to the dense layers is a roadblock I haven’t been able to resolve yet. I am probably missing something basic and would appreciate your input!
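
To make the conv-to-dense transition concrete, here is a minimal sketch of the shape arithmetic, in plain Python with no PyTorch needed. The kernel, stride, and padding values are assumptions taken from my reading of the torchvision layer definitions (224x224 input, padding=2 on the first convolution), not from the paper, and the helper names `conv_out`, `trace_sizes`, and `flattened_features` are made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel_size, stride, padding): assumed values, see the note above.
LAYERS = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def trace_sizes(input_size=224):
    """Return [(layer_name, spatial_size), ...] for a square input."""
    sizes = []
    size = input_size
    for name, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


def flattened_features(input_size=224, channels=256):
    """Number of inputs the first dense layer sees after flattening."""
    final = trace_sizes(input_size)[-1][1]
    return channels * final * final


if __name__ == "__main__":
    for name, size in trace_sizes(224):
        print(f"{name}: {size}x{size}")
    print("flattened:", flattened_features(224))
```

Under these assumptions, the three max-pooling layers alone bring a 224x224 input down to a 6x6 map, so flattening gives 256 * 6 * 6 = 9216 inputs for the first dense layer. An adaptive average pool to 6x6 would leave a map of that size unchanged. It would only matter for other input sizes, for example a 256x256 input, which ends at 7x7 here.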