Attempting to reduce memory consumption by fusing Conv1d with MaxPool1d

Currently I am working on a speech recognition project, and I am experimenting with the SincNet architecture for extracting speech features. A SincNet layer is essentially a Conv1d with some optimization of the kernel window parameters. In this architecture the conv layer is followed by a max pool layer.

Here is the problem I am facing. The input to SincNet is a waveform signal with one channel. I have about 15 seconds of wav at a 16K rate, which amounts to about 24K input samples. The SincNet layer produces 64 channels (window size and stride are 1). With a batch size of 32, the number of activation values explodes to about 49M. The max pool layer would reduce this number, but I would like to apply max pooling immediately after the convolution is computed for each channel, so that the full conv output is never held in memory at once.
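To make the numbers concrete, here is a minimal sketch of the layer stack as I understand it (the kernel size is a placeholder I picked for illustration, not the actual SincNet hyperparameter):

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters: 1 input channel -> 64 filters, stride 1.
conv = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=251, stride=1)
pool = nn.MaxPool1d(kernel_size=3)

x = torch.randn(32, 1, 24000)  # batch of 32 waveforms, ~24K samples each

y = conv(x)   # (32, 64, 23750): roughly 49M float32 values before pooling
z = pool(y)   # pooling shrinks the time axis by 3x, but only *after*
              # the full conv output has already been materialized
print(y.numel(), z.numel())
```

So the memory spike comes from `y` existing in full before `pool` ever runs.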

What would you recommend for this problem?

One obvious solution is to operate plane by plane (channel by channel) and stack the results. However, I am not familiar with PyTorch's memory management or the asynchronous nature of CUDA operations, so I am not sure whether this would actually reduce peak memory or how it would affect performance.
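This is roughly what I have in mind, as a sketch (the chunk size and kernel size are arbitrary choices for illustration): apply the convolution a few output channels at a time via `F.conv1d` on a slice of the weight tensor, pool each slice immediately, and concatenate the pooled results, so only `chunk` full-resolution channels are alive at any moment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv1d(1, 64, kernel_size=251, stride=1)  # placeholder hyperparameters
pool = nn.MaxPool1d(kernel_size=3)

def conv_pool_chunked(x, conv, pool, chunk=8):
    # Run the conv `chunk` output channels at a time and pool each slice
    # right away; peak memory holds `chunk` unpooled channels instead of 64.
    outs = []
    for start in range(0, conv.out_channels, chunk):
        w = conv.weight[start:start + chunk]
        b = conv.bias[start:start + chunk] if conv.bias is not None else None
        y = F.conv1d(x, w, b, stride=conv.stride, padding=conv.padding)
        outs.append(pool(y))  # y for this slice is freed once pooled
    return torch.cat(outs, dim=1)

x = torch.randn(32, 1, 24000)
out = conv_pool_chunked(x, conv, pool)
ref = pool(conv(x))
print(torch.allclose(out, ref, atol=1e-6))
```

Since each output channel of a Conv1d is computed independently, slicing the weights this way should give the same result as the fused layers. What I do not know is whether, with CUDA's asynchronous kernel launches, the intermediate `y` tensors are actually freed early enough for the peak allocation to drop, or whether the per-chunk launches hurt throughput.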