Trimming with sox_effects becomes continually slower

My data loading process involves extracting chunks from large audio files (approx. 300 MiB), where the chunk positions and durations are given in an extra CSV. For this, I am using sox_effects with the trim effect. An MWE of the kind of code I have is this:

import torchaudio
import time

path = "file.wav"
for t in range(0, 1000, 5):
    t1 = time.time()
    waveform, sampling_rate = torchaudio.sox_effects.apply_effects_file(
        path, [["trim", str(t), "2"], ["rate", "16000"]]
    )
    t2 = time.time()
    print(f"Trim {t}: {t2-t1} s", flush=True)

Unfortunately, the timings go up when we continue in time:

...
Trim 210: 0.0590972900390625 s
Trim 215: 0.07564163208007812 s
Trim 220: 0.06164240837097168 s
Trim 225: 0.06312108039855957 s
Trim 230: 0.08079743385314941 s
Trim 235: 0.0691988468170166 s
Trim 240: 0.06682896614074707 s
Trim 245: 0.08328723907470703 s
Trim 250: 0.07012438774108887 s
Trim 255: 0.0698857307434082 s
Trim 260: 0.09214544296264648 s
Trim 265: 0.07315945625305176 s
Trim 270: 0.07333612442016602 s
Trim 275: 0.08809351921081543 s
Trim 280: 0.07526302337646484 s
Trim 285: 0.0772089958190918 s
Trim 290: 0.09447932243347168 s
Trim 295: 0.07959461212158203 s
Trim 300: 0.07989311218261719 s
Trim 305: 0.09696745872497559 s
Trim 310: 0.07900452613830566 s
Trim 315: 0.08150839805603027 s
Trim 320: 0.1008293628692627 s
Trim 325: 0.08099675178527832 s
Trim 330: 0.08350300788879395 s
Trim 335: 0.10298895835876465 s
Trim 340: 0.0880730152130127 s
Trim 345: 0.08779191970825195 s
Trim 350: 0.10423898696899414 s
Trim 355: 0.08973550796508789 s
Trim 360: 0.09125137329101562 s
Trim 365: 0.1061713695526123 s
Trim 370: 0.09549927711486816 s
Trim 375: 0.0945887565612793 s
Trim 380: 0.1141664981842041 s
Trim 385: 0.09656167030334473 s
Trim 390: 0.09741783142089844 s
Trim 395: 0.12029099464416504 s
Trim 400: 0.09925198554992676 s
Trim 405: 0.1011509895324707 s
Trim 410: 0.12583160400390625 s
Trim 415: 0.1045069694519043 s
Trim 420: 0.10638189315795898 s
Trim 425: 0.12271857261657715 s
Trim 430: 0.10671520233154297 s
Trim 435: 0.10863161087036133 s
Trim 440: 0.12616586685180664 s
Trim 445: 0.11424088478088379 s
Trim 450: 0.11148786544799805 s
Trim 455: 0.13032269477844238 s
Trim 460: 0.11519789695739746 s
Trim 465: 0.1179652214050293 s
Trim 470: 0.12706661224365234 s
Trim 475: 0.11927914619445801 s
Trim 480: 0.11730289459228516 s
Trim 485: 0.13298892974853516 s
Trim 490: 0.12297725677490234 s
Trim 495: 0.12051892280578613 s
Trim 500: 0.1350400447845459 s
Trim 505: 0.1254897117614746 s
Trim 510: 0.1261124610900879 s
Trim 515: 0.13534116744995117 s
Trim 520: 0.12975478172302246 s
Trim 525: 0.1306755542755127 s
Trim 530: 0.1432344913482666 s
Trim 535: 0.1339256763458252 s
Trim 540: 0.13191962242126465 s
Trim 545: 0.14847111701965332 s
Trim 550: 0.138657808303833 s
Trim 555: 0.13456296920776367 s
Trim 560: 0.1497037410736084 s
Trim 565: 0.1382899284362793 s
Trim 570: 0.13804960250854492 s
Trim 575: 0.14861106872558594 s
Trim 580: 0.14386320114135742 s
Trim 585: 0.14127588272094727 s
Trim 590: 0.1532754898071289 s

I didn’t expect this, as I am just trimming to constantly sized windows.

This does not happen when I call sox directly:

for i in {1..1000..5}; do /usr/bin/time -f "Trim $i: %E s" sox file.wav /tmp/out$i.wav trim $i 2 rate 16000; done
...
Trim 916: 0:00.01 s
Trim 921: 0:00.01 s
Trim 926: 0:00.01 s
Trim 931: 0:00.01 s
Trim 936: 0:00.01 s
Trim 941: 0:00.01 s
Trim 946: 0:00.01 s
Trim 951: 0:00.00 s
Trim 956: 0:00.01 s
Trim 961: 0:00.01 s
Trim 966: 0:00.01 s
Trim 971: 0:00.01 s
Trim 976: 0:00.01 s
Trim 981: 0:00.01 s
Trim 986: 0:00.01 s
Trim 991: 0:00.01 s
Trim 996: 0:00.01 s

Why is this? Am I using sox_effects wrong? Are there memory issues, and if so – how to debug them?

Here’s what memory-profiler gives me:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    79    370.6 MiB    370.6 MiB           1       @profile
    80                                             def _run():
    81    370.6 MiB      0.0 MiB           1           path = "../data/grass/conversations/001M002M_001M_HM2_CS_001.wav"
    82    373.6 MiB      0.0 MiB         201           for t in range(0, 1000, 5):
    83    373.6 MiB      0.0 MiB         200               t1 = time.time()
    84    373.6 MiB      3.0 MiB         400               waveform, sampling_rate = torchaudio.sox_effects.apply_effects_file(
    85    373.6 MiB      0.0 MiB         200                   path, [["trim", str(t), "2"], ["rate", "16000"]]
    86                                                     )
    87    373.6 MiB      0.0 MiB         200               t2 = time.time()
    88    373.6 MiB      0.0 MiB         200               print(f"Trim {t}: {t2-t1} s", flush=True)

Doesn’t seem too bad…

Output with torch.profiler:

----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::empty         0.01%       3.058ms         0.01%       3.058ms      15.139us      24.41 Mb      24.41 Mb           202  
                                   aten::zeros         0.04%       9.298ms         0.04%      11.210ms      11.210ms           4 b           0 b             1  
                                   aten::zero_         0.01%       1.877ms         0.01%       1.877ms       1.877ms           0 b           0 b             1  
    torchaudio::sox_effects_apply_effects_file        99.83%       25.307s        99.87%       25.319s     126.596ms      24.41 Mb           0 b           200  
                               aten::transpose         0.03%       6.372ms         0.03%       8.713ms      43.565us           0 b           0 b           200  
                              aten::as_strided         0.01%       2.341ms         0.01%       2.341ms      11.705us           0 b           0 b           200  
                                      trimming         0.08%      20.746ms        99.96%       25.340s       25.340s          -4 b     -24.41 Mb             1  
----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 25.351s