My data loading process involves extracting chunks from large audio files (approx. 300 MiB), where the chunk positions and durations are given in an extra CSV. For this, I am using sox_effects
with the trim effect. An MWE of the kind of code I have is this:
import torchaudio
import time
path = "file.wav"
for t in range(0, 1000, 5):
t1 = time.time()
waveform, sampling_rate = torchaudio.sox_effects.apply_effects_file(
path, [["trim", str(t), "2"], ["rate", "16000"]]
)
t2 = time.time()
print(f"Trim {t}: {t2-t1} s", flush=True)
Unfortunately, the timings go up when we continue in time:
...
Trim 210: 0.0590972900390625 s
Trim 215: 0.07564163208007812 s
Trim 220: 0.06164240837097168 s
Trim 225: 0.06312108039855957 s
Trim 230: 0.08079743385314941 s
Trim 235: 0.0691988468170166 s
Trim 240: 0.06682896614074707 s
Trim 245: 0.08328723907470703 s
Trim 250: 0.07012438774108887 s
Trim 255: 0.0698857307434082 s
Trim 260: 0.09214544296264648 s
Trim 265: 0.07315945625305176 s
Trim 270: 0.07333612442016602 s
Trim 275: 0.08809351921081543 s
Trim 280: 0.07526302337646484 s
Trim 285: 0.0772089958190918 s
Trim 290: 0.09447932243347168 s
Trim 295: 0.07959461212158203 s
Trim 300: 0.07989311218261719 s
Trim 305: 0.09696745872497559 s
Trim 310: 0.07900452613830566 s
Trim 315: 0.08150839805603027 s
Trim 320: 0.1008293628692627 s
Trim 325: 0.08099675178527832 s
Trim 330: 0.08350300788879395 s
Trim 335: 0.10298895835876465 s
Trim 340: 0.0880730152130127 s
Trim 345: 0.08779191970825195 s
Trim 350: 0.10423898696899414 s
Trim 355: 0.08973550796508789 s
Trim 360: 0.09125137329101562 s
Trim 365: 0.1061713695526123 s
Trim 370: 0.09549927711486816 s
Trim 375: 0.0945887565612793 s
Trim 380: 0.1141664981842041 s
Trim 385: 0.09656167030334473 s
Trim 390: 0.09741783142089844 s
Trim 395: 0.12029099464416504 s
Trim 400: 0.09925198554992676 s
Trim 405: 0.1011509895324707 s
Trim 410: 0.12583160400390625 s
Trim 415: 0.1045069694519043 s
Trim 420: 0.10638189315795898 s
Trim 425: 0.12271857261657715 s
Trim 430: 0.10671520233154297 s
Trim 435: 0.10863161087036133 s
Trim 440: 0.12616586685180664 s
Trim 445: 0.11424088478088379 s
Trim 450: 0.11148786544799805 s
Trim 455: 0.13032269477844238 s
Trim 460: 0.11519789695739746 s
Trim 465: 0.1179652214050293 s
Trim 470: 0.12706661224365234 s
Trim 475: 0.11927914619445801 s
Trim 480: 0.11730289459228516 s
Trim 485: 0.13298892974853516 s
Trim 490: 0.12297725677490234 s
Trim 495: 0.12051892280578613 s
Trim 500: 0.1350400447845459 s
Trim 505: 0.1254897117614746 s
Trim 510: 0.1261124610900879 s
Trim 515: 0.13534116744995117 s
Trim 520: 0.12975478172302246 s
Trim 525: 0.1306755542755127 s
Trim 530: 0.1432344913482666 s
Trim 535: 0.1339256763458252 s
Trim 540: 0.13191962242126465 s
Trim 545: 0.14847111701965332 s
Trim 550: 0.138657808303833 s
Trim 555: 0.13456296920776367 s
Trim 560: 0.1497037410736084 s
Trim 565: 0.1382899284362793 s
Trim 570: 0.13804960250854492 s
Trim 575: 0.14861106872558594 s
Trim 580: 0.14386320114135742 s
Trim 585: 0.14127588272094727 s
Trim 590: 0.1532754898071289 s
I didn’t expect this, as I am just trimming to constantly sized windows.
This does not happen when I call sox directly:
for i in {1..1000..5}; do /usr/bin/time -f "Trim $i: %E s" sox file.wav /tmp/out$i.wav trim $i 2 rate 16000; done
...
Trim 916: 0:00.01 s
Trim 921: 0:00.01 s
Trim 926: 0:00.01 s
Trim 931: 0:00.01 s
Trim 936: 0:00.01 s
Trim 941: 0:00.01 s
Trim 946: 0:00.01 s
Trim 951: 0:00.00 s
Trim 956: 0:00.01 s
Trim 961: 0:00.01 s
Trim 966: 0:00.01 s
Trim 971: 0:00.01 s
Trim 976: 0:00.01 s
Trim 981: 0:00.01 s
Trim 986: 0:00.01 s
Trim 991: 0:00.01 s
Trim 996: 0:00.01 s
Why is this? Am I using sox_effects
wrong? Are there memory issues, and if so – how to debug them?
Here’s what memory-profiler gives me:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
79 370.6 MiB 370.6 MiB 1 @profile
80 def _run():
81 370.6 MiB 0.0 MiB 1 path = "../data/grass/conversations/001M002M_001M_HM2_CS_001.wav"
82 373.6 MiB 0.0 MiB 201 for t in range(0, 1000, 5):
83 373.6 MiB 0.0 MiB 200 t1 = time.time()
84 373.6 MiB 3.0 MiB 400 waveform, sampling_rate = torchaudio.sox_effects.apply_effects_file(
85 373.6 MiB 0.0 MiB 200 path, [["trim", str(t), "2"], ["rate", "16000"]]
86 )
87 373.6 MiB 0.0 MiB 200 t2 = time.time()
88 373.6 MiB 0.0 MiB 200 print(f"Trim {t}: {t2-t1} s", flush=True)
Doesn’t seem too bad…
Output with torch.profiler
:
---------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
---------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::empty 0.01% 3.058ms 0.01% 3.058ms 15.139us 24.41 Mb 24.41 Mb 202
aten::zeros 0.04% 9.298ms 0.04% 11.210ms 11.210ms 4 b 0 b 1
aten::zero_ 0.01% 1.877ms 0.01% 1.877ms 1.877ms 0 b 0 b 1
torchaudio::sox_effects_apply_effects_file 99.83% 25.307s 99.87% 25.319s 126.596ms 24.41 Mb 0 b 200
aten::transpose 0.03% 6.372ms 0.03% 8.713ms 43.565us 0 b 0 b 200
aten::as_strided 0.01% 2.341ms 0.01% 2.341ms 11.705us 0 b 0 b 200
trimming 0.08% 20.746ms 99.96% 25.340s 25.340s -4 b -24.41 Mb 1
---------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 25.351s