I’m running prediction over 5 documents and checking the profiling output. Each document may have 400+ pages. I have around 5,500 documents to look up, and the prediction pipeline takes several hours to execute.
CPU RAM → more than 100 GB available
GPU → 2 NVIDIA Tesla V100s, 32 GB memory each.
Even for 5 documents, it takes 520+ seconds to execute.
Here is the profiling output:
205403359 function calls (192726646 primitive calls) in 525.353 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
257 300.185 1.168 300.185 1.168 {method 'cpu' of 'torch._C._TensorBase' objects}
5 22.183 4.437 25.969 5.194 base.py:867(dump)
27 22.127 0.820 22.127 0.820 {built-in method gc.collect}
34 14.393 0.423 14.406 0.424 {built-in method _pickle.load}
1067 14.166 0.013 14.166 0.013 {method 'normal_' of 'torch._C._TensorBase' objects}
5722 11.909 0.002 12.199 0.002 <frozen importlib._bootstrap_external>:914(get_data)
69413 10.954 0.000 10.954 0.000 {built-in method posix.stat}
1766 9.109 0.005 9.109 0.005 {method 'uniform_' of 'torch._C._TensorBase' objects}
386 7.326 0.019 7.326 0.019 {built-in method numpy.concatenate}
1989 6.604 0.003 6.604 0.003 {method 'cuda' of 'torch._C._TensorBase' objects}
145759 6.180 0.000 6.180 0.000 {method 'findall' of 're.Pattern' objects}
25863123 5.718 0.000 13.354 0.000 {built-in method builtins.isinstance}
2384 5.379 0.002 5.379 0.002 {method 'copy_' of 'torch._C._TensorBase' objects}
11924 5.325 0.000 5.325 0.000 {built-in method tensor}
11243096/72898 4.779 0.000 21.038 0.000 mixins.py:114(_build)
950 3.833 0.004 3.835 0.004 {built-in method io.open}
22511107 3.731 0.000 4.914 0.000 {built-in method _abc._abc_instancecheck}
5 3.572 0.714 3.572 0.714 {built-in method _pickle.dump}
2383 3.346 0.001 3.346 0.001 {method '_set_from_file' of 'torch._C.FloatStorageBase' objects}
83 3.283 0.040 3.283 0.040 {method 'execute' of 'sybpydb.Cursor' objects}
11206162 3.023 0.000 19.845 0.000 mixins.py:133(<genexpr>)
Is there any way I can speed this up? Especially this one:
{method 'cpu' of 'torch._C._TensorBase' objects}
I tried using pin_memory=True in the DataLoader, but it didn't make any difference. I'd really appreciate any suggestions.
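For context, the transfer pattern in my pipeline looks roughly like this (the model and dataset here are toy placeholders, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real document features (shapes are illustrative only).
dataset = TensorDataset(torch.randn(64, 16))
loader = DataLoader(dataset, batch_size=8, pin_memory=True)  # the pin_memory I tried

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)

results = []
with torch.no_grad():
    for (batch,) in loader:
        batch = batch.to(device)
        out = model(batch)
        # This device-to-host transfer is the `.cpu()` call dominating the profile.
        results.append(out.cpu())
```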