How should training time differ when using fp16 AMP, and should memory consumption decrease?

I have an RTX 2080 Ti and I am training this model.

import time
start = time.time()
epochs = 40

for epoch in range(epochs):
    train_accuracy = 0
    count = 0
    accumulated_loss = 0
    for train_in,train_label in tqdm(batch(X_train,Y_train)):
        #print(train_in)
        torch.cuda.synchronize()
        output = model(train_in.to(device))
        #print(output.shape)
        output = output.squeeze(1)
        #print(output.shape)
        train_label = train_label.squeeze(-1)
        loss = criterion(output,train_label.long().to(device))
        #with amp.scale_loss(loss, optimizer) as scaled_loss:
            #scaled_loss.backward()
        loss.backward()
        optimizer.step()
        count += 1
        train_accuracy += accuracy(output,train_label)
        accumulated_loss += loss#scaled_loss
        optimizer.zero_grad()

    print('Training Epoch {} with Accuracy :{} Loss :{}'.format(epoch+1,train_accuracy/count,accumulated_loss/count))
    with torch.no_grad():
        test_acc = 0
        test_count = 0
        for X_t,Y_t in tqdm(batch(X_test,Y_test)):
            test_output = model(X_t.to(device))
            test_output = test_output.squeeze(1)

            Y_t = Y_t.squeeze(-1)
            test_accuracy = accuracy(test_output,Y_t)
            test_acc += test_accuracy
            test_count += 1

    print('Testing Epoch {} with Accuracy :{}'.format(epoch+1,test_acc/test_count))

end = time.time()
print("time equal {}".format(end-start))

The running time for training without fp16 is 24 and with fp16 it is nearly 32. This message appears when I run the setup of the apex library:

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
from /usr/bin

Traceback (most recent call last):
  File "setup.py", line 100, in <module>
    check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
  File "setup.py", line 77, in check_cuda_torch_binary_vs_bare_metal
    "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 10.1.
In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).

I am using PyTorch 1.4 with CUDA 10.1. I know there is a difference in CUDA versions, but is that the only problem? And if there is a version mismatch, how can the model still train?
Can anybody help? I have been stuck on this for a very long time.
I am now downloading the CUDA 10.1 toolkit and will try to set it up; maybe that will fix this.
Any clarification?

You could disable the check for the minor CUDA version as suggested, but the better solution would be to install the matching CUDA version on your system.
Once this is done, you could rebuild apex with the CUDA extensions, which should give you performance benefits.
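To make the mismatch concrete: the failing check boils down to comparing the toolkit version that nvcc reports with the CUDA version the PyTorch binaries were built against. Below is a minimal sketch of that comparison. It is not apex's actual code; the helper names `parse_nvcc_release` and `cuda_versions_match` are made up for illustration, and on a real system the PyTorch side would come from `torch.version.cuda`.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of nvcc's version output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def cuda_versions_match(nvcc_output, torch_cuda_version):
    """Return True when nvcc and the PyTorch build agree on major.minor."""
    torch_major, torch_minor = (int(p) for p in torch_cuda_version.split(".")[:2])
    return parse_nvcc_release(nvcc_output) == (torch_major, torch_minor)


# Version line from the failing build log in this thread.
old_toolkit = "Cuda compilation tools, release 9.1, V9.1.85"
# Version line a CUDA 10.1 toolkit reports.
new_toolkit = "Cuda compilation tools, release 10.1, V10.1.105"

print(cuda_versions_match(old_toolkit, "10.1"))  # False
print(cuda_versions_match(new_toolkit, "10.1"))  # True
```

With the 9.1 toolkit the comparison fails, which is why the setup script refuses to compile the extensions.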

My NVIDIA driver version is 435 and the PyTorch CUDA version is 10.1. I can't find a CUDA toolkit that supports both versions. Is there any solution?
I was using CUDA toolkit 9 when I got the results above; can it work properly?
I am new to the CUDA world.

CUDA 10.1 will work with this driver, as seen in this table.
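The table lists, for each toolkit release, the minimum driver version it needs, so the question is only whether your driver is at least that new. A rough sketch of the lookup follows. The minimum versions in the dictionary are recalled from NVIDIA's CUDA release notes for Linux and should be treated as assumptions to verify against the table; `driver_supports` is a made-up helper name.

```python
# Minimum Linux driver per CUDA toolkit, recalled from NVIDIA's release notes.
# These values are assumptions for illustration; verify them against the table.
MIN_LINUX_DRIVER = {
    "9.1": (390, 46),
    "10.0": (410, 48),
    "10.1": (418, 39),
    "10.2": (440, 33),
}


def driver_supports(driver_version, cuda_version):
    """Return True if the installed driver meets the toolkit's minimum."""
    parts = tuple(int(p) for p in driver_version.split("."))
    # Pad a bare major version such as "435" to (435, 0) for comparison.
    parts = (parts + (0,))[:2]
    return parts >= MIN_LINUX_DRIVER[cuda_version]


print(driver_supports("435", "10.1"))  # True: 435 is newer than 418.39
print(driver_supports("435", "10.2"))  # False: 10.2 needs a 440-series driver
```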

If I install this CUDA version, should I get the speedup you talked about in your GitHub post https://github.com/ptrblck/apex/blob/apex_tutorial/tutorials/apex_mixed_precision_intro.ipynb ?
Also, how is the library able to train while it is not fully installed or working? Whether I use O1, O2, …, it still trains.

If you didn’t build apex with the CUDA extensions, the methods will “work” but might be slow.
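One quick way to see whether the compiled extensions ended up in your environment is to ask Python whether it can locate them. The sketch below assumes the extension module names that a successful apex build reports in its "creating stub loader for ..." lines (`apex_C`, `amp_C`, `syncbn`, `fused_layer_norm_cuda`); `find_compiled_extensions` is a hypothetical helper, not part of apex.

```python
import importlib.util

# Extension modules an apex build with --cpp_ext --cuda_ext creates stub loaders for.
CUDA_EXTENSION_MODULES = ("apex_C", "amp_C", "syncbn", "fused_layer_norm_cuda")


def find_compiled_extensions(names=CUDA_EXTENSION_MODULES):
    """Map each module name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = find_compiled_extensions()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A "found" result only says the module is installed; it does not prove the extension was built against the right CUDA version or that it will make a given model faster.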

--cpp_ext --cuda_ext
torch.__version__ = 1.4.0
setup.py:43: UserWarning: Option --pyprof not specified. Not installing PyProf dependencies!
  warnings.warn("Option --pyprof not specified. Not installing PyProf dependencies!")

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
from /usr/local/cuda/bin

running install
running bdist_egg
running egg_info
creating apex.egg-info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
writing manifest file 'apex.egg-info/SOURCES.txt'
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/apex
copying apex/__init__.py -> build/lib.linux-x86_64-3.7/apex
creating build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/multiproc.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/__init__.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/LARC.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/distributed.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/sync_batchnorm_kernel.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/optimized_sync_batchnorm_kernel.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/optimized_sync_batchnorm.py -> build/lib.linux-x86_64-3.7/apex/parallel
copying apex/parallel/sync_batchnorm.py -> build/lib.linux-x86_64-3.7/apex/parallel
creating build/lib.linux-x86_64-3.7/apex/contrib
copying apex/contrib/__init__.py -> build/lib.linux-x86_64-3.7/apex/contrib
creating build/lib.linux-x86_64-3.7/apex/optimizers
copying apex/optimizers/fused_novograd.py -> build/lib.linux-x86_64-3.7/apex/optimizers
copying apex/optimizers/__init__.py -> build/lib.linux-x86_64-3.7/apex/optimizers
copying apex/optimizers/fused_adam.py -> build/lib.linux-x86_64-3.7/apex/optimizers
copying apex/optimizers/fused_lamb.py -> build/lib.linux-x86_64-3.7/apex/optimizers
copying apex/optimizers/fused_sgd.py -> build/lib.linux-x86_64-3.7/apex/optimizers
creating build/lib.linux-x86_64-3.7/apex/amp
copying build/lib.linux-x86_64-3.7/apex/optimizers/fused_sgd.py -> build/bdist.linux-x86_64/egg/apex/optimizers
creating build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/scaler.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/compat.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/__init__.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/__version__.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/_process_optimizer.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/handle.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/amp.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/_amp_state.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/_initialize.py -> build/bdist.linux-x86_64/egg/apex/amp
creating build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.7/apex/amp/lists/tensor_overrides.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.7/apex/amp/lists/__init__.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.7/apex/amp/lists/torch_overrides.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.7/apex/amp/lists/functional_overrides.py -> build/bdist.linux-x86_64/egg/apex/amp/lists
copying build/lib.linux-x86_64-3.7/apex/amp/rnn_compat.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/opt.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/frontend.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/wrap.py -> build/bdist.linux-x86_64/egg/apex/amp
copying build/lib.linux-x86_64-3.7/apex/amp/utils.py -> build/bdist.linux-x86_64/egg/apex/amp
creating build/bdist.linux-x86_64/egg/apex/multi_tensor_apply
copying build/lib.linux-x86_64-3.7/apex/multi_tensor_apply/__init__.py -> build/bdist.linux-x86_64/egg/apex/multi_tensor_apply
copying build/lib.linux-x86_64-3.7/apex/multi_tensor_apply/multi_tensor_apply.py -> build/bdist.linux-x86_64/egg/apex/multi_tensor_apply
creating build/bdist.linux-x86_64/egg/apex/pyprof
copying build/lib.linux-x86_64-3.7/apex/pyprof/__init__.py -> build/bdist.linux-x86_64/egg/apex/pyprof
creating build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/utility.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/recurrentCell.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/normalization.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/data.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/__main__.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/__init__.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/pointwise.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/linear.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/base.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/usage.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/convert.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/randomSample.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/activation.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/output.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/optim.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/misc.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/reduction.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/pooling.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/embedding.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/index_slice_join_mutate.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/prof.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/blas.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/dropout.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/softmax.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/loss.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
copying build/lib.linux-x86_64-3.7/apex/pyprof/prof/conv.py -> build/bdist.linux-x86_64/egg/apex/pyprof/prof
creating build/bdist.linux-x86_64/egg/apex/pyprof/nvtx
copying build/lib.linux-x86_64-3.7/apex/pyprof/nvtx/__init__.py -> build/bdist.linux-x86_64/egg/apex/pyprof/nvtx
copying build/lib.linux-x86_64-3.7/apex/pyprof/nvtx/nvmarker.py -> build/bdist.linux-x86_64/egg/apex/pyprof/nvtx
creating build/bdist.linux-x86_64/egg/apex/pyprof/parse
copying build/lib.linux-x86_64-3.7/apex/pyprof/parse/db.py -> build/bdist.linux-x86_64/egg/apex/pyprof/parse
copying build/lib.linux-x86_64-3.7/apex/pyprof/parse/__main__.py -> build/bdist.linux-x86_64/egg/apex/pyprof/parse
copying build/lib.linux-x86_64-3.7/apex/pyprof/parse/__init__.py -> build/bdist.linux-x86_64/egg/apex/pyprof/parse
copying build/lib.linux-x86_64-3.7/apex/pyprof/parse/parse.py -> build/bdist.linux-x86_64/egg/apex/pyprof/parse
copying build/lib.linux-x86_64-3.7/apex/pyprof/parse/kernel.py -> build/bdist.linux-x86_64/egg/apex/pyprof/parse
copying build/lib.linux-x86_64-3.7/apex/pyprof/parse/nvvp.py -> build/bdist.linux-x86_64/egg/apex/pyprof/parse
creating build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.7/apex/RNN/__init__.py -> build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.7/apex/RNN/RNNBackend.py -> build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.7/apex/RNN/cells.py -> build/bdist.linux-x86_64/egg/apex/RNN
copying build/lib.linux-x86_64-3.7/apex/RNN/models.py -> build/bdist.linux-x86_64/egg/apex/RNN
creating build/bdist.linux-x86_64/egg/apex/normalization
copying build/lib.linux-x86_64-3.7/apex/normalization/__init__.py -> build/bdist.linux-x86_64/egg/apex/normalization
copying build/lib.linux-x86_64-3.7/apex/normalization/fused_layer_norm.py -> build/bdist.linux-x86_64/egg/apex/normalization
creating build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.7/apex/reparameterization/weight_norm.py -> build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.7/apex/reparameterization/__init__.py -> build/bdist.linux-x86_64/egg/apex/reparameterization
copying build/lib.linux-x86_64-3.7/apex/reparameterization/reparameterization.py -> build/bdist.linux-x86_64/egg/apex/reparameterization
creating build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.7/apex/fp16_utils/__init__.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.7/apex/fp16_utils/loss_scaler.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.7/apex/fp16_utils/fp16util.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.7/apex/fp16_utils/fp16_optimizer.py -> build/bdist.linux-x86_64/egg/apex/fp16_utils
copying build/lib.linux-x86_64-3.7/apex_C.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.7/syncbn.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
byte-compiling build/bdist.linux-x86_64/egg/apex/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/multiproc.py to multiproc.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/LARC.py to LARC.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/distributed.py to distributed.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/sync_batchnorm_kernel.py to sync_batchnorm_kernel.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/optimized_sync_batchnorm_kernel.py to optimized_sync_batchnorm_kernel.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/optimized_sync_batchnorm.py to optimized_sync_batchnorm.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/parallel/sync_batchnorm.py to sync_batchnorm.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/optimizers/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/optimizers/fused_adam.py to fused_adam.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/optimizers/fused_sgd.py to fused_sgd.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/optimizers/fp16_optimizer.py to fp16_optimizer.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/xentropy/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/xentropy/softmax_xentropy.py to softmax_xentropy.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/groupbn/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/contrib/groupbn/batch_norm.py to batch_norm.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/fused_novograd.py to fused_novograd.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/fused_adam.py to fused_adam.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/fused_lamb.py to fused_lamb.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/optimizers/fused_sgd.py to fused_sgd.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/scaler.py to scaler.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/compat.py to compat.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/__version__.py to __version__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/_process_optimizer.py to _process_optimizer.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/handle.py to handle.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/amp.py to amp.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/_amp_state.py to _amp_state.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/_initialize.py to _initialize.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/tensor_overrides.py to tensor_overrides.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/torch_overrides.py to torch_overrides.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/lists/functional_overrides.py to functional_overrides.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/rnn_compat.py to rnn_compat.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/opt.py to opt.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/frontend.py to frontend.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/wrap.py to wrap.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/amp/utils.py to utils.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/multi_tensor_apply/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/multi_tensor_apply/multi_tensor_apply.py to multi_tensor_apply.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/utility.py to utility.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/recurrentCell.py to recurrentCell.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/normalization.py to normalization.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/data.py to data.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/__main__.py to __main__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/pointwise.py to pointwise.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/linear.py to linear.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/base.py to base.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/usage.py to usage.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/convert.py to convert.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/randomSample.py to randomSample.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/activation.py to activation.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/output.py to output.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/optim.py to optim.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/misc.py to misc.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/reduction.py to reduction.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/pooling.py to pooling.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/embedding.py to embedding.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/index_slice_join_mutate.py to index_slice_join_mutate.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/prof.py to prof.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/blas.py to blas.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/dropout.py to dropout.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/softmax.py to softmax.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/loss.py to loss.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/prof/conv.py to conv.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/nvtx/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/nvtx/nvmarker.py to nvmarker.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/parse/db.py to db.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/parse/__main__.py to __main__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/parse/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/parse/parse.py to parse.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/parse/kernel.py to kernel.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/pyprof/parse/nvvp.py to nvvp.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/RNNBackend.py to RNNBackend.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/cells.py to cells.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/RNN/models.py to models.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/normalization/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/normalization/fused_layer_norm.py to fused_layer_norm.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/reparameterization/weight_norm.py to weight_norm.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/reparameterization/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/reparameterization/reparameterization.py to reparameterization.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/__init__.py to __init__.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/loss_scaler.py to loss_scaler.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/fp16util.py to fp16util.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/apex/fp16_utils/fp16_optimizer.py to fp16_optimizer.cpython-37.pyc
creating stub loader for apex_C.cpython-37m-x86_64-linux-gnu.so
creating stub loader for amp_C.cpython-37m-x86_64-linux-gnu.so
creating stub loader for syncbn.cpython-37m-x86_64-linux-gnu.so
creating stub loader for fused_layer_norm_cuda.cpython-37m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/apex_C.py to apex_C.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/amp_C.py to amp_C.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/syncbn.py to syncbn.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/fused_layer_norm_cuda.py to fused_layer_norm_cuda.cpython-37.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying apex.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.amp_C.cpython-37: module references __file__
__pycache__.apex_C.cpython-37: module references __file__
__pycache__.fused_layer_norm_cuda.cpython-37: module references __file__
__pycache__.syncbn.cpython-37: module references __file__
apex.pyprof.nvtx.__pycache__.nvmarker.cpython-37: module references __file__
apex.pyprof.nvtx.__pycache__.nvmarker.cpython-37: module references __path__
creating dist
creating 'dist/apex-0.1-py3.7-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing apex-0.1-py3.7-linux-x86_64.egg
creating /home/ahmed/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg
Extracting apex-0.1-py3.7-linux-x86_64.egg to /home/ahmed/anaconda3/lib/python3.7/site-packages
Adding apex 0.1 to easy-install.pth file

Installed /home/ahmed/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg
Processing dependencies for apex==0.1
Finished processing dependencies for apex==0.1

Does this mean it's working? Should I restart my model to check for speedups? Thank you very much for your patience, sir. Really appreciated.

@ptrblck no speedup. I restarted the notebook and am using the O1 level. Maybe this is because the model is not large? I am using a batch size of 512.

Might be. What model are you using at the moment?

Also, it seems you are storing the computation graph in this line of code:

accumulated_loss += loss#scaled_loss

If you don’t need to call backward on accumulated_loss, make sure to use accumulated_loss += loss.detach().
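As an illustration of the accumulation pattern, here is a minimal sketch with a tiny stand-in model on the CPU. The model, data, and hyperparameters are placeholders chosen only so the snippet runs on its own; they are not taken from your training script.

```python
import torch

# Tiny stand-in model and data on the CPU, only to show the accumulation pattern.
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulated_loss = 0.0
for _ in range(3):
    inputs = torch.randn(8, 4)
    labels = torch.randint(0, 2, (8,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

    # detach() returns a tensor without grad history, so the running sum
    # does not keep every iteration's graph alive.
    accumulated_loss += loss.detach()

print(accumulated_loss.requires_grad)  # False
print(float(accumulated_loss) / 3)     # mean loss over the three steps
```

If you only need the number for logging, `loss.item()` is an alternative that gives a plain Python float, at the cost of a host-device synchronization on the GPU.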

I am using this model.

import torch
import torch.nn as nn
import torch.nn.functional as f

class CNN_TC(nn.Module):
    def __init__(self,googleNews_vectors,Filters):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(googleNews_vectors,freeze=True)
        self.embedding.weight.requires_grad = False
        self.conv1 = nn.Conv1d(1,1,kernel_size=(Filters[0] * 300),stride=(300))
        self.conv2 = nn.Conv1d(1,1,kernel_size=(Filters[1] * 300),stride=(300))
        self.conv3 = nn.Conv1d(1,1,kernel_size=(Filters[2] * 300),stride=(300))

        self.max_conv1 = nn.MaxPool1d(kernel_size=(15))
        self.max_conv2 = nn.MaxPool1d(kernel_size=(11))
        self.max_conv3 = nn.MaxPool1d(kernel_size=(9))

        self.FFN1 = nn.Linear(9,100)
        self.FFN2 = nn.Linear(100,2)

    def forward(self,inputs):
        X = self.embedding(inputs)
        X = X.view(X.shape[0],1,-1).contiguous()

        max_feature1 = self.max_conv1(f.relu(self.conv1(X)))
        max_feature2 = self.max_conv2(f.relu(self.conv2(X)))
        max_feature3 = self.max_conv3(f.relu(self.conv3(X)))
        features = torch.cat((max_feature1,max_feature2,max_feature3), -1).squeeze(0)

        output = self.FFN2(f.relu(self.FFN1(features)))
        return output

Could you post the shape of your input so that we could profile it, please?

@ptrblck the shape of the input is [512, 46]. You can remove the embedding layer and make it [512, 46, 300] if you don't want to download the Google News negative-300 vectors.

[512, 46, 300] doesn’t work, as the first conv expects the input to have a single input channel.
If I use a single input channel, the activation is too small after a certain layer:

Given input size: (1x1x1). Calculated output size: (1x1x0). Output size is too small

Using an input of x = torch.randint(0, 100, (512, 46)) and initializing the embedding via nn.Embedding(100, 300), yields:

size mismatch, m1: [512 x 12], m2: [9 x 100] at /opt/conda/conda-bld/pytorch_1581149526594/work/aten/src/TH/generic/THTensorMath.cpp:41

Could you post an executable code snippet, please?

The second line in the forward function transforms the input to [512, 1, 46*300].
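The shape bookkeeping can be checked by hand with the standard output-length formulas for Conv1d (no padding, dilation 1) and MaxPool1d (stride equal to the kernel size). The sketch below does that arithmetic in plain Python. The value Filters = (3, 4, 5) is an assumption, since the thread never states it, and the function names are made up for illustration.

```python
def conv1d_out_len(length, kernel_size, stride):
    """Output length of nn.Conv1d with no padding and dilation 1."""
    return (length - kernel_size) // stride + 1


def maxpool1d_out_len(length, kernel_size):
    """Output length of nn.MaxPool1d, whose stride defaults to kernel_size."""
    return (length - kernel_size) // kernel_size + 1


def feature_count(seq_len, emb_dim, filters, pool_sizes=(15, 11, 9)):
    """Total features after concatenating the three pooled conv branches."""
    flat_len = seq_len * emb_dim  # length after X.view(batch, 1, -1)
    total = 0
    for words, pool in zip(filters, pool_sizes):
        conv_len = conv1d_out_len(flat_len, words * emb_dim, emb_dim)
        total += maxpool1d_out_len(conv_len, pool)
    return total


# Filters = (3, 4, 5) is an assumption; the thread never states the values.
print(feature_count(46, 300, (3, 4, 5)))  # 9, the input width of nn.Linear(9, 100)
# One setting that would give the width seen in the size-mismatch error.
print(feature_count(46, 300, (1, 1, 1)))  # 12
```

So whether the concatenated features match nn.Linear(9, 100) depends on the Filters argument, which is why an executable snippet with concrete values helps.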

After fixing an issue with squeeze (I changed it to .squeeze(1)), I get 742.706 us for a single forward pass. Since the time is really low due to the small workload, you unfortunately won't see any speedup.

If your workload is a bit higher (e.g. resnet50), you should see a speedup of approx. 2x.
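For reference, a per-call timing like the one above can be taken with a small helper along these lines. This is only a sketch: `time_per_call` is a made-up name, and the workload here is a CPU stand-in so the snippet runs anywhere. For GPU code you would pass `torch.cuda.synchronize` as `sync`, because CUDA kernels are launched asynchronously and the clock would otherwise be read too early.

```python
import time


def time_per_call(fn, sync=None, warmup=3, iters=20):
    """Average wall-clock seconds per call of fn().

    sync is an optional callable run before reading the clock; for GPU work
    it would be torch.cuda.synchronize, since CUDA kernels launch asynchronously.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


# CPU stand-in workload so the sketch runs anywhere.
elapsed = time_per_call(lambda: sum(i * i for i in range(10_000)))
print(f"{elapsed * 1e6:.1f} us per call")
```

When a single forward pass takes well under a millisecond, launch overhead and Python time make up most of the measurement, so faster fp16 kernels have little to improve.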

@ptrblck thank you, I will try it then. Should I use O1 or O3 for the opt_level? I mean, does fp16 always train faster, or is that not always the case?

You should use opt_level='O1' as this is the safest mixed-precision mode.
For cudnn>=7.3 you should see a speedup for convolutions, while matrix multiplications require a shape of multiples of 8 to be executed on the TensorCores (post with some details).
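To see how the multiple-of-8 rule plays out for the linear layers in your model, a small check like the following can help. It is a sketch under one assumption: that all three matrix multiplication dimensions (batch, input features, output features) need to be multiples of 8. The helper names are made up for illustration.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def gemm_dims_aligned(m, k, n, multiple=8):
    """True if all three GEMM dimensions are multiples of `multiple`."""
    return all(dim % multiple == 0 for dim in (m, k, n))


# nn.Linear(9, 100) with a batch of 512: a [512 x 9] @ [9 x 100] product.
print(gemm_dims_aligned(512, 9, 100))  # False: 9 and 100 are not multiples of 8
print(round_up(9), round_up(100))      # 16 104
print(gemm_dims_aligned(512, round_up(9), round_up(100)))  # True
```

Padding layer sizes up to the next multiple of 8 is one common way to satisfy the requirement, though for layers this small the gain is likely negligible.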

So my Tensor Cores won't be used for anything that is not a multiple of 8? I thought they were triggered on their own. And for cudnn, is there a certain command to make it use the Tensor Cores, or is the multiple of 8 enough?
To check whether I understand this correctly: if I make the embeddings 256 instead of 300, all the conv layers become 3x256, 4x256, …, which are multiples of 8. Is this correct? And does this contribute to the speedup?

Convolutions do not have the shape requirements starting with cudnn7.3, while matrix multiplications still have it.
Embedding layers are a lookup table so the dense embedded output will be used in the next layers, no?

I mean that when I change the embedding size to 256, the first conv1d layer multiplies a [256*3] kernel with 3 words from the length-46 sequence, each with embedding size 256, so the whole operation is a multiple of 8. Or am I still not getting it right? Otherwise, what do you recommend changing to benefit from the Tensor Core speedup?