PyTorch Profiler Kineto is not available

Hello!
I want to use the PyTorch profiler as in this example:

But I get this error:

AssertionError: Requested Kineto profiling but Kineto is not available, make sure PyTorch is built with USE_KINETO=1

I installed PyTorch with this command:
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

I’m unable to reproduce the issue using the 1.8.1 conda binaries and pip wheels and this simple code:

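import torch
import torch.nn as nn

# Tiny input and model on the GPU, just enough to launch CUDA kernels
x = torch.randn(1, 1).cuda()
lin = nn.Linear(1, 1).cuda()

# Record both CPU-side operators and CUDA kernels
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA]
) as p:
    for _ in range(10):
        out = lin(x)
# Print all rows, sorted by time spent in CUDA kernels
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))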

Both show a valid output:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            aten::addmm         0.68%      22.190ms        99.98%        3.239s     323.913ms     166.000us       100.00%     166.000us      16.600us            10  
void cutlass::Kernel<cutlass_80_tensorop_s1688gemm_6...         0.00%       0.000us         0.00%       0.000us       0.000us     142.000us        85.54%     142.000us      14.200us            10  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      21.000us        12.65%      21.000us       2.100us            10  
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         1.81%       3.000us       3.000us             1  
                                           aten::linear         0.02%     544.000us       100.00%        3.240s     323.988ms       0.000us         0.00%     166.000us      16.600us            10  
                                                aten::t         0.00%     134.000us         0.01%     204.000us      20.400us       0.000us         0.00%       0.000us       0.000us            10  
                                        aten::transpose         0.00%      50.000us         0.00%      70.000us       7.000us       0.000us         0.00%       0.000us       0.000us            10  
                                       aten::as_strided         0.00%      24.000us         0.00%      24.000us       1.200us       0.000us         0.00%       0.000us       0.000us            20  
                                            aten::empty         0.00%      60.000us         0.00%      60.000us       6.000us       0.000us         0.00%       0.000us       0.000us            10  
                                           aten::expand         0.00%      23.000us         0.00%      27.000us       2.700us       0.000us         0.00%       0.000us       0.000us            10  
                                          aten::resize_         0.00%      72.000us         0.00%      72.000us       7.200us       0.000us         0.00%       0.000us       0.000us            10  
                                        cudaMemcpyAsync        72.69%        2.355s        72.69%        2.355s     235.493ms       0.000us         0.00%       0.000us       0.000us            10  
                                               cudaFree        26.57%     860.777ms        26.57%     860.777ms     286.926ms       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.00%       2.000us         0.00%       2.000us       0.167us       0.000us         0.00%       0.000us       0.000us            12  
                                             cudaMalloc         0.03%     943.000us         0.03%     943.000us     235.750us       0.000us         0.00%       0.000us       0.000us             4  
                                             cudaMemcpy         0.00%      28.000us         0.00%      28.000us      28.000us       0.000us         0.00%       0.000us       0.000us             1  
                                   cudaFuncSetAttribute         0.00%       9.000us         0.00%       9.000us       0.005us       0.000us         0.00%       0.000us       0.000us          1660  
                               cudaEventCreateWithFlags         0.00%       3.000us         0.00%       3.000us       0.167us       0.000us         0.00%       0.000us       0.000us            18  
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.00%       2.000us         0.00%       2.000us       0.100us       0.000us         0.00%       0.000us       0.000us            20  
                                       cudaLaunchKernel         0.00%      91.000us         0.00%      91.000us       9.100us       0.000us         0.00%       0.000us       0.000us            10  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.240s
Self CUDA time total: 166.000us

I get the same issue with your code.

My settings are:
python: 3.9.5
OS: Windows 10
cuda: 11.0
GPU: RTX 3060Ti

I’m unsure whether Kineto is available on Windows, as I was using Linux.
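
One way to check a given install directly, rather than guessing from the OS, might be to ask the build whether Kineto was compiled in. The sketch below assumes that torch.autograd.kineto_available() is the check behind the AssertionError quoted above; kineto_status is a hypothetical helper name, and the getattr guard is there in case a build does not expose that function.

```python
def kineto_status():
    """Report whether the installed PyTorch build appears to include Kineto."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # Assumption: this function reflects the USE_KINETO build flag.
    # getattr guards against builds that do not expose it.
    check = getattr(torch.autograd, "kineto_available", None)
    if check is None:
        return "unknown"
    return "available" if check() else "not available"


print(kineto_status())
```

If this prints "not available" on the Windows wheels and "available" on the Linux ones, that would point to the binaries rather than the user code.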

3060 Ti? How did you get it? :thinking:

I have the same issue, also on the same versions as @kjnm.

Can someone confirm whether Kineto is supposed to be available on Windows?
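
Until that is confirmed, a possible workaround is the older torch.autograd.profiler.profile API. This is a sketch under the assumption that, in 1.8.1, this API does not go through Kineto unless explicitly asked to; it was not tested on the Windows setup described above. It falls back to CPU-only profiling when no GPU is visible.

```python
import torch
import torch.nn as nn

# Use the GPU when present, otherwise profile on CPU only
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 1, device=device)
lin = nn.Linear(1, 1).to(device)

# Older autograd profiler; use_cuda asks it to time CUDA work as well
with torch.autograd.profiler.profile(use_cuda=(device == "cuda")) as prof:
    for _ in range(10):
        out = lin(x)

# Show the ten most expensive operators by self CPU time
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The output table has fewer CUDA runtime rows than the one above, but per-operator CPU and CUDA times should still be reported.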