Use Tensor Cores explicitly on non-DL code

I’m using PyTorch essentially as a GPU-accelerated NumPy, since it makes switching between CPU and GPU easy.
Now I’m trying to use Tensor Cores to accelerate some parts of the code that are mostly matmuls. Although I have wrapped them in torch.autocast(), the profiling results show that Tensor Cores are not actually used.

The data involved is cast to float32 first (actually complex64, mostly) before being handed to PyTorch and moved to CUDA, so I don’t think I’m violating this part of the docs:

Ops called with an explicit dtype=... argument are not eligible, and will produce output that respects the dtype argument.

I have the following questions:

  1. Would casting b and c to float16 and calling b @ c automatically lead to Tensor Core usage?
  2. Is there any way to use Tensor Cores explicitly, without autocast? E.g., calling something that explicitly dispatches to Tensor Cores?
  3. Do Tensor Cores work for complex data?
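Concretely, question 1 asks whether a plain half-precision matmul like the following would be dispatched to Tensor Cores (a sketch; the sizes are arbitrary, and it falls back to float32 on CPU so it also runs without a GPU):

```python
import torch

# Sketch for question 1: cast to float16 and matmul directly, no autocast.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

b = torch.randn(256, 512, device=device, dtype=dtype)
c = torch.randn(512, 128, device=device, dtype=dtype)
out = b @ c  # half-precision GEMM on GPU; cuBLAS picks the kernel
print(out.dtype, out.shape)
```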

Autocast should use mixed precision if the input data and parameters are in float32. Could you verify that your GPU actually has Tensor Cores?
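One way to check is via the device's CUDA compute capability (a sketch; Tensor Cores first appeared with capability 7.0 on Volta):

```python
import torch

# Tensor Cores are present on compute capability 7.0+
# (Volta, Turing, Ampere, and later).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability {major}.{minor}; "
          f"Tensor Cores: {(major, minor) >= (7, 0)}")
else:
    print("no CUDA device visible")
```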

Thanks for replying. I have a 2080 Ti, so there should be Tensor Cores on it. I can confirm that Tensor Cores are used when the tensors are float32, but not when they are complex64, so maybe the complex format is not supported for tft operations?
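That matches how autocast behaves: it only casts eligible real floating-point (float32) inputs down to the low-precision dtype, while complex64 tensors pass through unchanged, so a complex matmul never takes the half-precision Tensor Core path. A small sketch of the contrast (it also runs on CPU, where autocast uses bfloat16):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# complex64 inputs are not autocast-eligible and keep their dtype
a = torch.randn(40, 80, dtype=torch.complex64, device=device)
b = torch.randn(80, 40, dtype=torch.complex64, device=device)
with torch.autocast(device):
    c = a @ b
print(c.dtype)  # stays torch.complex64

# float32 inputs ARE cast down inside the autocast region
x = torch.randn(40, 80, device=device)
y = torch.randn(80, 40, device=device)
with torch.autocast(device):
    z = x @ y
print(z.dtype)  # float16 on CUDA, bfloat16 on CPU
```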

import torch
from torch import profiler

device = torch.device("cuda")
# a = (torch.randn((40, 80)) + 1j * torch.randn((40, 80))).to(device)
# b = (torch.randn((80, 40)) + 1j * torch.randn((80, 40))).to(device)
a = torch.randn((40, 80)).to(device)
b = torch.randn((80, 40)).to(device)

iters = 10
prof = profiler.profile(
    schedule=profiler.schedule(wait=0, warmup=2, active=iters - 2, repeat=1),
    with_stack=True,
    on_trace_ready=profiler.tensorboard_trace_handler("./tft", worker_name="tft_real"),
    profile_memory=True,
    record_shapes=True,
)

# start/stop once around the loop; step() advances the profiler schedule,
# so calling start()/stop() inside the loop would reset it every iteration
prof.start()
for _ in range(iters):
    with torch.autocast("cuda"):
        c = a @ b
    prof.step()
prof.stop()
print(a.dtype, c.dtype)