I’m using PyTorch basically as a GPU-accelerated NumPy, because it makes switching between CPU and GPU easy.
Now I’m trying to use tensor cores to accelerate the parts of the code that are mostly matmuls. Although I have wrapped them in torch.autocast(), the profiling results show that the tensor cores are not actually used.
The data involved is cast to float32 first (actually complex64, mostly) before being transferred to PyTorch and CUDA, so I don’t think I’m violating the part described in the docs:
Ops called with an explicit dtype=... argument are not eligible, and will produce output that respects the dtype argument.
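My understanding of that rule, sketched on CPU (CPU autocast uses bfloat16, but the eligibility rules are the same):

```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

# matmul is autocast-eligible, so inside the region it runs in the
# autocast dtype (bfloat16 on CPU).
with torch.autocast("cpu", dtype=torch.bfloat16):
    c = a @ b
print(c.dtype)  # torch.bfloat16

# An op called with an explicit dtype=... argument is NOT eligible and
# keeps the requested dtype, even inside the autocast region.
with torch.autocast("cpu", dtype=torch.bfloat16):
    s = torch.sum(a, dtype=torch.float32)
print(s.dtype)  # torch.float32
```

I’m not passing any explicit dtype argument to the matmuls, so this shouldn’t apply to my case.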
I have the following questions:
Would casting b and c to float16 and calling b @ c automatically lead to tensor core usage?
Are there any ways to use tensor cores explicitly, without autocast? E.g. some function I can call that is guaranteed to use them?
Thanks for replying. I have a 2080 Ti, so there should be tensor cores on it. I can confirm that the tensor cores are used when the tensors are float32 but not when they are complex64, so maybe the complex format is not supported for these operations?
import torch
import torch.profiler as profiler

device = torch.device("cuda")
# a = (torch.randn((40, 80)) + 1j * torch.randn((40, 80))).to(device)
# b = (torch.randn((80, 40)) + 1j * torch.randn((80, 40))).to(device)
a = torch.randn((40, 80)).to(device)
b = torch.randn((80, 40)).to(device)

iters = 10
prof = profiler.profile(
    schedule=profiler.schedule(wait=0, warmup=2, active=iters - 2, repeat=1),
    with_stack=True,
    on_trace_ready=profiler.tensorboard_trace_handler("./tft", worker_name="tft_real"),
    profile_memory=True,
    record_shapes=True,
)

# Start the profiler once, step it every iteration, stop it at the end.
prof.start()
for _ in range(iters):
    with torch.autocast("cuda"):
        c = a @ b
    prof.step()
prof.stop()
print(a.dtype, c.dtype)
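Since autocast seems to leave complex64 matmuls alone, one workaround I’m considering is decomposing the complex product into four real matmuls (which autocast or an explicit float16 cast could then accelerate). A sketch:

```python
import torch

torch.manual_seed(0)
a = torch.randn(40, 80, dtype=torch.complex64)
b = torch.randn(80, 40, dtype=torch.complex64)

# (A + iB) @ (C + iD) = (AC - BD) + i(AD + BC): four real float32
# matmuls, reassembled into a complex result.
ar, ai = a.real, a.imag
br, bi = b.real, b.imag
c = torch.complex(ar @ br - ai @ bi, ar @ bi + ai @ br)
print(torch.allclose(c, a @ b, atol=1e-4))  # True
```

The four real matmuls are autocast-eligible, so this would at least give the tensor cores something to work on, at the cost of extra kernel launches.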