Difference in inference time between CUDA 10.0 & 10.2

We have a working library that uses LibTorch 1.5.0, built with CUDA 10.0, which runs as expected. We are upgrading to CUDA 10.2 for reasons unrelated to PyTorch. When we run inference with the newly compiled LibTorch (built exactly the same way, except for the switch to CUDA 10.2), the runtime is about 20x slower.
We also reproduced this with the precompiled binaries. It was tested on 3 different machines with 3 different GPUs (Tesla T4, GTX 980 and P1000), on both Windows 10 and Ubuntu 16.04, all with the latest drivers, and all of them show a consistent ~20x slowdown on CUDA 10.2.

I’ve simplified the code to a minimal example with no external dependencies other than Torch.
I also tested it with multiple different TorchScript modules.

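#include <torch/script.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Inference function, defined below
bool infer(std::shared_ptr<torch::jit::Module>& jitModule, at::cuda::CUDAStream& stream, const uint8_t* inputData, int width, int height);

int main(int argc, char** argv)
{
	// Initialize CUDA device 0

	std::string networkPath = DEFAULT_TORCH_SCRIPT;
	if (argc > 1)
		networkPath = argv[1];
	auto jitModule = std::make_shared<torch::jit::Module>(torch::jit::load(networkPath, torch::kCUDA));
	if (jitModule == nullptr)
	{
		std::cerr << "Failed creating module" << std::endl;
		return EXIT_FAILURE;
	}

	// Meaningless data, just something to pass to the module to run on
	// PATCH_HEIGHT & PATCH_WIDTH are defined as 256
	uint8_t* data = new uint8_t[PATCH_HEIGHT * PATCH_WIDTH * 3];
	memset(data, 0, PATCH_HEIGHT * PATCH_WIDTH * 3);
	auto stream = at::cuda::getStreamFromPool(true, 0);

	// First call warms up the module
	bool res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);

	std::cout << "Warmed up" << std::endl;

	// Second call is the one we compare between CUDA versions
	res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);

	delete[] data;
	return res ? EXIT_SUCCESS : EXIT_FAILURE;
}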

// Inference function

bool infer(std::shared_ptr<torch::jit::Module>& jitModule, at::cuda::CUDAStream& stream, const uint8_t* inputData, int width, int height)
{
	std::vector<torch::jit::IValue> tensorInput;
	// prepareInput simply uses cudaMemcpy to copy the data to the device and creates a torch::Tensor from it.
	// I can paste it if it's relevant, but left it out for now to keep this as clean as possible.
	if (!prepareInput(inputData, width, height, tensorInput, stream))
		return false;
	// Reduce memory usage by running without gradients
	torch::NoGradGuard noGrad;
	{
		at::cuda::CUDAStreamGuard streamGuard(stream);
		auto totalTimeStart = std::chrono::high_resolution_clock::now();
		auto output = jitModule->forward(tensorInput);
		// The synchronize here is just for the sake of timing, it is not used in production
		stream.synchronize();
		auto totalTimeStop = std::chrono::high_resolution_clock::now();
		printf("forward sync time = %.3f milliseconds\n",
			std::chrono::duration<double, std::milli>(totalTimeStop - totalTimeStart).count());
	}
	return true;
}

When we build this against Torch compiled with CUDA 10.0, the timed forward pass takes 18 ms; with Torch compiled with CUDA 10.2, it takes 430 ms.
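
A single timed call after one warm-up can be noisy, so it may help to time several calls and compare medians between the two builds. Below is a minimal sketch using only the standard library. `timeCalls` and `medianMs` are hypothetical helper names (not LibTorch APIs), and the callable passed in is assumed to block until the GPU work has finished, for example a forward call followed by a stream synchronize.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of LibTorch): calls `fn` `warmup` times
// untimed, then `iters` times timed, and returns each timed call's duration
// in milliseconds. `fn` must block until its work is complete, otherwise
// only the launch overhead is measured.
inline std::vector<double> timeCalls(const std::function<void()>& fn,
                                     std::size_t warmup, std::size_t iters)
{
    for (std::size_t i = 0; i < warmup; ++i)
        fn();

    std::vector<double> ms;
    ms.reserve(iters);
    for (std::size_t i = 0; i < iters; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Median of the samples (taken by value so the caller's data is not
// reordered); returns 0.0 for an empty input.
inline double medianMs(std::vector<double> samples)
{
    if (samples.empty())
        return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1 ? samples[mid]
                                   : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

For example, wrapping the `infer` call in a lambda and passing it to `timeCalls` with 3 warm-up and 20 timed iterations would give one median per CUDA build, assuming `infer` synchronizes the stream before returning.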

Any thoughts on that?

I opened an issue for this on GitHub.

I proposed trying cudnn.benchmark mode in the issue.
Let’s continue the discussion there.