Speeding up data transfer between CPU and GPU

craig.matadeen · September 26, 2024, 11:05am

Hey guys, as per the topic title, I’m not sure if this is possible for my use case, but thought I’d ask just in case. I’ve read the existing posts on the forum about CPU <==> GPU memory transfer speeds, but they don’t seem particularly applicable to my scenario.

Pinned memory seems like it can increase the speed but I’m not really sure if it’s applicable to my scenario.

The following is general Python-based pseudocode of the process.

import torch

# load_model, load_tokeniser, fetch_from_postgres and write_to_postgres
# are placeholders for the real helpers.
model = load_model().to("cuda")
tokeniser = load_tokeniser()

while True:
    data = fetch_from_postgres()

    if not data:
        break

    encoded_input = tokeniser(data).to("cuda")

    with torch.no_grad():
        # model output is an m * n tensor
        model_output = model(**encoded_input)

    # convert tensors to lists to write to the database
    embeddings = [tensor.tolist() for tensor in model_output]

    write_to_postgres(embeddings)

Basically, I’m reading text records from a postgres database, converting those records to embeddings and writing the embeddings back to the postgres database. I’ve benchmarked the different sections of the code to understand where most of the batch time goes, and it seems to be spent mainly on moving data from the GPU to the CPU in the tensor.tolist() call. I’ve also tried calling model_output.to("cpu") before tensor.tolist(), and noticed that the bulk of the time then moves from tensor.tolist() to model_output.to("cpu").
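The two variants compared above can be sketched roughly as follows. This is only an illustration, assuming the model output is a single 2-D tensor; the random `model_output` here is a stand-in for the real embeddings, and the code falls back to the CPU when no GPU is available. The per-row version issues a separate device-to-host copy for each row, while the second version copies the whole tensor once and converts it on the host. Both produce the same nested list.

```python
import torch

# Assumption: a small random tensor stands in for the real m * n model output.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_output = torch.randn(8, 4, device=device)

# Variant 1: one .tolist() per row, i.e. one device-to-host copy per row.
per_row = [row.tolist() for row in model_output]

# Variant 2: a single explicit copy to the host, then one conversion.
single_copy = model_output.cpu().tolist()

print(per_row == single_copy)
```

Either way, the first call that needs the data on the host has to wait for the GPU to finish the work already queued for that tensor.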

The machine specifications are as follows:

CPU: 20 vCores
RAM: 128 GB
GPU: 46 GB

In terms of timings, the different blocks of code take on average:

fetch_from_postgres: < 1s
encode_input: < 1s
model_output: < 1s
tensor.tolist() for m items: ~20s
write_to_postgres: < 2s

Any ideas on whether the embedding process can be sped up, or is it just something I’ll have to put up with given the data transfer speeds between the CPU and GPU?

ptrblck · September 26, 2024, 7:51pm

CUDA kernels are launched asynchronously, so a host timer around the model call only measures the kernel launches and not their execution. The explicit .cpu() call or the implicit data transfer to the host via .tolist() will synchronize your code and will thus accumulate the GPU execution time of already scheduled kernels. To properly profile kernels you would need to synchronize the device before starting and stopping host timers.
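A minimal sketch of timing with synchronization could look like this. It assumes a plain matrix multiply as a stand-in for the model; the `timed` helper and the tensor size are made up for illustration, and the synchronization is skipped when no GPU is present.

```python
import time

import torch


def timed(label, fn, *args, **kwargs):
    """Run fn and return (result, seconds), synchronizing around the host timer."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for previously queued kernels first
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the work launched by fn itself
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f}s")
    return result, elapsed


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

with torch.no_grad():
    # Stand-in for the forward pass; now timed including kernel execution.
    y, t_compute = timed("matmul", torch.matmul, x, x)

# The transfer is timed separately, after the compute has finished.
rows, t_copy = timed("tolist", y.tolist)
```

With the synchronization in place, the compute time is attributed to the forward pass instead of showing up in the later .tolist() or .cpu() call.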

craig.matadeen · September 28, 2024, 6:12pm

Thanks for the response. From this I realised that the time is mainly in the embeddings generation section and that I was timing the different sections incorrectly.