Hey guys, as per the title, I’m not sure if this is possible for my use case, but I thought I’d ask just in case. I’ve read the existing forum posts on CPU <==> GPU memory transfer speeds, but they don’t seem particularly applicable to my scenario, e.g.:
- How to maximize CPU <==> GPU memory transfer speeds? - #6 by aifartist
- Data Transfer slow from gpu to cpu
- A guide on good usage of non_blocking and pin_memory() in PyTorch — PyTorch Tutorials 2.4.0+cu121 documentation
Pinned memory seems like it can increase the speed but I’m not really sure if it’s applicable to my scenario.
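For reference, this is roughly what a pinned-memory GPU-to-CPU copy looks like. It is only a minimal sketch: `copy_to_pinned_host` and `example_output` are hypothetical names used for illustration, and the code falls back to an ordinary CPU buffer when CUDA is not available. Whether it helps in this scenario is exactly the open question.

```python
import torch


def copy_to_pinned_host(gpu_tensor: torch.Tensor) -> torch.Tensor:
    """Copy a tensor into a page-locked (pinned) host buffer when it lives on CUDA."""
    use_cuda = gpu_tensor.is_cuda
    # Pinned memory needs a CUDA-enabled setup, so only request it for CUDA tensors.
    host_buffer = torch.empty(
        gpu_tensor.shape, dtype=gpu_tensor.dtype, device="cpu", pin_memory=use_cuda
    )
    # With a pinned destination, non_blocking=True lets the copy run asynchronously.
    host_buffer.copy_(gpu_tensor, non_blocking=True)
    if use_cuda:
        # Wait for the asynchronous copy to finish before reading the buffer on the CPU.
        torch.cuda.synchronize()
    return host_buffer


device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical stand-in for a small m * n model output.
example_output = torch.arange(12, dtype=torch.float32, device=device).reshape(3, 4)
host_copy = copy_to_pinned_host(example_output)
rows = host_copy.tolist()  # one bulk conversion once the data is on the host
```

The buffer could in principle be allocated once and reused across batches of the same shape, since allocating pinned memory is itself relatively expensive.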
The following is general Python-based pseudocode of the process.
import torch

model = load_model().to("cuda")
tokeniser = load_tokeniser()
while True:
    data = fetch_from_postgres()
    if not data:
        break
    encoded_input = tokeniser(data).to("cuda")
    with torch.no_grad():
        # model output is an m * n tensor
        model_output = model(**encoded_input)
    # convert each row tensor to a list so it can be written to the database
    embeddings = [tensor.tolist() for tensor in model_output]
    write_to_postgres(embeddings)
Basically, I’m reading text records from a Postgres database, converting those records to embeddings, and writing the embeddings back to the database. I’ve benchmarked the different sections of the code to understand where most of the batch time goes, and it seems to be spent mainly on moving data from the GPU to the CPU in the tensor.tolist() call. I’ve also benchmarked calling model_output.to("cpu") before tensor.tolist(), and in that case the bulk of the time moves from tensor.tolist() to model_output.to("cpu").
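A caveat on timings like these: CUDA kernels are launched asynchronously, so a plain wall-clock timer around the forward pass can return before the GPU has finished, and the remaining compute time then shows up in the first call that forces a synchronization, such as .to("cpu") or .tolist(). The sketch below shows one way to time each stage with explicit synchronization. The `timed` helper and `stand_in_model` are hypothetical stand-ins, not the real model, and the code falls back to the CPU when CUDA is unavailable.

```python
import time

import torch


def timed(label, fn, device):
    """Run fn() and return (result, seconds), synchronizing CUDA around the call."""
    if device == "cuda":
        torch.cuda.synchronize()  # drain any previously queued GPU work
    start = time.perf_counter()
    result = fn()
    if device == "cuda":
        torch.cuda.synchronize()  # make sure this stage's GPU work has finished
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f}s")
    return result, elapsed


device = "cuda" if torch.cuda.is_available() else "cpu"
stand_in_model = torch.nn.Linear(64, 32).to(device)  # hypothetical stand-in model
batch = torch.randn(128, 64, device=device)

with torch.no_grad():
    output, t_forward = timed("forward", lambda: stand_in_model(batch), device)
host_output, t_copy = timed("to cpu", lambda: output.cpu(), device)
embeddings, t_list = timed("tolist", lambda: host_output.tolist(), device)
```

If the forward pass turns out to dominate once it is synchronized, the transfer itself may not be the bottleneck.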
The machine specifications are as follows:
CPU: 20 vCores
RAM: 128 GB
GPU: 46 GB
The average times observed per batch for the different blocks of code are:
fetch_from_postgres: < 1s
encode_input: < 1s
model_output: < 1s
tensor.tolist() for m items: ~20s
write_to_postgres: < 2s
Any ideas on whether the embedding process can be sped up, or is this just something I’ll have to put up with given the data transfer speeds between the GPU and CPU?