I’m trying to optimize a deployed application.
The data I’m working with is a list of strings that is already loaded into the application as a variable before the DataLoader is created.
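The dataset itself is just a thin wrapper around that in-memory list, something like this (simplified sketch; `texts` and the class name are illustrative):

```python
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Thin wrapper around an in-memory list of strings."""
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

dataset = TextDataset(texts)  # texts is the pre-loaded list of strings
```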
I’m using 9 models total. All are pretrained `AlbertForSequenceClassification` models.
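Each one is loaded along these lines (sketch; `"albert-base-v2"` stands in for the actual checkpoints):

```python
from transformers import AlbertForSequenceClassification, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2")
model.eval()  # inference only
```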
I’ve tried a few variations of num_workers counts and adding additional CPUs to the deployment.
Every time, num_workers=0 beats the alternatives by a large margin.
The larger the dataset, the smaller the speedup, but on my smallest data sample the run went from 10.89 seconds to 0.27 seconds.
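For the timings above, I measured roughly like this (sketch; `run_inference` is a stand-in name for the per-model loop shown below):

```python
import time
from functools import partial
from torch.utils.data import DataLoader

for workers in (0, 2, 4):
    dataloader = DataLoader(dataset, batch_size=128, shuffle=False,
                            num_workers=workers,
                            collate_fn=partial(prepare_sample, tokenizer=tokenizer))
    start = time.perf_counter()
    run_inference(model, dataloader)  # stand-in for the loop below
    print(f"num_workers={workers}: {time.perf_counter() - start:.2f}s")
```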
```python
import torch
from functools import partial
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset=dataset,
    shuffle=False,
    sampler=None,
    batch_size=128,
    collate_fn=partial(prepare_sample, tokenizer=tokenizer),
    num_workers=0,
)

results = []
for batch in dataloader:
    input_ids, attention_mask = batch
    input_ids = input_ids.to(model.device)
    attention_mask = attention_mask.to(model.device)
    with torch.no_grad():
        # .logits pulls the tensor out of the SequenceClassifierOutput
        logits = model(input_ids, attention_mask=attention_mask).logits
        _, pred = torch.max(logits, dim=1)
    results.append(pred)

prediction = torch.cat(results, dim=0).detach().cpu().numpy().tolist()
```
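For reference, `prepare_sample` is the collate function that does the tokenization. It looks roughly like this (simplified sketch from memory; the exact tokenizer arguments may differ):

```python
def prepare_sample(batch, tokenizer):
    # Tokenize a list of raw strings into padded tensors
    encoded = tokenizer(batch, padding=True, truncation=True,
                        return_tensors="pt")
    return encoded["input_ids"], encoded["attention_mask"]
```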
For each of the 9 models I call the above function, which iterates over the data and returns its predictions, so the same data is loaded and tokenized 9 times.
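In other words, the overall structure is roughly (sketch; `models` and `predict` are stand-in names):

```python
all_predictions = []
for model in models:  # the 9 pretrained classifiers
    all_predictions.append(predict(model, texts))  # rebuilds the DataLoader each time
```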
The majority of the execution time is spent on the forward pass, `model(input_ids, attention_mask=attention_mask)`.
I’m new to PyTorch, so there are probably some obvious changes here that I’m unaware of.