I am trying to use either distributed or parallel training with fastai on SageMaker notebooks or training jobs (I am more or less locked into this service by my team). I am running the code on an ml.p3.8xlarge with 4x V100 GPUs, but I cannot get a speedup from any of the approaches I have tried.
After spinning up the ml.p3.8xlarge notebook instance, here is the setup in my notebook using the pytorch env:
%%bash
pip install fastai==2.0.0 fastcore==1.0.0
sudo mkdir -p /opt/ml/input/data/collab
sudo chmod 777 /opt/ml/input/data/collab
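As a quick sanity check that the environment actually sees all four V100s, I use plain torch calls (nothing fastai-specific; this matches the device_count of 4 printed further down):

import torch
print(torch.cuda.is_available())        # expect True
print(torch.cuda.device_count())        # expect 4 on ml.p3.8xlarge
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))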
Here is the code I am testing:
import fastai, fastcore, torch
print(f'fastai {fastai.__version__}')
print(f'fastcore {fastcore.__version__}')
print(f'torch {torch.__version__}')
from fastai.collab import *
from fastai.tabular.all import *
from fastai.distributed import *
path = untar_data(URLs.ML_100k, dest="/opt/ml/input/data/collab")
ratings = pd.read_csv(
    path/'u.data',
    delimiter='\t',
    header=None,
    names=['user','movie','rating','timestamp']
)
movies = pd.read_csv(
    path/'u.item',
    delimiter='|',
    encoding='latin-1',
    usecols=(0,1),
    names=['movie','title'],
    header=None,
)
ratings = ratings.merge(movies)
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 64
model = EmbeddingDotBias(n_factors, n_users, n_movies)
learn = Learner(dls, model, loss_func=MSELossFlat())
print(learn.model)
print("rank_distrib():", rank_distrib())
print("num_distrib():", num_distrib())
print("torch.cuda.device_count():", torch.cuda.device_count())
epochs, lr = 3, 5e-3
print('learn.fit_one_cycle')
learn.fit_one_cycle(epochs, lr)
print('with learn.distrib_ctx():')
with learn.distrib_ctx():                              # distributed context, default args
    learn.fit_one_cycle(epochs, lr)
print('with learn.distrib_ctx(torch.cuda.device_count()-1):')
with learn.distrib_ctx(torch.cuda.device_count()-1):   # distributed context, pointed at a specific GPU index
    learn.fit_one_cycle(epochs, lr)
print('with learn.parallel_ctx():')
with learn.parallel_ctx():                             # fastai's data-parallel context manager
    learn.fit_one_cycle(epochs, lr)
print('nn.DataParallel(learn.model)')
if torch.cuda.device_count() > 1:                      # DataParallel applied directly to the model
    learn.model = nn.DataParallel(learn.model)
    learn.fit_one_cycle(epochs, lr)
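For context on the rank_distrib() / num_distrib() prints above: as far as I can tell from the fastai source, these just report the RANK and WORLD_SIZE environment variables, so they stay at 0 unless the script is started by a launcher that sets them for each process. Roughly (my paraphrase, not the actual fastai code):

import os

# my paraphrase of what the fastai helpers report (not the real source):
# num_distrib()  -> how many processes a distributed launcher started (0 if none)
# rank_distrib() -> this process's rank within that group
def num_distrib_paraphrased():
    return int(os.environ.get('WORLD_SIZE', 0))

def rank_distrib_paraphrased():
    return int(os.environ.get('RANK', 0))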
Here is the output from running the code as a script:
sh-4.2$ /home/ec2-user/anaconda3/envs/pytorch_p36/bin/python /home/ec2-user/SageMaker/cf.py
fastai 2.0.0
fastcore 0.1.39
torch 1.6.0
EmbeddingDotBias(
  (u_weight): Embedding(944, 64)
  (i_weight): Embedding(1665, 64)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)
rank_distrib(): 0
num_distrib(): 0
torch.cuda.device_count(): 4
learn.fit_one_cycle
epoch train_loss valid_loss time
0 1.153435 1.154428 00:11
1 0.957201 0.954827 00:11
2 0.816548 0.878350 00:11
with learn.distrib_ctx():
epoch train_loss valid_loss time
0 0.999254 1.040871 00:11
1 0.821853 0.914921 00:11
2 0.658059 0.845227 00:11
with learn.distrib_ctx(torch.cuda.device_count()-1):
epoch train_loss valid_loss time
0 0.749317 0.997568 00:11
1 0.580846 0.912386 00:11
2 0.381058 0.878295 00:11
with learn.parallel_ctx():
epoch train_loss valid_loss time
0 0.514148 1.025872 00:25
1 0.383893 0.996381 00:18
2 0.204836 0.970403 00:18
nn.DataParallel(learn.model)
epoch train_loss valid_loss time
0 0.341708 1.103849 00:16
1 0.272570 1.067705 00:16
2 0.134262 1.055507 00:16
Watching GPU usage with nvidia-smi dmon -s u, I can see that only the DataParallel runs (the ones using with learn.parallel_ctx(): and nn.DataParallel(learn.model)) show GPUs 1, 2, and 3 being used. The problem is that the data-parallel training is actually slower, even when I try increasing the batch size or the embedding size.
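My understanding from the fastai docs is that distrib_ctx() only does real multi-GPU work when the script is launched once per GPU (e.g. with python -m fastai.launch), which sets the RANK/WORLD_SIZE variables that num_distrib() reads; in a single notebook process it seems to fall back to one GPU. For reference, here is a rough sketch of the launcher-based pattern as I understand it (the script name is mine, and I have not verified this is the right approach from a SageMaker notebook instance):

# train_collab.py -- rough sketch of the launcher-based pattern (unverified on SageMaker)
from fastai.collab import *
from fastai.tabular.all import *
from fastai.distributed import *

path = untar_data(URLs.ML_100k, dest="/opt/ml/input/data/collab")
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
dls = CollabDataLoaders.from_df(ratings, item_name='movie', bs=64)
learn = collab_learner(dls, n_factors=64, y_range=(0, 5.5))

with learn.distrib_ctx():  # should pick up the rank/world size set by the launcher
    learn.fit_one_cycle(3, 5e-3)

which would be launched from a terminal with python -m fastai.launch train_collab.py (one process per visible GPU, as I understand it). If that is the missing piece, pointers on how to do this cleanly from SageMaker would be great.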
Any help with this would be appreciated. I have a much larger collaborative filtering model that runs into the same issues as this MovieLens example, and I need to cut its training time, ideally with parallel or distributed training.