I was training a TFT (Temporal Fusion Transformer) model on a Colab GPU. It trained, but relatively slowly, since TFT is a large model. I then tried to train it on a Colab TPU instead, but training never gets going: it reaches Epoch 0 and freezes. My first question is: is the code below OK in terms of TPU utilization?
import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import QuantileLoss

max_prediction_length = len(test)
max_encoder_length = 4 * max_prediction_length
# training_cutoff = df_19344_tmp["time_idx"].max() - max_prediction_length
training = TimeSeriesDataSet(
    train,
    time_idx='time_idx',
    target='occupancy',
    group_ids=['property_id'],
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=max_prediction_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=['property_id'],
    static_reals=[],
    time_varying_known_categoricals=[],
    time_varying_known_reals=['time_idx', 'sin_day', 'cos_day', 'sin_month', 'cos_month', 'sin_year', 'cos_year'],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=[],
    target_normalizer=GroupNormalizer(
        groups=['property_id'], transformation="softplus"
    ),
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
    allow_missing_timesteps=True
)
validation = TimeSeriesDataSet.from_dataset(training, train, predict=True, stop_randomization=True)
batch_size = 32  # set this between 32 and 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)
trainer = pl.Trainer(
    max_epochs=100,
    # accelerator='cpu',
    accelerator='tpu',
    devices=1,
    enable_model_summary=True,
    auto_lr_find=False,
    # clipping gradients is a hyperparameter and is important to prevent divergence
    # of the gradient for recurrent neural networks
    gradient_clip_val=0.1,
    check_val_every_n_epoch=None
)
tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate=0.0005,
    hidden_size=8,  # most important hyperparameter apart from learning rate
    # number of attention heads; set to up to 4 for large datasets
    attention_head_size=1,
    dropout=0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size=8,  # set to <= hidden_size
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
)
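For completeness, training is launched roughly like this (a minimal sketch rather than the exact cell, and the torch_xla device check is only a sanity test that the TPU runtime is visible):

# sanity check that the TPU runtime is visible (assumes torch_xla is installed in Colab)
import torch_xla.core.xla_model as xm
print(xm.xla_device())  # should print something like xla:0

# launch training with the dataloaders defined above; this is the step that hangs at Epoch 0 on the TPU
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)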
And my second question: does the TFT implementation in pytorch-forecasting even support TPU training?
This is where the model freezes when training on a Colab TPU: