Running out of Memory Easily with TemporalFusionTransformer

I tried to play with the TemporalFusionTransformer model from PyTorch Forecasting (I'm a beginner in PyTorch), following this guide here.

I'm running everything on the CPU and am not using a GPU. To my surprise, even for such a small dataset, PyTorch runs out of memory and Linux kills the process. I've tried a number of things, but the only change that really makes a difference is reducing the hidden size from hidden_size=160 to 16.

With hidden_size=160 and hidden_continuous_size=160 I couldn't run this on a machine with 128GB of RAM and had to move to one with 512GB.

I also generated a dataset myself, a univariate time series of [84961 rows x 8 columns], and with hidden_size=64 and hidden_continuous_size=32 it consumes 80GB of RAM. Am I missing something, or should it really be that hungry? Is there anything I can do to optimize the memory consumption?
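Concretely, for that run the model construction looks something like this (just a sketch; the remaining arguments and the dataset/dataloader setup follow the same pattern as the full code below):

# smaller configuration used for the generated dataset
tft_small = TemporalFusionTransformer.from_dataset(
    training,                     # a TimeSeriesDataSet built like the one below
    learning_rate=0.001,
    hidden_size=64,               # reduced from 160
    hidden_continuous_size=32,    # reduced from 160
    attention_head_size=4,
    dropout=0.1,
    output_size=7,
    loss=QuantileLoss(),
)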

Here's the full code:

import numpy as np
import pandas as pd
from lightning.pytorch.callbacks import EarlyStopping, LearningRateMonitor
from lightning.pytorch.loggers import TensorBoardLogger
from pytorch_forecasting import TimeSeriesDataSet, GroupNormalizer, TemporalFusionTransformer, QuantileLoss, RMSE, MAE, \
    MAPE
import lightning.pytorch as pl
from torchmetrics import R2Score

data = pd.read_csv('LD2011_2014.txt', index_col=0, sep=';', decimal=',')
data.index = pd.to_datetime(data.index)
data.sort_index(inplace=True)
data.head(5)
print(data)

data = data.resample('1h').mean().replace(0., np.nan)
earliest_time = data.index.min()
df = data[['MT_002', 'MT_004', 'MT_005', 'MT_006', 'MT_008']]
print(df)

df_list = []

for label in df:
    ts = df[label]

    # trim the series to the span between its first and last non-missing value
    start_date = min(ts.ffill().dropna().index)
    end_date = max(ts.bfill().dropna().index)

    active_range = (ts.index >= start_date) & (ts.index <= end_date)
    ts = ts[active_range].fillna(0.)

    tmp = pd.DataFrame({'power_usage': ts})
    date = tmp.index

    tmp['hours_from_start'] = (date - earliest_time).seconds / 60 / 60 + (date - earliest_time).days * 24
    tmp['hours_from_start'] = tmp['hours_from_start'].astype('int')

    tmp['days_from_start'] = (date - earliest_time).days
    tmp['date'] = date
    tmp['consumer_id'] = label
    tmp['hour'] = date.hour
    tmp['day'] = date.day
    tmp['day_of_week'] = date.dayofweek
    tmp['month'] = date.month

    # stack all time series vertically
    df_list.append(tmp)

time_df = pd.concat(df_list).reset_index(drop=True)

# match results in the original paper
time_df = time_df[(time_df['days_from_start'] >= 1096)
                  & (time_df['days_from_start'] < 1346)].copy()
print(time_df)

# mean power usage per consumer
print(time_df[['consumer_id', 'power_usage']].groupby('consumer_id').mean())

#---------------------------------------------------


# Hyperparameters
# batch size=64
# number heads=4, hidden sizes=160, lr=0.001, gr_clip=0.1

max_prediction_length = 24
max_encoder_length = 7 * 24
training_cutoff = time_df["hours_from_start"].max() - max_prediction_length

training = TimeSeriesDataSet(
    time_df[lambda x: x.hours_from_start <= training_cutoff],
    time_idx="hours_from_start",
    target="power_usage",
    group_ids=["consumer_id"],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["consumer_id"],
    time_varying_known_reals=["hours_from_start", "day", "day_of_week", "month", 'hour'],
    time_varying_unknown_reals=['power_usage'],
    target_normalizer=GroupNormalizer(
        groups=["consumer_id"], transformation="softplus"
    ),  # we normalize by group
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)

validation = TimeSeriesDataSet.from_dataset(training, time_df, predict=True, stop_randomization=True)

# create dataloaders for our model
batch_size = 32
# if you have a strong GPU, feel free to increase the number of workers
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)

early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5, verbose=True, mode="min")
lr_logger = LearningRateMonitor()
logger = TensorBoardLogger("/lightning_logs")

trainer = pl.Trainer(
    max_epochs=45,
    accelerator='cpu',
    devices=1,
    enable_model_summary=True,
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
    logger=logger)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.001,
    hidden_size=160,
    attention_head_size=4,
    dropout=0.1,
    hidden_continuous_size=160,
    output_size=7, 
    loss=QuantileLoss(),
    logging_metrics=[MAE(), RMSE(), MAPE(), R2Score()],
    log_interval=10,
    reduce_on_plateau_patience=4)

trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader)


best_model_path = trainer.checkpoint_callback.best_model_path
print(best_model_path)
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

If it's just that memory-intensive, how would I then run a multivariate time series? The dataset itself always fits easily into RAM, but memory usage explodes during training. Is there a way to do this via distributed training in a cloud environment (e.g. Google Cloud)?
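To make that concrete, by distributed training I mean something along the lines of Lightning's DDP strategy (just a sketch; the device and node counts are placeholders, and I haven't verified whether this actually reduces the per-process memory footprint):

# hypothetical multi-node setup using Lightning's built-in DDP strategy
trainer = pl.Trainer(
    max_epochs=45,
    accelerator='gpu',           # or 'cpu'
    devices=4,                   # processes/GPUs per machine (placeholder)
    num_nodes=2,                 # number of machines (placeholder)
    strategy='ddp',
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)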

Am I the first to come across this problem?

Did you solve the problem? I am facing the same problem with just 50GB of RAM, 33k time series, and a maximum look-back of 48 time steps, using hidden_size=16, attention_head_size=2, hidden_continuous_size=8.
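For reference, my model is built roughly like this (a sketch only; my own dataset construction is omitted):

# my configuration, for comparison; 'training' is my own TimeSeriesDataSet
tft = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=16,
    attention_head_size=2,
    hidden_continuous_size=8,
    loss=QuantileLoss(),
)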