Hi
I’m new to working with timelines and I have a problem for which I haven’t been able to find any good resources.
I would appreciate it if anyone could give me some pointers.
In my case I’m interested in predicting an event for a user of a website.
Per user, I have a timeline of their usage of the website in the form of a dataframe.
Say I want to predict whether a user leaves the website.
My goal is to look at a certain time frame (let’s say the last 5 minutes, which would be 5 rows of my dataframe if I’m aggregating data per minute) and predict the probability of the user leaving the website.
A data sample (for one user) would be:

Minute  Feature_1  Feature_2  Feature_3  ...  Target_0  Target_1
  -4      0.4        0.23       0.64            1         0
  -3      0.24       0.23       0.64            1         0
  -2      0.34       0.1        0.64            1         0
  -1      0.56       0.2        0.64            1         0
   0      0.64       0.3        0.64            0         1
The last row is the most recent observation of this user. The target describes whether the user has left the website at that point.
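For context, this is roughly how I build the fixed-length samples from a user’s dataframe (simplified; make_windows and the column names are just stand-ins for my actual preprocessing):

import torch

def make_windows(df, lookback=5):
    # slide a fixed-size window over one user's minute-aggregated rows;
    # each sample is (features of `lookback` minutes, label of the last minute)
    feature_cols = [c for c in df.columns if c.startswith('Feature_')]
    samples = []
    for end in range(lookback, len(df) + 1):
        window = df.iloc[end - lookback:end]
        x = torch.tensor(window[feature_cols].values, dtype=torch.float32)
        y = int(window['Target_1'].iloc[-1])  # Target_1 == 1 means the user left
        samples.append((x, y))
    return samples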
Now my goal would be to give the network a set number of timesteps (let’s say 4) and then let it predict how likely it is for a given user to stay on the website or leave.
Currently I’m feeding my network a batch and then predicting the target values for the last timestep in the series, using this code:
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.utils.data import DataLoader

lookback = 4
batch_size = 8
layer_size = 64
learning_rate = 5e-4
dataset_name = 'Timeline_Dataset'
class MY_FIRST_GRU(nn.Module):
    def __init__(self):
        super(MY_FIRST_GRU, self).__init__()
        self.gru = nn.GRU(input_size=32,
                          hidden_size=20,
                          num_layers=2,
                          batch_first=True)
        self.l_out = nn.Linear(in_features=20,
                               out_features=2)

    def forward(self, batch):
        x, x_length, _ = batch
        # enforce_sorted=False so the batch doesn't have to be sorted by length
        x_pack = pack_padded_sequence(x.float(), x_length,
                                      batch_first=True, enforce_sorted=False)
        packed_x, hidden = self.gru(x_pack)
        output_padded, input_sizes = pad_packed_sequence(packed_x, batch_first=True)
        output = self.l_out(output_padded)  # [batch, seq_len, 2]
        # softmax over the class dimension (dim=1 would be the time axis)
        return F.log_softmax(output, dim=-1)
def train(train_loader):
    dl_model.train()
    total_loss = 0
    correct = 0
    for data_list in train_loader:
        optimizer.zero_grad()
        output = dl_model(data_list)  # shape = [8, 4, 2]
        y = data_list[2]  # class index of the last timestep for each sequence
        # nll_loss matches the log_softmax output; hinge_embedding_loss
        # expects targets in {-1, 1} and doesn't fit class indices.
        # note: with variable-length batches the last valid step is at
        # x_length - 1, not -1
        loss = F.nll_loss(output[:, -1, :], y.long())
        loss.backward()
        total_loss += loss.item()
        with torch.no_grad():
            pred = output[:, -1, :].argmax(dim=-1)  # predicted class per sample
            correct += pred.eq(y.long()).sum().item()
        optimizer.step()
    return total_loss / len(train_loader.dataset), correct / len(train_loader.dataset)
def test(loader):
    dl_model.eval()
    actuals = []
    correct = 0
    for data_list in loader:  # iterate over the argument, not the global test_loader
        with torch.no_grad():
            output = dl_model(data_list)
            y = data_list[2]
            pred = output[:, -1, :].argmax(dim=-1)
            actuals.extend(y.cpu().numpy())
            correct += pred.eq(y.long()).sum().item()
    return correct / len(loader.dataset), actuals
dataset = Timeline_Dataset('./data/', lookback)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          collate_fn=PadSequences(),  # custom collate that pads and returns (x, x_length, y)
                          pin_memory=True,
                          num_workers=1)
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         shuffle=True,
                         collate_fn=PadSequences(),
                         pin_memory=True,
                         num_workers=1)
dl_model = MY_FIRST_GRU()
optimizer = torch.optim.Adam(dl_model.parameters(), lr=learning_rate)  # using Adam here; any optimizer works

epochs = 100
if torch.cuda.is_available():
    print('Using GPU ' + str(torch.cuda.current_device()) + ' as main GPU')
else:
    print('Using CPU')

t0 = time.time()
for epoch in range(1, epochs + 1):
    loss, train_acc = train(train_loader)
    test_acc, actuals = test(test_loader)
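At inference time I then read off the probability of leaving from the last timestep, roughly like this (exp because the network outputs log-probabilities; data_list stands for one collated batch):

dl_model.eval()
with torch.no_grad():
    log_probs = dl_model(data_list)  # [batch, seq_len, 2]
    # probability of the "leave" class (Target_1) at the most recent minute
    prob_leave = log_probs[:, -1, 1].exp()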
Is there a way to predict for every timestep and then calculate the loss based on all of the outputs?
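What I have in mind is something like the following (just a sketch of my idea, not working code; it assumes the targets come as one label per timestep, and the masking via x_length is my guess at how to keep the padded steps from contributing to the loss):

# inside the training loop, instead of only using the last timestep:
output = dl_model(data_list)  # [batch, seq_len, 2] log-probabilities
y = data_list[2]              # would need shape [batch, seq_len] here
x_length = data_list[1]

# flatten time into the batch dimension so nll_loss sees [N, 2] vs [N]
loss_per_step = F.nll_loss(output.reshape(-1, 2),
                           y.long().reshape(-1),
                           reduction='none').reshape(y.shape)

# zero out the padded positions using the true sequence lengths
mask = (torch.arange(y.size(1))[None, :] < x_length[:, None]).float()
loss = (loss_per_step * mask).sum() / mask.sum()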