Why subtract the first pose from the rest of the sequence in a DeepVO implementation?

Hi community. I am currently trying to implement the DeepVO paper by S. Wang, and I wonder: why do we need to take the relative pose w.r.t. the first frame of each sequence?
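For reference, here is a minimal sketch of what "pose relative to the first frame" means as a rigid-body frame change, T_0^{-1} T_i. It assumes each pose is a 4x4 homogeneous camera-to-world matrix (KITTI odometry ground truth stores each pose as a 3x4 matrix, which can be padded with a `[0, 0, 0, 1]` row). The helper names (`invert_rigid`, `relative_to_first`, `pose`) are hypothetical and are not taken from any particular DeepVO code base.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Invert a rigid transform [R | t; 0 0 0 1] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_to_first(poses):
    """Express every pose in the coordinate frame of the first pose: T_0^{-1} T_i."""
    T0_inv = invert_rigid(poses[0])
    return [matmul(T0_inv, T) for T in poses]

def pose(yaw, x, y, z):
    """Hypothetical helper: build a pose with a rotation about z and a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

# A two-frame sequence: the camera starts at (10, 5, 0) rotated 90 degrees
# about z, then moves one unit along the world y axis.
seq = [pose(math.pi / 2, 10.0, 5.0, 0.0), pose(math.pi / 2, 10.0, 6.0, 0.0)]
rel = relative_to_first(seq)
# rel[0] is (numerically) the identity; the translation of rel[1] is
# approximately (1, 0, 0), i.e. one unit along the first camera's own x axis.
```

In this example, subtracting world translations component by component gives (0, 1, 0), while the frame change gives (1, 0, 0): the two only agree when the first pose has no rotation, so the first orientation has to be removed along with the first position.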