How to use XML annotations as labels when training on a dataset

Hi! I am using the SVT dataset (link - ). It contains street view text, and I want to train a model to label the text in those images. The problem is that I don't know how to deal with the annotations provided in the XML file. I have the input (the images fed to the network), but how should I provide the output (the labels for each image)? Please help ASAP.
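
To make the question concrete, here is a minimal sketch of how the XML might be turned into per-image labels with Python's standard xml.etree.ElementTree module. It assumes each image element holds an imageName and a list of taggedRectangle elements with x, y, width and height attributes and a tag child containing the word; check this against the actual file. The sample XML, the file name in it, and the parse_annotations helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the ASSUMED annotation layout; not copied from the dataset.
SAMPLE_XML = """<tagset>
  <image>
    <imageName>img/example_01.jpg</imageName>
    <taggedRectangles>
      <taggedRectangle height="75" width="236" x="375" y="253">
        <tag>LIVING</tag>
      </taggedRectangle>
      <taggedRectangle height="76" width="175" x="639" y="272">
        <tag>ROOM</tag>
      </taggedRectangle>
    </taggedRectangles>
  </image>
</tagset>"""


def parse_annotations(xml_text):
    """Map each image name to a list of word boxes (illustrative helper)."""
    root = ET.fromstring(xml_text)
    labels = {}
    for image in root.iter("image"):
        name = image.findtext("imageName")
        boxes = []
        for rect in image.iter("taggedRectangle"):
            boxes.append({
                # Box position and size in pixels, read from the attributes.
                "x": int(rect.get("x")),
                "y": int(rect.get("y")),
                "width": int(rect.get("width")),
                "height": int(rect.get("height")),
                # The transcribed word for this box.
                "text": rect.findtext("tag"),
            })
        labels[name] = boxes
    return labels


labels = parse_annotations(SAMPLE_XML)
for name, boxes in labels.items():
    print(name, boxes)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring. The resulting dictionary pairs each image path with its boxes and words, which could then be used either to crop word images for a recognizer or as box targets for a detector, depending on the model.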