How to use XML annotations to train on a dataset

Hi! I am using the SVT dataset (link: http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset). It contains street view text, and I want to train a model to label the text in those images. The problem is that I don't know how to handle the annotations provided in the XML file. I have the input (the images fed to the network), but how should I provide the output (the labels for each image)? Please help ASAP.
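
A minimal sketch of one way to turn such annotations into per-image labels, using Python's standard `xml.etree.ElementTree`. It assumes each `<image>` element holds an `<imageName>` plus a list of `<taggedRectangle>` elements carrying `x`, `y`, `width`, `height` attributes and a `<tag>` child with the word. That layout and the sample values are assumptions for illustration, so check them against the actual file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the assumed annotation layout.
# Element and attribute names are assumptions; verify them in the real XML.
SAMPLE_XML = """<tagset>
  <image>
    <imageName>img/example_01.jpg</imageName>
    <taggedRectangles>
      <taggedRectangle height="75" width="236" x="375" y="253">
        <tag>LIVING</tag>
      </taggedRectangle>
      <taggedRectangle height="76" width="175" x="639" y="272">
        <tag>ROOM</tag>
      </taggedRectangle>
    </taggedRectangles>
  </image>
</tagset>"""


def parse_annotations(root):
    """Map each image name to a list of word boxes (text plus pixel coordinates)."""
    labels = {}
    for image in root.iter("image"):
        name = image.findtext("imageName")
        boxes = []
        for rect in image.iter("taggedRectangle"):
            boxes.append({
                "text": rect.findtext("tag"),
                "x": int(rect.get("x")),
                "y": int(rect.get("y")),
                "width": int(rect.get("width")),
                "height": int(rect.get("height")),
            })
        labels[name] = boxes
    return labels


labels = parse_annotations(ET.fromstring(SAMPLE_XML))
for name, boxes in labels.items():
    print(name, boxes)
```

For a file on disk, `ET.parse(path).getroot()` gives the same kind of root element, where `path` is wherever the annotation XML is stored. The resulting dictionary pairs each image with its target words and boxes: the boxes can serve as detection targets, or be used to crop word images whose `text` is the recognition label.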