Simple CNN-based baseline for sparse animal pose estimation

I am looking for a simple CNN-based baseline for detecting landmarks on an animal, given that I have ground-truth annotations.
Here is a screenshot from a recent paper in this regard:

What would be good starter code? I would rather not dig into something complex like OpenPose, Simple Baselines by Microsoft (ECCV 2018), or DeeperCut. I am looking for something quite simple that can predict the landmarks in a supervised manner. It would be a bonus if it could use some form of transfer learning from the human pose literature, as long as that doesn’t make the learning worse.

To start with, I have an annotated dataset of 800 frames with four landmarks.
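To make the question concrete, this is roughly the level of complexity I have in mind: a minimal PyTorch sketch that regresses normalized (x, y) coordinates for the four landmarks directly from the image. Only the number of landmarks comes from my dataset. The architecture, layer sizes, loss, input size, and the names `LandmarkRegressor` and `train_step` are my own assumptions for illustration, not taken from any of the papers mentioned, and the data here is random stand-in data.

```python
import torch
import torch.nn as nn

NUM_LANDMARKS = 4  # from my dataset; everything else below is an assumption


class LandmarkRegressor(nn.Module):
    """Small CNN that regresses (x, y) per landmark, normalized to [0, 1]."""

    def __init__(self, num_landmarks: int = NUM_LANDMARKS):
        super().__init__()
        self.num_landmarks = num_landmarks
        # Three strided conv layers followed by global average pooling.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # One (x, y) pair per landmark.
        self.head = nn.Linear(64, num_landmarks * 2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.features(images).flatten(1)   # (B, 64)
        coords = torch.sigmoid(self.head(feats))   # (B, 2 * K), values in [0, 1]
        return coords.view(-1, self.num_landmarks, 2)


def train_step(model, optimizer, images, targets) -> float:
    """Run one supervised update; targets are normalized (B, K, 2) coordinates."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.smooth_l1_loss(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = LandmarkRegressor()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    images = torch.rand(8, 3, 128, 128)         # random stand-in for frames
    targets = torch.rand(8, NUM_LANDMARKS, 2)   # random stand-in for annotations
    print("loss:", train_step(model, optimizer, images, targets))
```

For transfer learning, I imagine `self.features` could be swapped for a pretrained backbone, but I do not know whether direct coordinate regression like this is competitive with heatmap-based approaches on only 800 frames.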

link 1. https://www.biorxiv.org/content/biorxiv/early/2018/07/28/377895.full.pdf
link 2. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8858533

Please let me know if you need any further information.

P.S.: I recently came across a paper that does domain adaptation from human pose estimation to animal pose estimation: Cross-Domain Adaptation for Animal Pose Estimation. I am not sure how much control I would have over implementing it, though, or how easily it generalizes to any animal.

Is there a simple baseline for animal pose estimation available in PyTorch yet?