I’m currently working with TorchX in conjunction with Volcano scheduling for my training jobs on an Amazon EKS cluster. I’ve also integrated Karpenter autoscaler for effective node scaling. Additionally, I’m using managed node groups with labeled nodes that have specific taints applied.
Our internal data and machine learning teams have the requirement to specify NodeSelectors and Tolerations to target jobs on particular nodes or managed node groups. While referring to the documentation provided here: TorchX Specifications, I observed that capabilities={“node.kubernetes.io/instance-type”: “”} are used as NodeSelectors when the job is created through Volcano. However, this approach doesn’t seem to allow for sending a list of labels, which our use case demands.
Furthermore, I’m also interested in incorporating tolerations into these jobs to ensure proper scheduling and execution in our environment. If any of you have experience in implementing NodeSelectors and Tolerations in TorchX within an Amazon EKS setup, I would highly appreciate your insights and advice.
If there’s no previous experience with this scenario, I’m considering raising a feature request to address these needs. Your guidance and input would be greatly valued.