I’m currently using TorchX with Volcano scheduling for my training jobs on an Amazon EKS cluster. I’ve also integrated the Karpenter autoscaler for node scaling. Additionally, I’m using managed node groups whose nodes are labeled and have specific taints applied.
Our internal data and machine learning teams need to specify NodeSelectors and Tolerations to target jobs at particular nodes or managed node groups. While referring to the documentation provided here: TorchX Specifications, I observed that capabilities={"node.kubernetes.io/instance-type": ""} is used as a NodeSelector when the job is created through Volcano. However, this approach doesn’t seem to allow passing a list of labels, which our use case demands.
Furthermore, I’m also interested in incorporating tolerations into these jobs to ensure proper scheduling and execution in our environment. If any of you have experience in implementing NodeSelectors and Tolerations in TorchX within an Amazon EKS setup, I would highly appreciate your insights and advice.
If there’s no previous experience with this scenario, I’m considering raising a feature request to address these needs. Your guidance and input would be greatly valued.
I’ve run into something similar with TorchX on EKS. As you mentioned, the capabilities field maps to NodeSelectors, but it’s pretty limited; it doesn’t handle multiple labels or tolerations natively. What worked for us was customizing the scheduler or extending the TorchX component to add the extra Kubernetes fields manually before submission.
If you’re using a custom scheduler or running TorchX through a wrapper script, you could patch the generated pod spec using the Volcano Job API or a mutating webhook to inject the NodeSelectors and Tolerations. Not ideal, but it gives you full control.
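To make the patching idea concrete, here is a rough sketch of the step that injects the fields. It assumes you already have the Volcano Job manifest as a plain Python dict (for example, after serializing whatever TorchX generates) and that pods live under spec.tasks[].template.spec. The function name add_node_placement, the label "team", and the taint key "dedicated" are made up for illustration; none of them are part of TorchX or Volcano.

```python
import copy


def add_node_placement(job, node_selector, tolerations):
    """Return a copy of a Volcano Job manifest (plain dict) with the given
    nodeSelector labels and tolerations merged into every task's pod spec.

    Hypothetical helper for illustration; not a TorchX or Volcano API.
    """
    patched = copy.deepcopy(job)  # leave the caller's manifest untouched
    for task in patched.get("spec", {}).get("tasks", []):
        pod_spec = task.setdefault("template", {}).setdefault("spec", {})

        # Merge labels; an existing instance-type selector is kept unless
        # the caller passes the same key.
        selector = pod_spec.get("nodeSelector") or {}
        selector.update(node_selector)
        pod_spec["nodeSelector"] = selector

        # Append tolerations, skipping exact duplicates.
        existing = pod_spec.get("tolerations") or []
        for toleration in tolerations:
            if toleration not in existing:
                existing.append(copy.deepcopy(toleration))
        pod_spec["tolerations"] = existing
    return patched


# Example manifest: all values below are placeholders, not real output.
job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "spec": {
        "tasks": [
            {
                "name": "trainer-0",
                "replicas": 1,
                "template": {
                    "spec": {
                        "nodeSelector": {
                            "node.kubernetes.io/instance-type": "p3.2xlarge"
                        },
                        "containers": [],
                    }
                },
            }
        ]
    },
}

patched = add_node_placement(
    job,
    node_selector={"team": "ml-training"},
    tolerations=[
        {
            "key": "dedicated",
            "operator": "Equal",
            "value": "gpu",
            "effect": "NoSchedule",
        }
    ],
)
```

You would then submit the patched manifest yourself (kubectl or the Kubernetes client) instead of letting TorchX submit the original one. The same merge logic could sit behind a mutating webhook if you would rather not touch the submission path.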
I’d support a feature request. Better native support for these Kubernetes features in TorchX would help a lot in more complex EKS setups like yours.