This summary of the video was created by an AI. It might contain some inaccuracies.
00:00:00 – 00:21:55
The video showcases CoreWeave's integration of Slurm on Kubernetes, known as SUNK, to efficiently manage large training clusters for GPU workloads in fields like AI, ML, VFX, and CGI rendering. The integration combines Slurm and Kubernetes to simplify compute management, offering high availability, easy scaling, and efficient resource allocation. Key components include SUNK itself (Slurm on Kubernetes), login pods for user access, NodeSets for deploying compute pods, and the Slurm Syncer for communication between the two systems. The speaker emphasizes effective resource management, workload orchestration, and collaboration with NVIDIA, highlighting how Kubernetes improves cluster-management efficiency and hardware-issue diagnosis. The project is slated for open-sourcing in 2024, and key team members, including Peter Salanki and Andrew Sanitar, are recognized.
00:00:00
In this segment of the video, the speaker, Navarre Pratt, discusses CoreWeave's integration of Slurm on Kubernetes, which they call SUNK. CoreWeave is a specialized cloud provider focused on GPU workloads for fields like AI, ML, VFX, and CGI rendering. They run bare-metal Kubernetes across all compute nodes and use NVIDIA H100 nodes for ML training clusters. CoreWeave uses Kubernetes as the foundation of its compute delivery, offering clients a seamless experience with no proprietary APIs. They operate both multi-tenant and single-tenant data centers: multi-tenant clusters serve bursty workloads, while single-tenant clusters cater to large-scale ML training. The hardware setup includes eight NVIDIA H100s per node and high-speed InfiniBand interconnects, making CoreWeave's infrastructure well suited to powering high-end training clusters efficiently.
00:03:00
In this segment of the video, the speaker discusses how CoreWeave leverages partners and builds on top of Kubernetes to offer products like Slurm. By integrating Slurm on top of Kubernetes, they aim to simplify compute management for their clients, combining the two systems to manage workloads efficiently and scale up and down programmatically. The speaker introduces SUNK (Slurm on Kubernetes), which packages everything needed to run a Slurm cluster as Kubernetes resources, including configuration and authentication options. This integration lets users manage Slurm from within Kubernetes, retaining all Slurm features while benefiting from the Kubernetes ecosystem and its management capabilities.
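The "everything packaged as Kubernetes resources" idea can be sketched as a mapping from Slurm daemons to the Kubernetes kinds they might run as. The component names are standard Slurm daemons, but the kind assignments and replica counts below are illustrative assumptions, not CoreWeave's actual chart values:

```python
# Hypothetical sketch: each Slurm daemon is packaged as a Kubernetes
# resource. Kind/replica choices are assumptions for illustration.
SLURM_COMPONENTS = {
    "slurmctld": {"kind": "Deployment", "replicas": 2},    # HA control plane
    "slurmdbd":  {"kind": "Deployment", "replicas": 1},    # accounting gateway
    "login":     {"kind": "Deployment", "replicas": 2},    # SSH entry points
    "slurmd":    {"kind": "NodeSet",    "replicas": None}, # one pod per compute node
}

def kind_for(component: str) -> str:
    """Return the Kubernetes kind a Slurm component is deployed as."""
    return SLURM_COMPONENTS[component]["kind"]
```

The key point the talk makes is that compute daemons (`slurmd`) are not ordinary Deployments but node-bound pods, which is what the NodeSet abstraction (covered later) handles.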
00:06:00
In this part of the video, the speaker discusses the benefits of using Kubernetes with Slurm: high availability of the Slurm control-plane services, easy scaling of nodes in the cluster, management of shared file systems through persistent volume claims, and state synchronization between Slurm and Kubernetes. A custom Kubernetes scheduler allows Kubernetes workloads to be scheduled onto Slurm nodes. A high-level diagram explains the design, which includes cluster-wide Kubernetes resources, compute nodes, and the components of a Slurm cluster running as Kubernetes pods. Fine-tuning the resources requested by each component can optimize costs in a cloud environment.
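The cost-tuning point can be made concrete with per-component resource requests: small control-plane pods versus near-whole-node compute pods. All numbers below are invented for illustration, not CoreWeave's settings:

```python
# Illustrative per-component Kubernetes resource requests. The control
# plane stays cheap while slurmd pods claim most of a compute node.
REQUESTS = {
    "slurmctld": {"cpu": 2,  "memory": "8Gi"},
    "slurmdbd":  {"cpu": 1,  "memory": "4Gi"},
    "login":     {"cpu": 4,  "memory": "16Gi"},
    "slurmd":    {"cpu": 96, "memory": "960Gi"},  # near-whole-node request
}

def total_cpu(components) -> int:
    """Sum the requested CPU cores across a set of components."""
    return sum(REQUESTS[c]["cpu"] for c in components)
```

For example, the entire control plane in this sketch requests only `total_cpu(["slurmctld", "slurmdbd", "login"])` = 7 cores, so it can be packed onto modest instances while the GPU nodes are reserved for compute pods.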
00:09:00
In this segment of the video, the speaker describes the login pods as the entry point where Slurm users connect through SSH or other authentication methods. Once connected to a login pod, users interact with the Slurm cluster as usual, without needing to know about Kubernetes. Configuration for the Slurm cluster is handled through ConfigMaps and Secrets, including the topology, prolog/epilog scripts, and the Slurm config. Changes to the Slurm config can be made directly in the ConfigMap and are synced to compute and login nodes without a restart. NodeSets manage the deployment of Slurm compute pods onto physical nodes. The Slurm Syncer keeps status in sync between Slurm and Kubernetes, serving as a gateway between the two. Operators ensure consistency across the cluster, and Grafana dashboards visualize cluster metrics ingested via Prometheus.
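The restart-free config sync described above can be sketched as a watcher that hashes the mounted config and triggers a reconfigure only when the projected ConfigMap content actually changes. This is a minimal sketch of the idea, not CoreWeave's implementation:

```python
import hashlib

def config_digest(content: str) -> str:
    """Stable fingerprint of a mounted config file's contents."""
    return hashlib.sha256(content.encode()).hexdigest()

class ConfigWatcher:
    """Hypothetical sidecar loop: reload Slurm only on real changes."""

    def __init__(self, initial: str):
        self.digest = config_digest(initial)
        self.reloads = 0

    def observe(self, content: str) -> bool:
        """Return True (and count a reload) when the config changed.
        In a real pod this is where `scontrol reconfigure` would run."""
        new = config_digest(content)
        if new != self.digest:
            self.digest = new
            self.reloads += 1
            return True
        return False
```

Kubernetes eventually propagates ConfigMap updates into mounted volumes, so a loop like this lets slurm.conf edits take effect on compute and login pods without restarting them.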
00:12:00
In this part of the video, the speaker explains how to deploy GPU workloads in Kubernetes by requesting a special SUNK accelerator resource. The segment covers running NodeSet-managed Slurm compute pods on the nodes of a Slurm cluster and deploying Kubernetes workloads on top of them. It details the statuses of nodes in the cluster and the support for protected rolling updates. The Slurm Syncer, comprising the Slurm client and the syncer pod controller, facilitates communication between Kubernetes and Slurm. An example shows how information flows from Slurm to Kubernetes when new jobs are launched in Slurm: the process involves calling out to the Slurm Syncer so that node resources are accounted for the running jobs.
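A pod requesting GPUs through a SUNK-style extended resource might look like the sketch below. The resource name `sunk.example.com/accelerator` and the container image are assumptions for illustration; the actual resource name used by SUNK may differ:

```python
# Hedged sketch of a pod spec that requests GPUs via an extended
# resource instead of the stock nvidia.com/gpu device-plugin resource.
def gpu_pod(name: str, gpus: int) -> dict:
    """Build a minimal pod manifest requesting `gpus` accelerators."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "trainer",
                "image": "example.com/training:latest",  # placeholder image
                "resources": {
                    # Hypothetical SUNK extended-resource name:
                    "limits": {"sunk.example.com/accelerator": str(gpus)},
                },
            }],
        },
    }
```

Because the accelerator is surfaced as an ordinary Kubernetes extended resource, the custom scheduler can account for the same GPUs whether they are consumed by a Slurm job or by a Kubernetes pod.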
00:15:00
In this segment of the video, the speaker explains how Kubernetes nodes are managed so that only designated workloads run on them, describing how nodes are cordoned to perform maintenance or isolate issues. The interaction between the Kubernetes and Slurm sides is detailed, showing how resources are managed seamlessly for different types of workloads. The custom SUNK scheduler enables efficient resource allocation for both Kubernetes and Slurm jobs. The speaker also mentions Argo CD, the tool their organization uses to deploy and manage Slurm clusters. Overall, the segment highlights how integrating Kubernetes and Slurm enables effective resource management and workload orchestration.
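The cordoning decision described above can be sketched as a small predicate: a node is marked unschedulable either for maintenance or while Slurm jobs occupy it. The state names follow Slurm's `sinfo` vocabulary, but the mapping itself is an illustrative assumption rather than SUNK's exact logic:

```python
# States in which Slurm considers a node occupied by jobs.
SLURM_BUSY_STATES = {"allocated", "mixed", "completing"}

def should_cordon(slurm_state: str, maintenance: bool) -> bool:
    """Decide whether the Kubernetes node should be unschedulable:
    cordon for maintenance, or while Slurm jobs hold the node."""
    return maintenance or slurm_state in SLURM_BUSY_STATES
```

Keeping busy Slurm nodes cordoned on the Kubernetes side prevents general pods from landing on GPUs that a Slurm job is already using, while idle nodes become available to Kubernetes workloads again.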
00:18:00
In this segment of the video, the speaker discusses the benefits of using Kubernetes to manage clusters efficiently. They highlight the ease of rolling out and testing changes in a Slurm cluster, showcasing the ability to deploy and tear down clusters within minutes for testing. The speaker also introduces an automated system that conducts thorough node health checks and hardware tests using Argo Workflows on idle compute nodes; these checks run selectively on nodes not actively used by Slurm jobs, ensuring minimal disruption to customer workloads. Additionally, the speaker explains how Prometheus metrics are exposed for monitoring Slurm clusters, demonstrating Grafana dashboards for visualizing cluster state, and shares a screenshot of a Grafana dashboard showing these metrics. They also discuss their collaboration with NVIDIA on submitting MLPerf benchmark results on CoreWeave's infrastructure; the ability to quickly diagnose hardware issues and work closely with NVIDIA is attributed to managing everything on top of Kubernetes together with Slurm.
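Exposing Slurm cluster state to Prometheus amounts to rendering gauges in Prometheus's text exposition format. The sketch below uses a hypothetical metric name to show the shape of such an exporter's output:

```python
# Minimal sketch of rendering Slurm node-state counts in Prometheus's
# text exposition format. The metric name is a hypothetical example.
def render_node_states(counts: dict) -> str:
    """Render {state: count} as Prometheus gauge samples."""
    lines = ["# TYPE slurm_node_state_count gauge"]
    for state, n in sorted(counts.items()):
        lines.append(f'slurm_node_state_count{{state="{state}"}} {n}')
    return "\n".join(lines)
```

Serving output like this from an HTTP endpoint is all Prometheus needs to scrape it, after which Grafana dashboards like the one shown in the talk can graph node states over time.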
00:21:00
In this segment of the video, the speaker discusses the success of the project, which has helped them scale rapidly. They mention plans to open-source it in 2024 and acknowledge key team members, including Peter Salanki and Andrew Sanitar. The speaker encourages viewers to stay tuned for marketing material and announcements from CoreWeave as they move toward the release.
