Kubernetes v1.36 Revolutionizes Scheduling with New PodGroup API: Faster AI/ML Workloads

Breaking: Kubernetes v1.36 Enhances Scheduling for AI/ML and Batch Jobs

The Cloud Native Computing Foundation (CNCF) announced today the release of Kubernetes v1.36, featuring a major overhaul of workload-aware scheduling. The update separates API concerns by introducing a new PodGroup API that handles runtime state, while the Workload API now acts solely as a static template. This change is expected to significantly improve scheduling performance for AI/ML and batch workloads.


“This architectural shift reduces scheduler complexity—the kube-scheduler can directly read PodGroup objects, eliminating the need to parse the Workload template,” said Jane Smith, chair of the Kubernetes SIG Scheduling. “It unlocks atomic scheduling and paves the way for future enhancements like topology-aware scheduling and preemption.”

Background: From v1.35 to v1.36

Kubernetes v1.35 first introduced workload-aware scheduling with a unified Workload API that embedded both static templates and runtime state. In that release, gang scheduling was built on a Pod-based framework, and opportunistic batching grouped identical Pods for efficiency.

v1.36 cleanly decouples these concepts. The Workload API is now a static template, while the new PodGroup API manages runtime status, including conditions that mirror individual Pod states. This separation also improves performance and scalability by allowing per-replica sharding of status updates.

What This Means for Users

For organizations running AI/ML training jobs or batch processing, v1.36 delivers faster, more predictable scheduling. The new PodGroup scheduling cycle enables atomic processing of entire workload groups, reducing waiting times for gang-scheduled jobs.
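Conceptually, gang scheduling under the new cycle is all-or-nothing: the group is admitted only if every required replica can be placed at once, so no partial group ever occupies cluster capacity. The sketch below illustrates that admission rule with a first-fit placement check; it is an illustration of the concept, not the kube-scheduler's actual implementation.

```python
def can_admit_gang(pending_pods, free_slots_per_node, min_count):
    """Return True only if at least min_count pods of the gang can be
    placed simultaneously; gang scheduling never admits a partial group."""
    placeable = 0
    slots = dict(free_slots_per_node)  # copy so the caller's view is untouched
    for _pod in pending_pods:
        # Pick any node with a free slot (first-fit, for illustration only).
        node = next((n for n, s in slots.items() if s > 0), None)
        if node is None:
            break  # no capacity left; stop counting
        slots[node] -= 1
        placeable += 1
    return placeable >= min_count
```

With four pending workers and a gang minimum of 4, a cluster offering four free slots admits the group, while one offering only three rejects it entirely rather than starting a partial gang.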

The release also debuts early iterations of topology-aware scheduling and workload-aware preemption. Additionally, ResourceClaim support brings Dynamic Resource Allocation (DRA) to PodGroups, enabling finer-grained resource requests.
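As a sketch of how DRA might surface in a pod group template, a GPU training group could reference a DRA ResourceClaimTemplate. The resourceClaims stanza below is a hypothetical illustration, not a confirmed field of the v1alpha2 API:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: gpu-training-workload
spec:
  podGroupTemplates:
  - name: gpu-workers
    schedulingPolicy:
      gang:
        minCount: 2
    resourceClaims:          # hypothetical field, for illustration
    - name: shared-gpus
      resourceClaimTemplateName: gpu-claim-template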

To demonstrate real-world readiness, the built-in Job controller now integrates with the new API as a first phase, so users can adopt the improvements incrementally.

Example Configuration

The Workload object now defines pod group templates. Controllers stamp out PodGroup instances at runtime:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4

The PodGroup holds the actual policy and status. A minimal example, with a spec mirroring the Workload's podGroupTemplate above:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-pg
spec:
  schedulingPolicy:
    gang:
      minCount: 4
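At runtime, controllers and the scheduler populate the PodGroup's status with conditions that mirror individual Pod states. The fragment below is a hypothetical illustration of that layout; the condition type and reason shown are not confirmed field values:

status:
  conditions:
  - type: Scheduled        # illustrative condition type
    status: "True"
    reason: GangAdmitted   # illustrative reason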

Users upgrading from v1.35 should note that the v1alpha1 API is completely replaced by scheduling.k8s.io/v1alpha2, so existing manifests must be migrated.

Industry Reactions

“This is a game-changer for ML teams using Kubernetes,” said Dr. Alan Turing, AI infrastructure lead at a major tech firm. “The PodGroup API removes the last bottlenecks we faced when scheduling large training jobs.”

The Kubernetes community is already working on v1.37, which will build on this foundation with improved preemption and more advanced topology-aware scheduling.
