CSE · Seminar 10 · Orchestrating ML models across the edge

Kubernetes for Edge Inference

Lightweight Kubernetes distributions extend container orchestration to thousands of resource-constrained edge nodes, enabling scalable, GPU-aware model serving close to data.

Kubernetes · K3s · KubeEdge · edge AI · model serving · GPU

Running inference at the edge (on factory gateways, in retail stores or at 5G base stations) cuts latency and bandwidth use compared with sending data to the cloud. But managing models across thousands of distributed, intermittently connected nodes is hard. Kubernetes, the de facto container orchestrator, is being adapted to do exactly this.

Working principle

Standard Kubernetes is too heavy for most edge hardware. Distributions such as K3s strip it down to a single small binary, while KubeEdge separates the cloud-side control plane component (CloudCore) from an edge agent (EdgeCore) that keeps pods running even when the link to the cloud drops. A model is packaged as a container image, scheduled onto nodes whose labels match its GPU or accelerator requirements, and exposed through a serving runtime (KServe, Triton) that handles batching and autoscaling.
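
As a rough illustration of the scheduling step, the sketch below builds a Pod manifest as a plain Python dictionary and serialises it to JSON, which `kubectl apply -f` accepts alongside YAML. The pod name, image URL and the `accelerator: nvidia-gpu` label are hypothetical; `nvidia.com/gpu` is the resource name advertised by the NVIDIA device plugin, which is assumed here to be installed on the target node.

```python
import json


def inference_pod(name: str, image: str, accelerator: str, gpus: int = 1) -> dict:
    """Return a Pod manifest that pins a model container to labelled nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # Only nodes carrying this label are candidates (label is hypothetical).
            "nodeSelector": {"accelerator": accelerator},
            "containers": [
                {
                    "name": "model-server",
                    "image": image,
                    # GPUs are requested through resource limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical model image and label value, for illustration only.
manifest = inference_pod(
    name="defect-detector",
    image="registry.example.com/models/defect-detector:1.0",
    accelerator="nvidia-gpu",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

In practice a serving runtime would usually generate the pod from a higher-level resource, so this shows only the placement fields that end up in the pod spec.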

[Diagram: Cloud control plane; Model / image registry; Edge node + GPU; Edge node + NPU; Edge node (CPU); Sensors / users. Link labels: sync, pull, infer. Alt text: Cloud-managed control plane orchestrating heterogeneous edge nodes.]
Figure 1. KubeEdge-style topology. The cloud manages desired state; edge nodes pull model images and serve inference locally, surviving network partitions.
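
The offline behaviour described in Figure 1 can be pictured as a local cache of desired state. The toy model below is not KubeEdge's actual code: `EdgeAgent`, its table layout and the `cloud` callable are hypothetical names, and an in-memory SQLite database stands in for the on-disk store a real edge agent would keep.

```python
import json
import sqlite3
from typing import Callable


class EdgeAgent:
    """Toy edge agent: caches desired state so it survives a cloud outage."""

    def __init__(self, db: sqlite3.Connection):
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS desired (name TEXT PRIMARY KEY, spec TEXT)"
        )

    def sync(self, cloud: Callable[[], dict]) -> bool:
        """Pull desired state from the cloud; return False if it is unreachable."""
        try:
            desired = cloud()
        except ConnectionError:
            return False  # keep serving from the cached state
        with self.db:  # one transaction: the cache is replaced atomically
            self.db.execute("DELETE FROM desired")
            self.db.executemany(
                "INSERT INTO desired VALUES (?, ?)",
                [(name, json.dumps(spec)) for name, spec in desired.items()],
            )
        return True

    def running_pods(self) -> dict:
        """Pods the node should keep running, read from the local cache only."""
        rows = self.db.execute("SELECT name, spec FROM desired ORDER BY name")
        return {name: json.loads(spec) for name, spec in rows}


def cloud_up() -> dict:
    # Hypothetical desired state returned while the link is healthy.
    return {"defect-detector": {"image": "models/defect-detector:1.0"}}


def cloud_down() -> dict:
    raise ConnectionError("link to cloud lost")


agent = EdgeAgent(sqlite3.connect(":memory:"))
online = agent.sync(cloud_up)     # link is healthy: cache is refreshed
offline = agent.sync(cloud_down)  # link drops: cache is left untouched
pods = agent.running_pods()
print(online, offline, pods)
```

The point of the design is that `running_pods` never consults the cloud, so a failed sync changes nothing about what the node keeps serving.
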
Table 1. Cloud K8s vs. edge-optimised distributions

Property            Vanilla K8s       K3s / KubeEdge
Footprint           Hundreds of MB    ~50–100 MB
Offline operation   Limited           Edge autonomy on disconnect
Target node         Server / VM       ARM gateway, IoT box
Datastore           etcd              SQLite / lightweight

Design principle: Edge clusters must tolerate unreliable networks and heterogeneous accelerators. Node labels, taints and device plugins steer each model to compatible hardware; local autonomy keeps inference alive during outages.
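
How labels, taints and device-plugin resources combine can be sketched as a filter over candidate nodes. This is a simplified model, not the real kube-scheduler: the `Node` and `Workload` classes are hypothetical, taints are matched by key only (real taints also carry a value and an effect), and the node names, label values and the `example.com/npu` resource are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)         # taint keys on the node
    allocatable: dict = field(default_factory=dict)  # e.g. device-plugin resources


@dataclass
class Workload:
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)    # taint keys it tolerates
    requests: dict = field(default_factory=dict)


def feasible(node: Node, work: Workload) -> bool:
    """True if the node passes the label, taint and resource checks."""
    labels_ok = all(node.labels.get(k) == v for k, v in work.node_selector.items())
    taints_ok = node.taints <= work.tolerations  # every taint must be tolerated
    resources_ok = all(
        node.allocatable.get(res, 0) >= qty for res, qty in work.requests.items()
    )
    return labels_ok and taints_ok and resources_ok


nodes = [
    Node("gw-gpu", {"accelerator": "nvidia-gpu"}, {"edge"}, {"nvidia.com/gpu": 1}),
    Node("gw-npu", {"accelerator": "npu"}, {"edge"}, {"example.com/npu": 1}),
    Node("gw-cpu", {}, {"edge"}, {}),
]
vision_model = Workload(
    node_selector={"accelerator": "nvidia-gpu"},
    tolerations={"edge"},
    requests={"nvidia.com/gpu": 1},
)
candidates = [n.name for n in nodes if feasible(n, vision_model)]
print(candidates)  # prints ['gw-gpu']
```

Dropping the toleration from `vision_model` leaves no feasible node, which is how a taint keeps ordinary workloads off specialised edge hardware.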

Applications

  • Real-time computer vision in manufacturing and retail
  • Telco / 5G MEC workloads requiring single-digit-ms latency
  • Federated fleets of stores, vehicles or smart-city cameras

References & further reading

  1. Xiong et al., “Extend Cloud to Edge with KubeEdge,” IEEE/ACM SEC 2018.
  2. Burns et al., “Borg, Omega, and Kubernetes,” ACM Queue, 2016.
  3. KServe & NVIDIA Triton Inference Server documentation, 2024–2025.