Optimizing GPU Utilization for AI/ML Workloads on AWS EKS
Modern AI/ML workloads on Kubernetes often underutilize GPUs when each pod gets an entire device. AWS EKS can leverage advanced sharing features to squeeze more work out of each GPU. NVIDIA provides several mechanisms – Multi-Instance GPU (MIG), CUDA Multi-Process Service (MPS), and time-slicing – that allow multiple pods or processes to share a GPU’s resources. Broadly:
- MIG partitions Ampere and newer GPUs (e.g. A100, H100) into isolated mini-GPUs.
- MPS runs kernels from different processes in parallel on the same GPU.
- Time-slicing interleaves GPU execution among containers.
These options trade off performance isolation, throughput, and complexity.
NVIDIA Multi-Instance GPU (MIG)
MIG is a hardware partitioning feature (available on A100/H100) that divides one GPU into up to seven independent instances. Each slice has its own SMs, memory and cache, providing strict isolation and predictable performance.
- On AWS, a P4d instance (8×A100) can be configured so that each GPU exposes seven "1g.5gb" MIG devices, allowing up to 56 GPU-backed pods per node.
- Example: One A100 split into 7 MIGs showed 4.17× higher throughput (but each pod had ~1.7× higher latency).
- AWS reports ~2.5× throughput gains on P4d clusters using MIG.
Setup on EKS:
- Use MIG-capable instances (e.g. P4d/P4de with A100, P5 with H100).
- Deploy the NVIDIA GPU Operator, or install the drivers and container toolkit manually.
- Label nodes with `nvidia.com/mig.config=all-1g.5gb`.
- Request a slice in the pod spec:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```
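Putting those steps together, a minimal pod spec requesting a single MIG slice might look like the following sketch (pod name, container name, and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference                              # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example CUDA base image
      command: ["nvidia-smi", "-L"]                # lists the visible MIG device
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1                 # one 1g.5gb MIG slice
```

The scheduler places the pod only on a node advertising free `nvidia.com/mig-1g.5gb` devices, and the container sees just that slice.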
Benefits:
- Strict performance isolation.
- Predictable behavior per slice.
- Tailor slice sizes for different models.
Limitations:
- Supported only on A100/H100 (Ampere/Hopper).
- Adds config complexity.
- MIG slices reduce raw performance per job.
- No simultaneous profiling sessions across MIG slices.
CUDA Multi-Process Service (MPS)
MPS allows multiple processes (or pods) to share one GPU context and run kernels concurrently via a central MPS daemon.
- Useful for overlapping lightweight inference workloads.
- Newer CUDA versions support per-client limits on GPU memory and active-thread percentage.
Setup on EKS:
- Use supported GPUs (e.g. T4, A100).
- Start the MPS control daemon on each GPU node.
- Use a plugin (e.g., Nebuly's nos plugin) to expose an MPS sharing mode (`sharing.mps`).
- Pod resource example (the resource name depends on the plugin in use):

```yaml
resources:
  limits:
    nvidia.com/gpu-memory: 4Gi
```
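The MPS control daemon runs on the node itself (often wrapped in a privileged DaemonSet). A minimal sketch using NVIDIA's standard MPS tooling; the thread-percentage value is an illustrative choice, not a recommendation:

```shell
# Run on the GPU node. Pin the daemon to GPU 0.
export CUDA_VISIBLE_DEVICES=0
# Optional per-client cap: limit each MPS client to 50% of SM threads.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
# Start the MPS control daemon in the background.
nvidia-cuda-mps-control -d
# To shut it down later:
# echo quit | nvidia-cuda-mps-control
```

CUDA processes started on the node afterwards connect to the daemon automatically and share a single GPU context.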
Benefits:
- Concurrent kernel execution.
- Works on most NVIDIA GPUs.
- High aggregate throughput for inference/batch jobs.
Limitations:
- Poor isolation: a misbehaving pod can crash all clients sharing the GPU.
- Shared memory: prone to contention.
- Manual MPS daemon management.
- Not compatible with MIG on the same device.
GPU Time-Slicing (Fractional GPUs)
Time-slicing configures the NVIDIA device plugin to treat one physical GPU as multiple vGPUs. Internally, it uses context switching.
- E.g., a 10× time-slice configuration advertises one physical GPU as 10 vGPUs.
- This lets 10 pods run on a single GPU, each requesting `nvidia.com/gpu: 1`.
Setup on EKS:
- Patch the device plugin ConfigMap:

```yaml
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```

- Restart the `nvidia-device-plugin-daemonset`.
Benefits:
- Easy to configure.
- Higher pod density on single GPU.
- Great for dev/test, low-usage inference.
Limitations:
- No performance isolation.
- Shared memory; one pod can crash the GPU.
- Overhead from GPU context switching.
- Not suitable for latency-sensitive jobs.
Implementing GPU Sharing on AWS EKS
Cluster Design Best Practices:
- Use GPU instances that match the sharing strategy:
  - MIG → A100/H100 (P4d/P4de, P5).
  - MPS/time-slicing → T4 (G4dn), A10G (G5).
- Use Bottlerocket or GPU-optimized AMIs.
- Auto-scale GPU nodes with Karpenter.
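As a sketch of the Karpenter piece, a NodePool can be restricted to GPU instance families and pre-tainted so only GPU workloads land there. Names and instance families below are assumptions to adapt, and the referenced `EC2NodeClass` is assumed to exist elsewhere:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu                            # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d", "g5"]        # match families to your sharing strategy
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule           # keep non-GPU pods off these nodes
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                      # assumed EC2NodeClass defined separately
```

Karpenter then provisions GPU capacity only when pending pods request GPU resources, and scales it back down when they finish.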
Install GPU Operator:
- Handles driver installation, NVIDIA Container Toolkit, device plugin.
- Includes DCGM exporter for monitoring.
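A typical operator install, assuming Helm is available and using NVIDIA's public chart repository:

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
# into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```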
Configure Device Plugin:
- Time-slicing:

```yaml
migStrategy: none
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```

- MIG: set `migStrategy: single` and label the node:

```
kubectl label node <node> nvidia.com/mig.config=all-1g.5gb
```
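In practice the time-slicing settings are delivered to the device plugin (or GPU Operator) through a ConfigMap. A sketch, with the name and namespace as assumptions to adapt:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # assumed name
  namespace: kube-system              # adjust to your plugin's namespace
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10              # advertise 10 vGPUs per physical GPU
```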
Scheduling & Resources:
- Label and taint GPU nodes.
- Match pod requests to the available slice type:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```

- For time-slicing or MPS, request GPU memory limits if supported.
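Tying labels and taints to workloads, a pod targeting tainted GPU nodes might look like this sketch; the node label, image, and names are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-batch-job                   # illustrative name
spec:
  nodeSelector:
    node.kubernetes.io/gpu: "true"      # assumed label applied to GPU nodes
  tolerations:
    - key: nvidia.com/gpu               # matches the taint on GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: worker
      image: my-registry/ml-worker:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1             # one time-sliced vGPU share
```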
Monitoring and Observability Tools
| Tool | Use Case |
|---|---|
| NVIDIA DCGM Exporter | GPU metrics to Prometheus/Grafana |
| GPU Feature Discovery | Labels GPU nodes with capabilities |
| Karpenter | Auto-scaling GPU nodes based on demand |
| Nebuly Nos Plugin | Advanced sharing with MPS support |
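Once Prometheus scrapes the DCGM exporter, a few standard metrics cover most dashboards. The queries below use well-known DCGM metric names; label names can vary with exporter configuration:

```
# GPU utilization (%) per GPU
DCGM_FI_DEV_GPU_UTIL

# Framebuffer memory used (MiB) per GPU
DCGM_FI_DEV_FB_USED

# Average GPU utilization per node over 5 minutes
avg by (instance) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```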
Summary: When to Use What?
| Sharing Method | Best For | Isolation | Hardware Support | Complexity |
|---|---|---|---|---|
| MIG | Multi-tenant inference | High | A100, H100 only | Medium |
| MPS | Throughput-heavy batch jobs | Low | Most NVIDIA GPUs | Medium |
| Time-slicing | Lightweight, cost-sensitive workloads | None | Most NVIDIA GPUs | Low |
Conclusion
Efficient GPU utilization is no longer a luxury—it’s a necessity for scaling AI/ML workloads cost-effectively. Techniques like MIG, CUDA MPS, and time-slicing unlock hidden performance and dramatically improve ROI, especially in Kubernetes environments like EKS.
By combining these strategies with robust observability and auto-scaling, organizations can minimize idle resources while maintaining performance for critical inference or training pipelines.
Whether you’re just beginning to explore GPU sharing or you’re optimizing large-scale ML pipelines, mastering these techniques gives your team a competitive edge.
KubeAce Support
At KubeAce, we help organizations design, deploy, and optimize AI/ML infrastructure on Kubernetes, especially on Amazon Elastic Kubernetes Service (EKS). Our engineers bring deep expertise in GPU workload management, including:
- Configuring MIG on NVIDIA A100s in EKS clusters
- Tuning MPS and time-slicing for cost-effective inference at scale
- Integrating monitoring and observability with Prometheus, DCGM, and Grafana
- Ensuring workloads auto-scale efficiently without waste
If you’re struggling with GPU underutilization or want to make the most of your ML infrastructure investment, we’re here to help.
Contact us at info@kubeace.com or visit kubeace.com.
Let’s build intelligent, efficient, and scalable AI platforms—together.