CAUTION: This is an beta / non-production software, do not use on production clusters.
Intel GPU Base operator allows automatic deployment of GPU related components to enable use of Intel GPU hardware within the Kubernetes cluster.
Intel GPU Base operator deploys components to a cluster to make it possible to run Intel GPU workloads on the cluster nodes. It can configure the cluster with either Device Plugin or Dynamic Resource Allocation (DRA) based approach. In addition to registering GPU resources for the cluster, it can also deploy Intel XPU-Manager. If the cluster has Prometheus enabled, operator can deploy the required Kubernetes components to integrate GPU metrics to the Prometheus database.
Operator does its best to detect cluster features and to only deploy components that are supported. For example, GPU DRA driver is only deployed from certain Kubernetes version onwards.
When the operator is requested with device plugin, resource monitoring:
apiVersion: intel.com/v1alpha1
kind: ClusterPolicy
metadata:
name: gpu-dp
spec:
resourceRegistration: dp
useNFDLabeling: true
resourceMonitoring: true
the end result will look something like this:
$ kubectl get pods -n intel-gpu-base-operator
NAME READY STATUS RESTARTS AGE
gpu-dp-device-plugin-5txt5 2/2 Running 0 2m4s
gpu-dp-xpu-manager-hdg6b 2/2 Running 0 2m4s
intel-gpu-base-operator-controller-manager-749cb69d6c-kqnzw 1/1 Running 0 124m
Device Plugin GPU resources can be seen like so:
$ kubectl get node node1 -o yaml | yq .status.allocatable
...
gpu.intel.com/i915: "1"
gpu.intel.com/i915_monitoring: "1"
...
When the operator is requested with DRA, resource monitoring and with NFD:
apiVersion: intel.com/v1alpha1
kind: ClusterPolicy
metadata:
name: gpu-dra
spec:
resourceRegistration: dra
useNFDLabeling: true
resourceMonitoring: true
the end result will look something like this:
$ kubectl get pods -n intel-gpu-base-operator
NAME READY STATUS RESTARTS AGE
gpu-dra-gpu-dra-rqr94 1/1 Running 0 88s
gpu-dra-xpu-manager-cqsbj 2/2 Running 0 88s
intel-gpu-base-operator-controller-manager-749cb69d6c-kqnzw 1/1 Running 0 147m
DRA GPU resource slices can be seen like so:
$ kubectl get resourceslices.resource.k8s.io
NAME NODE DRIVER POOL AGE
node1 node1 gpu.intel.com node1 111s
| Dependency | Level | Purpose | Chart integration |
|---|---|---|---|
| Cert Manager | Mandatory | The operator uses webhooks which have hooks to cert-manager's TLS apply. | None |
| Node Feature Discovery | Recommended | Allows scheduling Pods only to Nodes with Intel GPU hardware. | Optionally installed |
| Kueue | Recommended | Enables advanced Pod scheduling mechanisms. | Optionally installed |
| Prometheus | Recommended | Accesses GPU metrics from all Nodes from one endpoint. | None |
The preferred installation method to the cluster via our Helm charts.
Helm deployment is split into two charts: operator and policy. The reason for this split is to allow the operator to run cleanup before it is removed from the cluster. DRA especially is problematic as Pods using its resources (e.g. XPU Manager) will get stuck at Terminating if the DRA plugin is removed from the cluster.
The basic installation is as follows. The operator:
kubectl create ns intel-gpu-operator
# Required by DRA's admin access
kubectl label ns intel-gpu-operator resource.kubernetes.io/admin-access=true
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
gpu-operator oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-chart
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
gpu-policy oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-policy-chart
By default it installs the operator and a basic Device Plugin enabled deployment with Intel XPU Manager. Node Feature Discovery and Kueue may be installed with the operator chart. The installation depends on kueue.install and nfd.install parameters.
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
gpu-operator oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-chart
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
--set resourceRegistration=dra \
gpu-policy oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-policy-chart
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
--set nfd.install=true \
--set kueue.install=true \
gpu-operator oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-chart
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
--set resourceRegistration=dra \
--set useNFDLabeling=true \
--set enableKueue=true \
gpu-policy oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-policy-chart
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
--set nfd.install=true \
gpu-operator oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-chart
helm install --namespace "intel-gpu-operator" --version 0.0.1 \
--set resourceRegistration=dp \
--set useNFDLabeling=true \
gpu-policy oci://ghcr.io/intel/gpu-base-operator/intel-gpu-base-operator-policy-chart
Uninstalling the charts:
helm uninstall --namespace "intel-gpu-operator" gpu-policy --wait
helm uninstall --namespace "intel-gpu-operator" gpu-operator
kubectl delete ns intel-gpu-operator
See more details for the charts in the operator and policy READMEs.
CR fields control how the operator configures the cluster. See the full struct for all options.
| Field | Description | Default |
|---|---|---|
spec.resourceRegistration |
Resource registration mode: dp (Device Plugin) or dra (Dynamic Resource Allocation) |
— |
spec.resourceMonitoring |
Deploy XPU-Manager for per-GPU telemetry and health monitoring | false |
spec.useNFDLabeling |
Deploy an NFD NodeFeatureRule to label Intel GPU nodes; requires NFD to be installed | false |
spec.prometheusIntegration |
Create a ServiceMonitor so Prometheus scrapes XPU-Manager GPU metrics; requires Prometheus Operator | false |
spec.enableKueue |
Create Kueue ResourceFlavor / ClusterQueue / LocalQueue resources for GPU workload queuing | false |
spec.logLevel |
Default log level for all deployed components (0–4) | 2 |
Used when spec.resourceRegistration: dp.
| Field | Description | Default |
|---|---|---|
spec.dp.allowIDs |
Only register GPUs whose PCI device ID is in this list. Format: ['0xabcd']. Cannot be combined with denyIDs |
[] |
spec.dp.denyIDs |
Exclude GPUs whose PCI device ID is in this list. Format: ['0xabcd']. Cannot be combined with allowIDs |
[] |
spec.dp.byPathMode |
Controls which DRI by-path symlinks are exposed to containers: single, all, or none |
single |
Used when spec.resourceRegistration: dra.
| Field | Description | Default |
|---|---|---|
spec.dra.deviceTaints |
Apply taints to GPU devices reported as unhealthy by health monitoring | false |
spec.dra.podHealthCheck |
Enable health check for DRA Pod | true |
Applies to both DP and DRA unless noted. Thresholds that are exceeded mark the GPU as unhealthy.
| Field | Description | Default |
|---|---|---|
spec.health.coreTemperatureThreshold |
GPU core temperature limit in °C (1–130) | 100 |
spec.health.memoryTemperatureThreshold |
GPU memory temperature limit in °C (1–130) | 100 |
spec.health.checkIntervalSeconds |
How often health is evaluated in seconds (1–3600). Not supported by Device Plugin | 5 |
| Field | Description | Default |
|---|---|---|
spec.xpu.monitoringResource |
Set XPUMD resource for Device Plugin use. | xe_monitoring |
spec.xpu.configMapOverride |
Name of a ConfigMap in the operator namespace containing a custom OpenTelemetry Collector config.yaml |
— |
When enableKueue: true, the operator creates Kueue ClusterQueue and LocalQueue resources to enable GPU-aware job scheduling via Kueue.
The operator currently supports one configuration for Kueue: equal division. With equal division, the operator discovers all GPU resources available in the cluster and divides them equally across the ClusterQueues defined in spec.kueue.equalResources. It also automatically creates a single ResourceFlavor that covers all GPU resource types. Other configuration options and resource division schemas will be added in the future.
| Field | Description |
|---|---|
spec.kueue.equalResources |
List of ClusterQueues. GPU capacity is split evenly across all entries |
spec.kueue.equalResources[].name |
Name of the Kueue ClusterQueue to create |
spec.kueue.equalResources[].localQueues |
List of LocalQueue resources to create for this ClusterQueue |
spec.kueue.equalResources[].localQueues[].name |
Name of the LocalQueue |
spec.kueue.equalResources[].localQueues[].namespace |
Namespace in which the LocalQueue is created |
NOTE: Kueue prevents Pods from consuming more resource than defined in
ClusterQueues. Kueue does not prevent Pods in oneLocalQueuesfrom consuming all the resources in theClusterQueue, thus leaving no resources for the otherLocalQueues.
To build the operator without containerization.
Prerequisites:
- golang (1.24+)
- make
go mod vendor
make build
To build the operator container.
Prerequisites for the docker-build:
- docker version 17.03+
- make
make docker-build
While developing the operator, it is sometimes required to re-generate the autogenerated code. This can be done by
make generate
If the operator receives (or loses) access rights to the cluster, the deployment files can be updated with:
make manifests
One can run the operator unit tests with:
make test
Cert-manager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.19.4/cert-manager.yaml
Node Feature Discovery (optional):
kubectl apply -k config/nfd
Kueue (optional):
kubectl apply -k config/kueue
NOTE: As the operator's images are not available anywhere yet, one has to build the container and store it in a registry somewhere.
To store the operator container image in the container-runtime image cache:
docker save ghcr.io/intel/intel-gpu-base-operator:devel | sudo ctr -n k8s.io image import -
To deploy the operator:
kubectl apply -k config/default
To undeploy the operator:
kubectl delete -k config/default
To deploy a Device Plugin CR:
kubectl apply -k config/samples/deviceplugin/
To deploy a DRA CR:
kubectl apply -k config/samples/dra/
WARNING: If the operator is deleted when DRA CR is configured, the XPU-Manager Pod has the tendency to get stuck in
Terminatingphase. This is because DRA needs to tear down the claim for the Pod to get Terminated. If that happens, the quickest fix is to redeploy the operator and custom resource, and then remove the custom resource properly.
The Intel GPU base operator supports updating the GPU firmware on the cluster nodes. The update is handled via a GPUFirmwareUpdate CRD. The update flow and details are explained in the FW update documentation.
Contributions to this project are welcome as issues (bugs, enhancement requests) or via pull requests. Please review our Code of Conduct and our note on security policy.
Copyright 2025 Intel Corporation. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries.