HAMi Enterprise Online Deployment Guide
This document is intended for SREs / platform engineers and describes how to deploy HAMi Enterprise on a Kubernetes cluster, enable GPU nodes, integrate monitoring, and verify functionality.
⚠️ Installation ≠ Activation
After you complete the Helm chart installation described in this guide, the core components of HAMi Enterprise will be running, but GPU virtualization and scheduling require licence activation before they can be used.
The installation itself does not depend on a licence; you can complete the deployment first, then apply for and import the licence in subsequent steps.
In short: install the software first, then obtain the licence. Until the licence is activated, vGPU partitioning and scheduling are unavailable, and functional verification will fail.
Prerequisites Checklist
| Type | Requirement | Verification Command |
|---|---|---|
| Kubernetes | ≥ 1.24 | kubectl version |
| Container Runtime | containerd or Docker | kubectl get nodes -o wide |
| Helm | ≥ 3.14 | helm version --short |
| GPU Driver | NVIDIA driver ≥ 470 (recommended ≥ 550) | nvidia-smi |
| Prometheus CRD | The Prometheus monitoring CRDs must be installed; they keep the deployment compatible with different metric collection systems such as Prometheus and VictoriaMetrics. | kubectl api-resources \| grep monitoring.coreos.com/v1 |
| GPU Operator | Installed with devicePlugin.enabled=false (recommended version: v25.3.2) | helm list -A \| grep gpu-operator |
| Storage Space | Recommended greater than 30 GB | df -h |
Key Constraint: HAMi includes its own device-plugin, which conflicts with the built-in device-plugin of the NVIDIA GPU Operator. If GPU Operator is already installed, be sure to disable its built-in plugin with --set devicePlugin.enabled=false.
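The version requirements in the checklist can be checked in one pass. The following is a minimal sketch, not an official preflight tool: the helper names `semver`, `version_ge`, and `check` are hypothetical, the comparison relies on GNU `sort -V`, and `nvidia-smi` only reports a driver version when the script runs on a GPU node. Any tool that is missing is reported as `SKIP` rather than as a failure.

```shell
#!/usr/bin/env bash
# semver: keep only the leading MAJOR.MINOR.PATCH of a version string on stdin.
semver() {
  sed -E 's/^v?([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# version_ge A B: succeeds when version A >= version B (GNU sort -V ordering).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check NAME MINIMUM ACTUAL: print a one-line verdict for a component.
check() {
  if [ -z "$3" ]; then
    echo "SKIP $1 (not found or version unreadable)"
  elif version_ge "$3" "$2"; then
    echo "OK   $1 $3 (>= $2)"
  else
    echo "FAIL $1 $3 (< $2)"
  fi
}

# Collect versions; each value stays empty if the tool or cluster is unavailable.
k8s_ver=$(kubectl version -o json 2>/dev/null \
  | jq -r '.serverVersion.gitVersion // empty' 2>/dev/null | semver) || true
helm_ver=$(helm version --short 2>/dev/null | semver) || true
drv_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
  | head -n1) || true

check kubernetes 1.24 "$k8s_ver"
check helm 3.14 "$helm_ver"
check nvidia-driver 470 "$drv_ver"
```

The script only reports; it does not check the GPU Operator setting or the Prometheus CRDs, which are easier to confirm with the commands in the table.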
Installation
There are two installation paths; choose the one that fits your scenario:
- Online OCI installation (evaluation, PoC, clusters with external network access)
- All-in-One offline package (finance / government / telecom isolated network scenarios)
Whichever path you choose, you will eventually need to apply for a licence and activate it.
Path A: Online Helm Charts Installation
If you want to use a Chinese mainland mirror registry, please contact Dynamia.ai's pre-sales / technical support for relevant information.
It is recommended to use a version control system to maintain the values files for all Helm Chart releases in the cluster. Use -f example-values.yaml to override the corresponding keys in the default values of the Charts.
After selecting the appropriate kubeconfig context, proceed with the following steps:
If nvidia/gpu-operator is not installed, install it first.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set devicePlugin.enabled=false \
--set dcgmExporter.serviceMonitor.enabled=true \
--version=v25.3.2
If the cluster does not have a Prometheus monitoring stack, it also needs to be installed. Here we show how to install prometheus-community/kube-prometheus-stack.
helm install prometheus \
oci://ghcr.io/prometheus-community/charts/kube-prometheus-stack \
--version 72.3.0 \
--namespace monitoring \
--create-namespace \
--set alertmanager.enabled=false \
--set grafana.enabled=false
Install dynamia-ai/hami-enterprise:
helm install hami \
oci://ghcr.io/dynamia-ai/charts/hami-enterprise \
--version 2.9.0-rc2 \
--namespace hami-system \
--create-namespace
Common Chart customization options for hami-enterprise are shown in the table below. For the complete values configuration, see: HAMi Helm Chart Values Reference.
| Parameter | Description | Default Value |
|---|---|---|
| dra.enabled | Whether to deploy and enable DRA | false |
| scheduler.leaderElect | Whether to enable multi-node leader election for hami-scheduler | true |
| scheduler.replicas | Number of hami-scheduler instances | 1 |
| scheduler.kubeScheduler.image.registry | Image registry for the kube-scheduler used by hami-scheduler. | "registry.cn-hangzhou.aliyuncs.com" |
| scheduler.kubeScheduler.image.repository | Image repository for the kube-scheduler used by hami-scheduler. | "google_containers/kube-scheduler" |
| scheduler.kubeScheduler.image.tag | Image tag for the kube-scheduler used by hami-scheduler. If left blank, the chart will infer a suitable version. | "" |
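The options in the table above can be kept in a values file under version control and passed to Helm with -f. The sketch below writes such a file and defines, without running, an install command that consumes it. The function name `install_hami` is hypothetical and the value `replicas: 2` is only an illustrative choice; the YAML nesting is inferred from the dotted parameter names.

```shell
# Write an override file containing only the keys to set explicitly.
cat > example-values.yaml <<'EOF'
scheduler:
  replicas: 2
  leaderElect: true
dra:
  enabled: false
EOF

# Define (but do not run) the install command that consumes the file.
# "helm upgrade --install" installs the release if absent, upgrades it otherwise.
install_hami() {
  helm upgrade --install hami \
    oci://ghcr.io/dynamia-ai/charts/hami-enterprise \
    --version 2.9.0-rc2 \
    --namespace hami-system --create-namespace \
    -f example-values.yaml
}
```

Call `install_hami` once the correct kubeconfig context is selected. Re-running it after editing example-values.yaml applies the change to the existing release.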
Path B: All-in-One Offline Package
Please contact Dynamia.ai's pre-sales / technical support partners for the download address and operation manual.
Enable GPU Nodes
The HAMi device-plugin starts only on nodes that carry the gpu=on label, which you can apply manually:
kubectl label nodes <node-name> gpu=on
Verify: kubectl -n hami-system get pods should show hami-device-plugin-* and hami-scheduler-* in Running state.
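When several GPU nodes need the label, a small wrapper avoids repeating the command. This is a sketch; the function name `label_gpu_nodes` is hypothetical. It uses `--overwrite` so that re-running it on an already labeled node does not fail, then lists every node that carries the label.

```shell
# label_gpu_nodes NODE...: apply gpu=on to each node, then list labeled nodes.
label_gpu_nodes() {
  local node
  for node in "$@"; do
    kubectl label nodes "$node" gpu=on --overwrite
  done
  # Label selector query: shows only nodes where gpu=on is set.
  kubectl get nodes -l gpu=on
}
```

For example, `label_gpu_nodes <node-a> <node-b>` labels two nodes and prints the resulting list.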
Monitoring Integration
Ensure that the metric collection system in the cluster (kube-prometheus-stack Prometheus, VictoriaMetrics vmagent, etc.) can collect HAMi and DCGM-Exporter metrics.
If using Prometheus, the metadata.labels of the ServiceMonitor resource must match the spec.serviceMonitorSelector field of the Prometheus resource; otherwise Prometheus will not collect these metrics.
If using VictoriaMetrics, the metadata.labels of the ServiceMonitor resource (carried over to the VMServiceScrape that the VictoriaMetrics operator generates from it) must match the spec.serviceScrapeSelector field of the VMAgent resource; otherwise vmagent will not collect these metrics.
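For the Prometheus case, one way to compare the two sides of the match is to print the selector of every Prometheus resource next to the labels of the ServiceMonitor resources. The sketch below assumes the HAMi ServiceMonitor objects live in the hami-system namespace (the namespace used for the chart in this guide); the function name `show_selector_and_labels` is hypothetical.

```shell
# show_selector_and_labels: print Prometheus selectors and ServiceMonitor labels.
show_selector_and_labels() {
  echo "# Prometheus serviceMonitorSelector per resource:"
  kubectl get prometheus -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.spec.serviceMonitorSelector}{"\n"}{end}'
  echo "# ServiceMonitor labels in hami-system:"
  kubectl -n hami-system get servicemonitor --show-labels
}
```

If a key/value pair in the selector is absent from the LABELS column, that ServiceMonitor will not be picked up.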
Verify Metric Collection
| Exporter | Query Metric | Expected |
|---|---|---|
| dcgm-exporter | DCGM_FI_DEV_GPU_UTIL | Returns a non-empty value |
| hami-exporter | HostCoreUtilization | Returns a non-empty value |
| hami-device-plugin-exporter | GPUDeviceCoreAllocated | Returns a non-empty value |
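The three queries in the table can be run against the Prometheus HTTP API (`/api/v1/query`) instead of the web UI. The sketch below assumes Prometheus is reachable at `PROM_URL`, for example through a local port-forward, and that `curl` and `jq` are installed; the function names are hypothetical.

```shell
# Base URL of the Prometheus API; override via the environment if needed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# series_count: read an /api/v1/query response on stdin, print the series count.
series_count() {
  jq '.data.result | length'
}

# check_metric NAME: query Prometheus and report whether the metric has data.
check_metric() {
  local n
  n=$(curl -fsS --get "$PROM_URL/api/v1/query" \
        --data-urlencode "query=$1" | series_count)
  if [ "${n:-0}" -gt 0 ]; then
    echo "OK   $1 ($n series)"
  else
    echo "MISS $1"
  fi
}

# check_all_metrics: run the check for each metric listed in the table.
check_all_metrics() {
  local m
  for m in DCGM_FI_DEV_GPU_UTIL HostCoreUtilization GPUDeviceCoreAllocated; do
    check_metric "$m"
  done
}
```

Run `check_all_metrics` after the exporters have been scraped at least once; a `MISS` line points at the exporter whose ServiceMonitor is not being selected.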
Licence Acquisition
Please complete the installation tasks above and ensure that all component Pods are running normally before starting the activation process.
Execute the following script to collect licence information (requires kubectl, jq):
# Online script acquisition
curl -fsSL https://public.hami.run/collect-hami-license-info.sh | bash
# Offline installation (included in the package)
bash collect-hami-license-info.sh
After execution, you will see the following JSON content:
{
"esn": "96565d61-986a-4918-aafb-448ff6e3746b",
"deviceInstances": [
{
"uuid": "GPU-ceee905d-48ac-93de-a81b-17c00e1e5e02",
"deviceType": "NVIDIA A10"
}
]
}
Send the above JSON to Dynamia.ai's pre-sales / technical support to obtain the licence.
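Before sending the JSON, it can be worth confirming that it contains an ESN and at least one GPU entry. The sketch below assumes the script output was saved to a file (the name license-info.json is hypothetical) and uses `jq`; the function name `summarize_license_info` is also hypothetical.

```shell
# summarize_license_info FILE: print the ESN and one line per GPU.
# Exits non-zero if the ESN is empty or no device instances are listed.
summarize_license_info() {
  jq -r '
    if ((.esn // "") == "") or (((.deviceInstances // []) | length) == 0)
    then error("missing esn or deviceInstances")
    else "esn: \(.esn)", (.deviceInstances[] | "\(.deviceType)\t\(.uuid)")
    end' "$1"
}
```

For example, `summarize_license_info license-info.json` prints the ESN followed by each device type and UUID, which makes it easy to spot a GPU node that was missed during collection.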
Post-Activation Verification
# 1. Pod status
kubectl -n hami-system get pods
# 2. GPU resources registered by the Device Plugin
kubectl describe node <gpu-node> | grep -A 10 'Capacity:'
# Expected to see: nvidia.com/gpu: <N> and nvidia.com/gpumem: <MB>
# 3. Submit a test Pod to verify scheduling
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: hami-smoke
spec:
restartPolicy: Never
containers:
- name: cuda
image: nvidia/cuda:12.4.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
nvidia.com/gpumem: 2000
EOF
kubectl logs hami-smoke
Expected: the nvidia-smi output should show GPU information, and the video memory should be limited to 2000 MiB.
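Because the test Pod needs time to pull its image and run, the logs may be empty if they are read immediately. The sketch below waits for the Pod to reach the Succeeded phase and then extracts the total memory figure from the nvidia-smi table. The function names are hypothetical, and the parser assumes the usual `used / total` memory column (for example `0MiB / 2000MiB`) in the nvidia-smi output.

```shell
# total_mib: read nvidia-smi output on stdin, print the first memory total in MiB.
total_mib() {
  grep -oE '/ *[0-9]+MiB' | head -n1 | grep -oE '[0-9]+'
}

# smoke_check: wait for hami-smoke to finish, then report the memory it saw.
smoke_check() {
  local total
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
    pod/hami-smoke --timeout=120s || return 1
  total=$(kubectl logs hami-smoke | total_mib)
  echo "reported total memory: ${total:-unknown} MiB"
}
```

Run `smoke_check`, compare the reported figure with the nvidia.com/gpumem limit in the manifest, and remove the Pod afterwards with `kubectl delete pod hami-smoke`.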
FAQ
| Symptom | Possible Cause | Resolution |
|---|---|---|
| Image cannot be pulled | Node has no external network access or poor connectivity to ghcr.io. | Contact Dynamia.ai's pre-sales / technical support for a Chinese mainland mirror registry address or the All-in-One offline package. |
| hami-device-plugin Pod Pending or does not exist | Node is not labeled with gpu=on | kubectl label nodes <node> gpu=on |
| hami-device-plugin Pod CrashLoopBackOff | Conflicts with the NVIDIA default device-plugin | Disable GPU Operator's devicePlugin (--set devicePlugin.enabled=false). |
| HAMi metrics not found | The serviceMonitorSelector of the Prometheus resource does not match the labels in the ServiceMonitor resource | Align the spec.serviceMonitorSelector of prometheus/prometheus-kube-prometheus-prometheus with the serviceMonitor labels of hami-enterprise. |
| nvidia-smi error | GPU driver not ready | Check the driver Pod status under the gpu-operator namespace. |
| Example workload Pending | Licence not activated, insufficient GPU, or missing node labels | Check the licence, GPU node labels, and kubectl describe pod events. |
Get Support
- Email: info@dynamia.ai
- Pre-sales / Technical Support: 400-026-7800
- Customers with a signed commercial contract should submit issues through the dedicated support channel.