Skip to main content
Version: Next

Serving

This page walks you through deploying and managing an AI model as an NPU/GPU-based inference service.

Serving List

In the left sidebar, click Development > Serving.

Serving list

Click the Connect button to find the endpoint URL for accessing the deployed service. An inference endpoint URL in the form https://<host>/<project>/<deployment-name> is shown. Append the inference framework's API path (e.g., /v1/chat/completions) to send inference requests from outside.

Status

StatusDescriptionRecovery
ReadyAll Pods are in the Ready state. The service is operating normally.
StartingPods are starting. Not yet Ready, e.g., still loading the model.Wait briefly. If it persists, check the logs.
DegradedOnly some Pods are Ready. Requests are still served but overall performance is reduced.Check the logs and events of the failing Pods.
ErrorOne or more Pods are in an error state such as CrashLoopBackOff.Click the Status column → check failureReason and logs in the popover.
PendingPods are not scheduled. Resource shortage or image pull failure.Check cluster resource availability and image settings.
Scaled DownReplicas were scaled to 0.Change Replicas to 1 or more if needed.

Hover or click on the Status column to see the Pod status popover. The popover includes the main error reason, the Ready count, the list of failing Pods, and a View logs link for each Pod (opens a new tab to the Logs tab).


Create a Serving

Click Create to go to the creation page. Creation proceeds in 3 steps.

Serving creation - Basic information

FieldDescriptionRequired
Service NameServing name (lowercase, digits, hyphens, up to 63 characters)
DescriptionServing description-
Service TemplateInference framework selection

Serving Detail Page

Click an item in the Serving list to go to its detail page. Click the Edit button at the top right of the Overview tab to switch into edit mode; after changes, click the Save Changes button in the Floating Save Bar at the bottom of the screen to apply. If changes that require a Pod restart (image, port, resources, volumes, etc.) are included, a confirmation dialog is shown.

Serving detail - Overview edit mode

Card Layout

CardDescription
StatusReady Replicas, Health, Auto Scaling, creation time
Basic InformationServing name, description
ContainerContainer image, inference port
ResourcesCPU, Memory, accelerator type/count, Replicas
Command & ArgumentsContainer start command and arguments
Environment VariablesList of environment variables
VolumesList of mounted PVCs
TransformerPre/post-processing sidecar settings
PodsPer-Pod status table — failing Pods sorted first

Pods section

Serving detail — Pods section

Shows per-Pod status, node, restart count, age, and failure reason in a table.

ColumnDescription
StatusCurrent status of the Pod (Running / Pending / CrashLoopBackOff, etc.)
NodeName of the node the Pod is scheduled on
RestartsContainer restart count
AgeTime since the Pod was created
ReasonFailure reason shown when Ready=false or on error
View logsLog link shown for Pods where Ready=false or restartCount > 0. Opens a new tab to the Logs tab of that Pod.

Failing Pods are sorted to the top of the table.


Naming Rules

  • Lowercase letters, digits, and hyphens (-) are allowed
  • Must not start or end with a hyphen
  • Up to 63 characters (Kubernetes limit)

Examples: my-model-v1, llm-server-prod


Supported Accelerators

AcceleratorResource KeyAvailable Features
NVIDIA GPUnvidia.com/gpuLab, Serving
Furiosa RNGDfuriosa.ai/rngdLab, Serving