Serving

This page walks you through deploying and managing an AI model as an NPU/GPU-based inference service.

Serving List

In the left sidebar, click Development > Serving.

Serving list

Click the Connect button to see the endpoint URL of a deployed service. The inference endpoint URL has the form https://<host>/<project>/<deployment-name>. To send inference requests from external clients, append the inference framework's API path (e.g., /v1/chat/completions) to this URL.
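As a minimal sketch of how the endpoint URL and the API path fit together, the helper below joins the two parts. The function name and the host, project, and deployment values are hypothetical placeholders, not values from the platform.

```python
def inference_url(host: str, project: str, deployment: str, api_path: str) -> str:
    """Join the endpoint URL shown by Connect with a framework API path."""
    base = f"https://{host}/{project}/{deployment}"
    # Normalize slashes so the result has exactly one "/" at the join point.
    return base.rstrip("/") + "/" + api_path.lstrip("/")


# Hypothetical values, for illustration only.
url = inference_url(
    "serving.example.com", "demo-project", "my-model-v1", "/v1/chat/completions"
)
print(url)
```

Normalizing the slashes means the path can be passed with or without a leading slash.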

Status

Status	Description	Recovery
Ready	All Pods are in the Ready state. The service is operating normally.	—
Starting	Pods are starting but are not yet Ready (for example, the model is still loading).	Wait briefly. If the status persists, check the logs.
Degraded	Only some Pods are Ready. Requests are still served but overall performance is reduced.	Check the logs and events of the failing Pods.
Error	One or more Pods are in an error state such as CrashLoopBackOff.	Click the Status column → check failureReason and logs in the popover.
Pending	Pods have not been scheduled. Likely causes are a resource shortage or an image pull failure.	Check cluster resource availability and image settings.
Scaled Down	Replicas were scaled to 0.	Change Replicas to 1 or more if needed.
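One way to read the table above is as a mapping from per-Pod state to a single status. The sketch below is only an illustration of that reading: the `Pod` type, the set of error reasons, and especially the precedence between overlapping cases (for example, Error versus Degraded) are assumptions, and the platform's actual logic may differ.

```python
from dataclasses import dataclass

# Assumption: only CrashLoopBackOff is named in the table; other reasons may exist.
ERROR_REASONS = {"CrashLoopBackOff"}


@dataclass
class Pod:
    ready: bool
    scheduled: bool = True
    reason: str = ""


def summarize_status(replicas: int, pods: list) -> str:
    """Collapse per-Pod state into one status (assumed precedence)."""
    if replicas == 0:
        return "Scaled Down"
    if any(p.reason in ERROR_REASONS for p in pods):
        return "Error"
    ready = sum(1 for p in pods if p.ready)
    if pods and ready == len(pods):
        return "Ready"
    if ready > 0:
        return "Degraded"
    if not pods or any(not p.scheduled for p in pods):
        return "Pending"
    return "Starting"


print(summarize_status(2, [Pod(ready=True), Pod(ready=False)]))
```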

Hover over or click the Status column to open the Pod status popover. The popover shows the main error reason, the Ready count, the list of failing Pods, and a View logs link for each Pod. The link opens that Pod's Logs tab in a new browser tab.

Create a Serving

Click Create to go to the creation page. Creation proceeds in 3 steps.

Step 1. Basic Information
Step 2. Detailed Settings
Step 3. Review & Deploy

Serving creation - Basic information

Field	Description	Required
Service Name	Serving name (lowercase, digits, hyphens, up to 63 characters)	✓
Description	Serving description	-
Service Template	Inference framework selection	✓

Common settings:

Common settings

Field	Description	Default
Image	Container image	-
CPU	Number of CPU cores	0.5
Memory	Memory	1Gi
Accelerator	Accelerator type	None
Accelerator Count	Number of accelerators	1
Replicas	Number of service containers	1

Per-template additional settings:

vLLM
Custom

vLLM additional settings

Field	Description	Default
Model	Model name or path (e.g., `meta-llama/Llama-3.1-8B-Instruct`)	-
Tensor Parallel Size	Number of GPUs to split the model across	1
Data Type	Numeric representation for model operations	Auto
Max Model Length	Maximum tokens processed (input + output combined)	-
Quantization	Reduce model precision to save memory	-
GPU Memory Utilization	Fraction of GPU memory vLLM will use (0.0–1.0)	0.9
Additional Arguments	Enter advanced vLLM settings directly	-
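The fields above correspond closely to standard vLLM command-line options. The sketch below shows one plausible mapping from the form fields to a `vllm serve` argument list. The function name and the exact way the platform assembles the command are assumptions; the flags themselves (`--tensor-parallel-size`, `--dtype`, `--max-model-len`, `--quantization`, `--gpu-memory-utilization`) are existing vLLM options.

```python
def vllm_args(
    model: str,
    tensor_parallel_size: int = 1,
    dtype: str = "auto",
    max_model_len=None,
    quantization=None,
    gpu_memory_utilization: float = 0.9,
    extra_args=(),
) -> list:
    """Build a `vllm serve` argument list from the form fields (illustrative)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    # The form allows 0.0-1.0; a value of 0 would leave vLLM no memory, so reject it here.
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0.0, 1.0]")
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--dtype", dtype,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # Optional fields are passed only when set, so vLLM's own defaults apply otherwise.
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    if quantization:
        args += ["--quantization", quantization]
    # Additional Arguments are appended verbatim.
    return args + list(extra_args)


print(" ".join(vllm_args("meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)))
```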

Data Volumes (optional):

Add the PVCs you want to mount. The built-in volumes (model-cache, dshm) are shown as read-only system defaults; in addition to these, you can mount PVCs that you created.

Serving creation - Data Volumes section

You can mount additional user PVCs via the + Add Volume button.

Advanced Settings (optional):

Field	Description	Default
Inference Port	Service inference port	8000
Command Override	Container start command	-
Environment Variables	Environment variables (KEY=VALUE or env file)	-
Transformer	Add a pre/post-processing sidecar	-
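Environment variables are entered as KEY=VALUE pairs or as an env file. The sketch below parses that kind of input into a dictionary. Skipping blank lines and `#` comments is an assumption borrowed from common env-file conventions; the platform's own parser may behave differently.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict (illustrative, assumed rules)."""
    env = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with "#" are ignored.
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"line {lineno}: expected KEY=VALUE, got {raw!r}")
        env[key] = value.strip()
    return env


print(parse_env("HF_HOME=/models\n# comment\nLOG_LEVEL=info"))
```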

Serving Detail Page

Click an item in the Serving list to open its detail page. To edit a Serving, click the Edit button at the top right of the Overview tab, make your changes, and then click Save Changes in the floating save bar at the bottom of the screen. If the changes require a Pod restart (image, port, resources, volumes, etc.), a confirmation dialog is shown.

The detail page has five tabs: Overview, Metrics, Logs, Async Queue, and Settings.

Serving detail - Overview edit mode

Card Layout

Card	Description
Status	Ready Replicas, Health, Auto Scaling, creation time
Basic Information	Serving name, description
Container	Container image, inference port
Resources	CPU, Memory, accelerator type/count, Replicas
Command & Arguments	Container start command and arguments
Environment Variables	List of environment variables
Volumes	List of mounted PVCs
Transformer	Pre/post-processing sidecar settings
Pods	Per-Pod status table — failing Pods sorted first

Pods section

Serving detail — Pods section

Shows per-Pod status, node, restart count, age, and failure reason in a table.

Column	Description
Status	Current status of the Pod (Running / Pending / CrashLoopBackOff, etc.)
Node	Name of the node the Pod is scheduled on
Restarts	Container restart count
Age	Time since the Pod was created
Reason	Failure reason shown when Ready=false or on error
View logs	Log link shown for Pods where Ready=false or restartCount > 0. Opens that Pod's Logs tab in a new browser tab.

Failing Pods are sorted to the top of the table.
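The ordering and the View logs rule can be sketched as follows. The `PodRow` type and its field names are hypothetical, and treating "failing" as "not Ready" is an assumption made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PodRow:
    name: str
    ready: bool
    restart_count: int = 0


def sort_failing_first(rows: list) -> list:
    """Not-Ready Pods first; sorted() is stable, so ties keep their order."""
    return sorted(rows, key=lambda row: row.ready)  # False sorts before True


def shows_view_logs(row: PodRow) -> bool:
    """Mirror the rule in the table: Ready=false or restartCount > 0."""
    return (not row.ready) or row.restart_count > 0


rows = [PodRow("pod-a", True), PodRow("pod-b", False, 3), PodRow("pod-c", True, 1)]
for row in sort_failing_first(rows):
    print(row.name, "View logs" if shows_view_logs(row) else "")
```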

Serving detail - Metrics

Use the time range buttons (1h / 6h / 24h) at the top right to adjust the graph period.

Card	Information
Total Requests	Total request count
Latency (P50 / P95 / P99)	Current request response times
RPS graph	Time series of requests per second
Latency graph	Time series of P50/P95/P99 response times
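For readers unfamiliar with the P50/P95/P99 notation, the sketch below computes percentiles from a list of latency samples with the nearest-rank method. This is only one common definition, chosen here for illustration; how the platform actually aggregates latency is not specified on this page.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point error for integer p.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
latencies_ms = [120, 80, 100, 300, 90]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

P95 and P99 expose slow outliers that an average would hide, which is why they are tracked alongside P50.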

After you enable Async Queue in Settings, you can browse asynchronous requests and their results on the Async Queue tab.

Async Queue enabled

Item	Description
Status filter	All / Pending / Processing / Completed / Failed
Queue Depth	Number of Tasks currently in the queue
Processing	Number of Tasks currently in progress

Click a specific Task to view Request / Response details.

Async Queue Task detail

To make an async request, add the `X-Async: true` header to the HTTP request.

curl -X POST 'https://<deployment-endpoint>/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -H 'X-Async: true' \
  -d '{"model": "model-name", "messages": [{"role": "user", "content": "Hello"}]}'
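The same request can be sketched in Python with the standard library. The endpoint, token, and model name below are placeholders, and the function name is hypothetical. The example only builds the request; sending it requires a reachable endpoint and a valid token, and the shape of the async response is not covered here.

```python
import json
import urllib.request


def build_async_request(endpoint: str, token: str, model: str, prompt: str):
    """Build a POST to /v1/chat/completions that carries the X-Async header."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
            "X-Async": "true",  # marks the request as asynchronous
        },
    )


# Placeholder values, for illustration only.
req = build_async_request(
    "https://serving.example.com/demo-project/my-model-v1",
    "<token>", "model-name", "Hello",
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```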

Naming Rules

Lowercase letters, digits, and hyphens (-) are allowed
Must not start or end with a hyphen
Up to 63 characters (Kubernetes limit)

Examples: my-model-v1, llm-server-prod
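The rules above can be checked before submitting the form. The following is a minimal sketch of such a check; the function name is hypothetical, and the regular expression encodes only the three rules listed here.

```python
import re

# 1 to 63 characters: lowercase letters, digits, and hyphens,
# with no hyphen in the first or last position.
NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_serving_name(name: str) -> bool:
    """Return True if the name satisfies the naming rules above."""
    return NAME_RE.fullmatch(name) is not None


for candidate in ("my-model-v1", "llm-server-prod", "-bad", "Bad-Name"):
    print(candidate, is_valid_serving_name(candidate))
```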

Supported Accelerators

Accelerator	Resource Key	Available Features
NVIDIA GPU	`nvidia.com/gpu`	Lab, Serving
Furiosa RNGD	`furiosa.ai/rngd`	Lab, Serving
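In Kubernetes, accelerators are requested through resource keys like the ones in the table, alongside `cpu` and `memory`. The sketch below shows how the form's resource fields could translate into a container `resources.limits` mapping. The function name and the exact mapping the platform applies are assumptions; the defaults are taken from the common settings described earlier.

```python
ACCELERATOR_RESOURCE_KEYS = {
    "NVIDIA GPU": "nvidia.com/gpu",
    "Furiosa RNGD": "furiosa.ai/rngd",
}


def resource_limits(cpu="0.5", memory="1Gi", accelerator=None, accelerator_count=1):
    """Build a Kubernetes-style limits mapping (illustrative mapping)."""
    limits = {"cpu": cpu, "memory": memory}
    if accelerator is not None:
        if accelerator_count < 1:
            raise ValueError("accelerator_count must be at least 1")
        # A KeyError here means the accelerator is not in the supported table.
        limits[ACCELERATOR_RESOURCE_KEYS[accelerator]] = str(accelerator_count)
    return limits


print(resource_limits(accelerator="NVIDIA GPU", accelerator_count=2))
```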