Serving Settings

In the Settings tab of the Serving detail page, you configure the inference server, traffic management, and the Transformer.

Inference Server

Auto Scaling automatically increases or decreases the number of Pods based on request load. It is suitable for services with irregular or hard-to-predict traffic. To keep a fixed number of Pods running at all times, disable it and adjust Replicas only.

Inference Server basic settings

Setting	Description	Default
Replicas	Number of Pod replicas to run	1
Auto Scaling	Automatically adjust the Pod count based on traffic load	Off
Readiness Endpoint	Endpoint to check whether a Pod is ready to receive traffic (e.g., `/health`, `/v1/models`)	-
Liveness Endpoint	Endpoint to check whether a Pod is operating normally. Repeated failures trigger an automatic restart (e.g., `/health`, `/healthz`)	-
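
As an illustration of what an inference server might expose for these two endpoints, the sketch below maps probe paths to HTTP status codes. It assumes that a 2xx response counts as success (as with Kubernetes HTTP probes), that `/healthz` serves as the liveness path and `/health` as the readiness path, and that readiness depends on a hypothetical `MODEL_LOADED` flag. None of these names come from the platform itself.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical flag: set to True once the model weights are loaded.
MODEL_LOADED = False


def probe_status(path, model_loaded):
    """Map a probe path to an HTTP status code (illustrative only)."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200
    if path == "/health":
        # Readiness: accept traffic only after the model is loaded.
        return 200 if model_loaded else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reply with the status code only; probes do not need a body.
        self.send_response(probe_status(self.path, MODEL_LOADED))
        self.end_headers()


def make_server(port=8080):
    """Build (but do not start) an HTTP server exposing the probe paths."""
    return ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
```

With a server like this, the Readiness Endpoint would be set to `/health` and the Liveness Endpoint to `/healthz`. Keeping the two paths separate means a slow model load delays traffic without causing a restart.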

Additional settings when Auto Scaling is enabled:

Auto Scaling enabled settings

Setting	Description	Default
Min Replicas	Minimum number of Pods to keep running at all times. Setting it to `0` lets Pods terminate completely when there is no traffic, saving resources, but incurs a cold start delay (tens of seconds to several minutes) on the next request.	1
Scale-in Delay (s)	Wait time after traffic decrease before scaling Pods down (prevents flapping)	60
Max Replicas	Maximum number of Pods (consider cluster accelerator headroom)	10
Target Response Time (ms)	Target P95 response time used as the auto-scale trigger. Pods are added when the measured P95 exceeds this value.	5000
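
How these four settings interact can be sketched as a simple decision function. This is not the platform's actual scaling algorithm: the one-Pod-at-a-time step, the nearest-rank P95 calculation, and the names `AutoScalingSettings`, `p95`, and `desired_replicas` are all assumptions made for illustration. Only the defaults are taken from the table above.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingSettings:
    # Defaults mirror the documented defaults.
    min_replicas: int = 1
    max_replicas: int = 10
    scale_in_delay_s: float = 60.0
    target_response_time_ms: float = 5000.0


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def desired_replicas(current, latencies_ms, seconds_below_target, settings):
    """Return the replica count a simple controller might choose."""
    over_target = (
        bool(latencies_ms)
        and p95(latencies_ms) > settings.target_response_time_ms
    )
    if over_target:
        proposed = current + 1  # scale out while P95 exceeds the target
    elif seconds_below_target >= settings.scale_in_delay_s:
        proposed = current - 1  # scale in only after the delay has elapsed
    else:
        proposed = current      # inside the delay window: hold steady
    # Never go outside the configured bounds.
    return max(settings.min_replicas, min(proposed, settings.max_replicas))
```

For example, `desired_replicas(3, [100] * 10, 30, AutoScalingSettings())` keeps 3 Pods because only 30 of the 60 seconds have elapsed. The same call with `60` returns 2.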

Traffic Management

Traffic Management is a group of features covering request distribution across multiple Pods, temperature-based traffic protection, and asynchronous processing. If you are running a single Pod, you can leave Load Balancing and Temperature Policy disabled. Async Queue is for asynchronous workflows in which the client submits a request without waiting for an immediate response and retrieves the result later.

Traffic Management basic settings

Feature	Description	Default
Load Balancing	Distribute requests across multiple Pods. Recommended when Replicas is 2 or more.	Off
Temperature Policy	Automatically block traffic to a Pod when GPU/NPU temperature exceeds a threshold, and resume when it recovers. Recommended for hardware protection during long, high-load inference.	Off
Async Queue	Enable a Redis-based async request queue. Suitable for batch inference or long-running tasks where the client does not need to wait for an immediate response after submission.	Off
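
The submit-then-retrieve flow that Async Queue supports can be illustrated with an in-process stand-in. The sketch below deliberately replaces Redis with a `deque` and a `dict` so that it runs without dependencies. The class `InMemoryAsyncQueue` and its methods are hypothetical and do not reflect the platform's API.

```python
import uuid
from collections import deque


class InMemoryAsyncQueue:
    """Stand-in for a Redis-backed queue; shows the submit/poll flow only."""

    def __init__(self):
        self._pending = deque()  # jobs waiting for a worker
        self._results = {}       # job_id -> finished result

    def submit(self, payload):
        """Enqueue a request and return a job ID immediately."""
        job_id = str(uuid.uuid4())
        self._pending.append((job_id, payload))
        return job_id

    def process_next(self, infer):
        """Worker side: run one queued job; return False if the queue is empty."""
        if not self._pending:
            return False
        job_id, payload = self._pending.popleft()
        self._results[job_id] = infer(payload)
        return True

    def result(self, job_id):
        """Client side: return the result, or None while the job is pending."""
        return self._results.get(job_id)
```

The client calls `submit` and later polls `result` with the returned ID, while a worker calls `process_next` independently. This decoupling is what makes the pattern a fit for batch or long-running inference.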

When Load Balancing is enabled, a Policy dropdown appears:

Load Balancing enabled

Option	Description
LEAST_REQUEST (default)	Route to the Pod with the fewest active requests
ROUND_ROBIN	Route to Pods in rotation
RANDOM	Route to a randomly selected Pod
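
The three policies differ only in how the next Pod is chosen. The following minimal sketch shows each selection rule. The function names and the `active_requests` mapping of Pod name to in-flight request count are assumptions for illustration, not the platform's implementation.

```python
import itertools
import random


def least_request(active_requests):
    """LEAST_REQUEST: pick the Pod with the fewest in-flight requests.

    Ties go to the first Pod encountered in the mapping.
    """
    return min(active_requests, key=active_requests.get)


def round_robin(pods):
    """ROUND_ROBIN: return an iterator that yields Pods in rotation."""
    return itertools.cycle(pods)


def random_choice(pods, rng=random):
    """RANDOM: pick any Pod with equal probability."""
    return rng.choice(pods)
```

LEAST_REQUEST adapts to uneven request durations, which is common in inference workloads. The other two policies ignore current load and need no per-Pod state beyond the Pod list.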

When Temperature Policy is enabled, threshold settings appear:

Temperature Policy enabled

Setting	Description	Default
Critical Threshold (°C)	Temperature at which traffic is blocked	85
Recovery Threshold (°C)	Temperature at which traffic resumes	70
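
Using two thresholds instead of one gives hysteresis: a Pod that has been blocked at the critical temperature stays blocked until it cools to the lower recovery temperature, so traffic does not toggle on and off around a single value. The sketch below is a minimal illustration. The class `TemperatureGate` is hypothetical, and whether the platform's comparisons are inclusive at the boundaries is an assumption.

```python
class TemperatureGate:
    """Two-threshold (hysteresis) gate for one Pod's accelerator."""

    def __init__(self, critical_c=85.0, recovery_c=70.0):
        if recovery_c >= critical_c:
            raise ValueError("recovery threshold must be below critical")
        self.critical_c = critical_c
        self.recovery_c = recovery_c
        self.blocked = False

    def update(self, temperature_c):
        """Record a reading; return True if the Pod should receive traffic."""
        if self.blocked:
            # Assumed: resume once the reading is at or below recovery.
            if temperature_c <= self.recovery_c:
                self.blocked = False
        elif temperature_c > self.critical_c:
            # Block as soon as the reading exceeds the critical threshold.
            self.blocked = True
        return not self.blocked
```

With the default thresholds, the readings 60, 86, 80, 72, 70, 75 yield allow, block, block, block, allow, allow. The readings of 80 and 72 stay blocked even though they are below 85.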