
Serving Settings

In the Settings tab of the Serving detail page, you configure the inference server, traffic, and Transformer.

Auto Scaling automatically increases or decreases the number of Pods based on request load. It is suitable for services with irregular or hard-to-predict traffic. To keep a fixed number of Pods running at all times, leave Auto Scaling off and set only Replicas.

Inference Server basic settings

| Setting | Description | Default |
| --- | --- | --- |
| Replicas | Adjust the number of replicas | 1 |
| Auto Scaling | Automatically adjust the Pod count based on traffic load | Off |
| Readiness Endpoint | Endpoint to check whether a Pod is ready to receive traffic (e.g., /health, /v1/models) | - |
| Liveness Endpoint | Endpoint to check whether a Pod is operating normally. Repeated failures trigger automatic restart (e.g., /health, /healthz) | - |
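
As an illustration of what the two probe endpoints might return, here is a minimal sketch of a probe handler built on the Python standard library. It is not the platform's implementation. The paths `/health` (readiness) and `/healthz` (liveness) are taken from the examples above, while `probe_response`, `ProbeHandler`, `make_server`, and the `MODEL_READY` flag are hypothetical names introduced for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag: a real inference server would flip this once the model is loaded.
MODEL_READY = {"loaded": False}


def probe_response(path: str, model_loaded: bool) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for a probe path."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200, {"status": "alive"}
    if path == "/health":
        # Readiness: accept traffic only after the model has finished loading.
        if model_loaded:
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, MODEL_READY["loaded"])
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) an HTTP server that answers the probe paths."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)
```

To try it locally, call `make_server().serve_forever()` and request `/health`. It should answer 503 until `MODEL_READY["loaded"]` is set to `True`, which is the kind of signal a readiness check relies on to hold traffic back from a Pod that is still starting.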

Additional settings when Auto Scaling is enabled:

Auto Scaling enabled settings

| Setting | Description | Default |
| --- | --- | --- |
| Min Replicas | Minimum number of Pods to keep running at all times. Setting it to 0 lets Pods terminate completely when there is no traffic, saving resources, but incurs a cold start delay (tens of seconds to several minutes) on the next request. | 1 |
| Scale-in Delay (s) | Wait time after traffic decrease before scaling Pods down (prevents flapping) | 60 |
| Max Replicas | Maximum number of Pods (consider cluster accelerator headroom) | 10 |
| Target Response Time (ms) | Target P95 response time used as the auto-scale trigger. Pods are added when exceeded. | 5000 |
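
To show how these four settings interact, the following is a simplified sketch of a scaling decision. It is not the platform's actual algorithm, which this page does not describe. It assumes one Pod is added or removed per evaluation, and that scale-in happens only after the P95 response time has stayed at or below the target for the full delay. `AutoScalingSettings`, `desired_replicas`, and `calm_seconds` are hypothetical names; the defaults mirror the table above.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingSettings:
    # Defaults mirror the settings table; the field names are illustrative.
    min_replicas: int = 1
    max_replicas: int = 10
    scale_in_delay_s: int = 60
    target_response_time_ms: int = 5000


def desired_replicas(
    current: int,
    p95_ms: float,
    calm_seconds: float,
    settings: AutoScalingSettings,
) -> int:
    """Return the next Pod count under the simplified rules described above.

    calm_seconds is how long the P95 response time has stayed at or below
    the target (an assumption made for this sketch).
    """
    if p95_ms > settings.target_response_time_ms:
        wanted = current + 1          # target exceeded: add a Pod
    elif calm_seconds >= settings.scale_in_delay_s:
        wanted = current - 1          # calm for long enough: remove a Pod
    else:
        wanted = current              # inside the delay window: hold steady
    # Never go below Min Replicas or above Max Replicas.
    return max(settings.min_replicas, min(settings.max_replicas, wanted))
```

With the defaults, `desired_replicas(2, 6000, 0, AutoScalingSettings())` returns 3 because the P95 exceeds 5000 ms, while a brief 30-second dip in load leaves the count unchanged. That hold is the flapping protection Scale-in Delay is meant to provide. Passing `AutoScalingSettings(min_replicas=0)` allows the count to reach 0, at the price of the cold start noted in the table.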