Dynamic Load Balancing for vLLM with Nginx
When several vLLM inference nodes serve traffic at once, the most common starting point is to put Nginx in front as a reverse proxy and let the upstream nodes share the requests.
Plain static round-robin, however, rarely works well in practice. LLM serving load is uneven: one node may be close to its KV Cache limit, building up a waiting queue, or even starting to preempt requests, while another node is still relatively idle. A load balancer that cannot see backend state will keep sending requests to a node that is already overloaded.
This article walks through a practical setup: Nginx as the single entry point, plus a periodically executed weight-adjustment script that reads vLLM metrics, rewrites the upstream configuration dynamically, and reloads Nginx automatically.
The overall pipeline
Conceptually there are two parallel paths: one carries the business request traffic, the other collects metrics and writes weights back into Nginx.
1. Nginx entry configuration
First, define a server block in Nginx to receive the OpenAI-compatible API traffic:
server {
    listen 80;
    server_name 10.0.0.10;

    access_log /var/log/nginx/llm_access.log upstream_trace;
    error_log /var/log/nginx/llm_error.log info;

    location /v1/ {
        proxy_pass http://llm_cluster;

        # Streaming-friendly settings
        proxy_buffering off;
        proxy_cache off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        client_body_buffer_size 8m;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
        proxy_connect_timeout 5s;
        send_timeout 300s;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Expose routing information for debugging
        add_header X-Upstream-Addr $upstream_addr always;
        add_header X-Upstream-Status $upstream_status always;
    }
}
A few points here are important:
- proxy_buffering off suits streaming output; it keeps Nginx from buffering the response
- proxy_http_version 1.1 together with Connection "" helps keep upstream keep-alive connections stable
- echoing X-Upstream-Addr and X-Upstream-Status in the response headers makes routing problems much easier to debug later
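Note that the access_log line above references a custom upstream_trace log format, which must be declared in the http block before it is used. The exact fields are up to you; a sketch that records upstream routing might look like:

```nginx
# Hypothetical log format for tracing upstream routing decisions;
# declare it in the http {} context.
log_format upstream_trace '$remote_addr [$time_local] "$request" '
                          '$status upstream=$upstream_addr '
                          'upstream_status=$upstream_status '
                          'rt=$request_time urt=$upstream_response_time';
```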
2. Defining the upstream cluster
Setting dynamic scheduling aside for a moment, start from a basic upstream:
upstream llm_cluster {
    least_conn;
    server 10.0.0.10:18090 weight=100 max_fails=10 fail_timeout=60s;
    server 10.0.0.11:18090 weight=100 max_fails=10 fail_timeout=60s;
    keepalive 64;
}
least_conn is a better fit for inference than plain round-robin, because per-request processing times vary widely. Even so, when the backends diverge noticeably in KV Cache usage, queue length, and preemption behavior, least_conn alone is not enough.
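To see how the server weights we will be rewriting actually shape traffic, here is a minimal Python sketch of the smooth weighted round-robin algorithm Nginx uses to select among upstream peers (with least_conn enabled, the active connection count is considered first and weights act as a tie-breaker; this sketch covers only the weighted part):

```python
def smooth_wrr(weights: dict, rounds: int) -> dict:
    """Simulate nginx-style smooth weighted round-robin peer selection."""
    current = {name: 0 for name in weights}  # per-peer accumulator
    total = sum(weights.values())
    picks = {name: 0 for name in weights}
    for _ in range(rounds):
        # Every peer accumulates its weight; the largest accumulator wins,
        # then the winner is pushed back down by the total weight.
        for name in current:
            current[name] += weights[name]
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks[chosen] += 1
    return picks

# Over one full cycle (sum of weights), traffic splits exactly by weight:
print(smooth_wrr({'node-a': 100, 'node-b': 60}, 160))
# → {'node-a': 100, 'node-b': 60}
```

This is why the script below only needs to adjust the weight numbers: Nginx itself turns them into a proportional traffic split.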
3. Running the weight adjustment on a systemd timer
The idea is simple: every few seconds, pull metrics from all backends, recompute the weights from their state, and if any weight changed, rewrite the upstream configuration and reload Nginx.
To make this run automatically, hand it to a systemd service plus timer pair:
# /etc/systemd/system/vllm-nginx-lb.service
[Unit]
Description=Dynamic nginx load balancing for vLLM backends
After=network.target nginx.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/vllm_nginx_lb.sh

# /etc/systemd/system/vllm-nginx-lb.timer
[Unit]
Description=Run dynamic nginx load balancing every 10 seconds

[Timer]
OnBootSec=10s
OnUnitActiveSec=10s
AccuracySec=1s
Unit=vllm-nginx-lb.service

[Install]
WantedBy=timers.target
The appeal of this approach is that it is simple and direct: there is no extra long-running daemon, and problems can be inspected with systemctl status and journalctl as usual. After installing the units, systemctl daemon-reload followed by systemctl enable --now vllm-nginx-lb.timer starts the cycle.
4. Metrics collection and weight adjustment script
Below is one version of the script. The core logic:
- fetch /metrics from every vLLM node
- extract KV Cache usage, waiting queue length, number of running requests, and the preemption total
- compute new weights from the current values and the previous state
- rewrite the upstream configuration and reload Nginx
#!/usr/bin/env python3
import os
import re
import shutil
import subprocess
import sys
import tempfile
import urllib.request

API_KEY = 'YOUR_API_KEY'
STATE_DIR = '/var/lib/vllm-nginx-lb'
UPSTREAM_FILE = '/etc/nginx/conf.d/llm_upstream.conf'

HOSTS = {
    'node-a': '10.0.0.10',
    'node-b': '10.0.0.11',
}

# Prometheus-style lines exposed by vLLM's /metrics endpoint
METRIC_PATTERNS = {
    'kv_cache_usage': re.compile(r'^vllm:kv_cache_usage_perc\{.*\}\s+([0-9.eE+-]+)$'),
    'waiting': re.compile(r'^vllm:num_requests_waiting\{.*\}\s+([0-9.eE+-]+)$'),
    'running': re.compile(r'^vllm:num_requests_running\{.*\}\s+([0-9.eE+-]+)$'),
    'preemptions': re.compile(r'^vllm:num_preemptions_total\{.*\}\s+([0-9.eE+-]+)$'),
}
os.makedirs(STATE_DIR, exist_ok=True)

def fetch_metrics(host: str) -> str:
    req = urllib.request.Request(
        f'http://{host}:18090/metrics',
        headers={'Authorization': f'Bearer {API_KEY}'},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode('utf-8', errors='ignore')

def parse_metric(raw: str, key: str) -> float:
    pattern = METRIC_PATTERNS[key]
    for line in raw.splitlines():
        match = pattern.match(line)
        if match:
            return float(match.group(1))
    raise RuntimeError(f'metric {key} not found')
def load_prev(name: str) -> int:
    # Last observed preemption counter, persisted between runs
    path = os.path.join(STATE_DIR, f'{name}.preemptions')
    if not os.path.exists(path):
        return 0
    try:
        return int(open(path, 'r', encoding='utf-8').read().strip() or '0')
    except Exception:
        return 0

def save_prev(name: str, value: int) -> None:
    path = os.path.join(STATE_DIR, f'{name}.preemptions')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(str(value))

def downgrade(weight: int) -> int:
    # One step down the weight ladder: 100 -> 60 -> 30 -> 10
    return {100: 60, 60: 30, 30: 10}.get(weight, 10)
def compute_weight(kv_cache_usage: float, waiting: int, running: int, preemption_delta: int) -> int:
    # Base weight tier from KV Cache usage
    if kv_cache_usage >= 0.92:
        weight = 10
    elif kv_cache_usage >= 0.85:
        weight = 30
    elif kv_cache_usage >= 0.75:
        weight = 60
    else:
        weight = 100
    # One extra downgrade per stress signal
    if waiting > 0:
        weight = downgrade(weight)
    if preemption_delta > 0:
        weight = downgrade(weight)
    if running > 8:
        weight = downgrade(weight)
    return weight
def render_upstream(weights: dict) -> str:
    return (
        'upstream llm_cluster {\n'
        '    least_conn;\n'
        f'    server {HOSTS["node-a"]}:18090 weight={weights["node-a"]} max_fails=10 fail_timeout=60s;\n'
        f'    server {HOSTS["node-b"]}:18090 weight={weights["node-b"]} max_fails=10 fail_timeout=60s;\n'
        '    keepalive 64;\n'
        '}\n'
    )

def log_metrics(metrics: dict, weights: dict) -> None:
    print(
        'vllm-nginx-lb '
        f"node-a(weight={weights['node-a']},kv={metrics['node-a']['kv_cache_usage']:.3f},waiting={metrics['node-a']['waiting']},running={metrics['node-a']['running']},preempt_delta={metrics['node-a']['preemption_delta']}) "
        f"node-b(weight={weights['node-b']},kv={metrics['node-b']['kv_cache_usage']:.3f},waiting={metrics['node-b']['waiting']},running={metrics['node-b']['running']},preempt_delta={metrics['node-b']['preemption_delta']})"
    )
def main() -> int:
    metrics = {}
    for name, host in HOSTS.items():
        raw = fetch_metrics(host)
        # Preemptions are a counter; only the delta since the last run matters
        current_preemptions = int(parse_metric(raw, 'preemptions'))
        previous_preemptions = load_prev(name)
        delta = max(0, current_preemptions - previous_preemptions)
        save_prev(name, current_preemptions)
        metrics[name] = {
            'kv_cache_usage': parse_metric(raw, 'kv_cache_usage'),
            'waiting': int(parse_metric(raw, 'waiting')),
            'running': int(parse_metric(raw, 'running')),
            'preemption_delta': delta,
        }

    weights = {name: compute_weight(**values) for name, values in metrics.items()}
    log_metrics(metrics, weights)

    new_content = render_upstream(weights)
    old_content = ''
    if os.path.exists(UPSTREAM_FILE):
        with open(UPSTREAM_FILE, 'r', encoding='utf-8') as f:
            old_content = f.read()
    if new_content == old_content:
        # Nothing changed; skip the rewrite and reload entirely
        return 0

    fd, tmp_path = tempfile.mkstemp(prefix='llm_upstream.', text=True)
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
            tmp.write(new_content)
        # Keep a backup so a failed reload can be rolled back
        backup_path = UPSTREAM_FILE + '.bak'
        if os.path.exists(UPSTREAM_FILE):
            shutil.copy2(UPSTREAM_FILE, backup_path)
        shutil.move(tmp_path, UPSTREAM_FILE)
        subprocess.run(['nginx', '-t'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        subprocess.run(['nginx', '-s', 'reload'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return 0
    except Exception:
        # Restore the backup and try to bring nginx back to a working state
        backup_path = UPSTREAM_FILE + '.bak'
        if os.path.exists(backup_path):
            shutil.move(backup_path, UPSTREAM_FILE)
        subprocess.run(['nginx', '-t'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        subprocess.run(['nginx', '-s', 'reload'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        raise
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

if __name__ == '__main__':
    try:
        raise SystemExit(main())
    except Exception as exc:
        print(f'vllm_nginx_lb failed: {exc}', file=sys.stderr)
        raise
5. Understanding the weight logic
The script does not attempt a sophisticated scheduling algorithm; it uses an easy-to-maintain stepped weight strategy:
- low KV Cache usage gets the full weight of 100
- as usage rises, the weight steps down through 60 / 30 / 10
- if the node already has a waiting queue, a preemption increase, or too many running requests, it is downgraded one extra step per condition
The advantage is that this is intuitive and the thresholds are easy to tune on site. For LLM inference serving, a scheduling policy that is simple but works reliably is often more useful than a clever one.
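To make the ladder concrete, here is the script's compute_weight and downgrade logic evaluated on a few hypothetical node states:

```python
def downgrade(weight: int) -> int:
    # One step down the ladder: 100 -> 60 -> 30 -> 10
    return {100: 60, 60: 30, 30: 10}.get(weight, 10)

def compute_weight(kv_cache_usage, waiting, running, preemption_delta) -> int:
    # Base tier from KV Cache usage
    if kv_cache_usage >= 0.92:
        weight = 10
    elif kv_cache_usage >= 0.85:
        weight = 30
    elif kv_cache_usage >= 0.75:
        weight = 60
    else:
        weight = 100
    # One extra downgrade per stress signal
    if waiting > 0:
        weight = downgrade(weight)
    if preemption_delta > 0:
        weight = downgrade(weight)
    if running > 8:
        weight = downgrade(weight)
    return weight

print(compute_weight(0.40, waiting=0, running=3, preemption_delta=0))   # idle node → 100
print(compute_weight(0.80, waiting=2, running=5, preemption_delta=0))   # busy tier + queue → 30
print(compute_weight(0.93, waiting=5, running=12, preemption_delta=4))  # saturated → 10
```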
6. Why do it this way
CPU or system load alone rarely reveals the real pressure on an LLM service. What actually drives throughput and latency are metrics closer to the inference engine's internal state:
- KV Cache usage
- number of waiting requests
- number of running requests
- whether preemptions have started to occur
Once a node is queueing requests or preempting frequently, it is no longer a good candidate for new traffic. Lowering its weight so that requests shift toward the relatively idle nodes generally makes the overall experience more stable.
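As an illustration of what the script consumes, here is a hypothetical fragment of vLLM's Prometheus-style /metrics output, parsed with the same kind of regex the script uses (the label set shown here is made up; real output carries the engine's own labels):

```python
import re

# Hypothetical /metrics lines as exposed by a vLLM node.
RAW = '''\
vllm:kv_cache_usage_perc{model_name="demo"} 0.87
vllm:num_requests_waiting{model_name="demo"} 3
vllm:num_requests_running{model_name="demo"} 7
vllm:num_preemptions_total{model_name="demo"} 42
'''

# One pass over all vllm:* gauges/counters instead of per-metric patterns
PATTERN = re.compile(r'^vllm:(\w+)\{.*\}\s+([0-9.eE+-]+)$', re.MULTILINE)
metrics = {name: float(value) for name, value in PATTERN.findall(RAW)}
print(metrics)
```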
7. Notes for real deployments
- if /metrics requires authentication, include the corresponding auth header in the script
- don't make the timer interval too short; 10 seconds is usually enough, and shorter intervals make Nginx reload too often
- keep a backup before rewriting the upstream file, and roll back automatically if the reload fails
- as the number of nodes grows, move the host list and thresholds into a standalone configuration file
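For the last point, a minimal sketch of what externalizing the host list and thresholds could look like (the schema and field names here are made up for illustration, not part of the article's script):

```python
import json

# Hypothetical config: hosts plus (threshold, weight) tiers; adjust to taste.
CONFIG_TEXT = '''{
    "hosts": {"node-a": "10.0.0.10", "node-b": "10.0.0.11"},
    "kv_tiers": [[0.92, 10], [0.85, 30], [0.75, 60]],
    "default_weight": 100,
    "max_running": 8
}'''

def load_config(text: str) -> dict:
    cfg = json.loads(text)
    # Tiers must be checked from the highest threshold down
    cfg['kv_tiers'].sort(key=lambda t: t[0], reverse=True)
    return cfg

def base_weight(cfg: dict, kv_cache_usage: float) -> int:
    for threshold, weight in cfg['kv_tiers']:
        if kv_cache_usage >= threshold:
            return weight
    return cfg['default_weight']

cfg = load_config(CONFIG_TEXT)
print(base_weight(cfg, 0.88))  # → 30
print(base_weight(cfg, 0.50))  # → 100
```

With the thresholds in data rather than code, adding a node or retuning a tier no longer requires touching the script itself.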
The essence of this setup is not to turn Nginx into a smart scheduler, but to actually put the runtime metrics that vLLM already exposes to use. For a small inference cluster, this approach is practical enough and much easier to maintain over the long term.