How to run gpt-oss with vLLM

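vLLM is an open-source, high-throughput inference engine that serves large language models (LLMs) efficiently by optimizing memory usage and processing speed. This guide walks you through setting up gpt-oss-20b or gpt-oss-120b on a server with vLLM so that you can serve gpt-oss as an API for your applications, and even connect it to the Agents SDK.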

Note that this guide is intended for server applications with dedicated GPUs such as the NVIDIA H100. For local inference on consumer GPUs, see our Ollama or LM Studio guides.

Pick your model

vLLM supports both model sizes of gpt-oss:

openai/gpt-oss-20b
The smaller model
Requires only about 16GB of VRAM
openai/gpt-oss-120b
Our larger, full-sized model
Best with ≥60GB of VRAM
Fits on a single H100 or in a multi-GPU setup

Both models are MXFP4 quantized out of the box.
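
As a rough illustration of the sizing guidance above, the following sketch maps available VRAM to a model ID. The helper name pick_model is hypothetical, and the 16 GB and 60 GB cutoffs are simply the approximate figures quoted in this guide treated as thresholds; vLLM itself does not enforce them.

```python
from typing import Optional

# Approximate VRAM guidance from this guide, largest model first.
# These thresholds are illustrative assumptions, not limits enforced by vLLM.
MODEL_MIN_VRAM_GB = [
    ("openai/gpt-oss-120b", 60),
    ("openai/gpt-oss-20b", 16),
]


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest gpt-oss model whose guideline fits in vram_gb, or None."""
    for model_id, min_gb in MODEL_MIN_VRAM_GB:
        if vram_gb >= min_gb:
            return model_id
    return None


if __name__ == "__main__":
    for vram in (12, 24, 80):
        print(vram, "GB ->", pick_model(vram))
```

Real memory needs also depend on context length and batch size, so treat the result as a starting point rather than a guarantee.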

Quick setup

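Install vLLM

vLLM recommends using uv to manage your Python environment. This helps select the right implementation for your environment. See the vLLM quickstart for details. To create a new virtual environment and install vLLM, run: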

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

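Start the server and download the model

vLLM provides a serve command that automatically downloads the model from HuggingFace and starts an OpenAI-compatible server on localhost:8000. In a terminal session on the server, run the command below that matches the model size you want.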

# For 20B
vllm serve openai/gpt-oss-20b

# For 120B
vllm serve openai/gpt-oss-120b
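
Downloading and loading the weights can take a while, so it may help to wait until the server answers before sending real traffic. The sketch below polls the model listing route of the OpenAI-compatible API (GET /v1/models) using only the standard library. The helper names list_served_models and wait_for_server, and the retry counts, are assumptions for illustration.

```python
import json
import time
import urllib.request


def list_served_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Return the model IDs reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_server(fetch=list_served_models, attempts=30, delay=10.0):
    """Call fetch() until it succeeds; raise TimeoutError after `attempts` failures."""
    for _ in range(attempts):
        try:
            return fetch()
        except OSError:
            # URLError, ConnectionRefusedError, and socket timeouts all derive from OSError.
            time.sleep(delay)
    raise TimeoutError("vLLM server did not become ready in time")


# Example (requires a running server):
# print(wait_for_server())
```

The fetch parameter is injectable so the retry logic can be exercised without a live server.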

Use the API

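vLLM exposes a Chat Completions-compatible API and a Responses-compatible API, so you can use the OpenAI SDK with few changes. Set the model argument to the model your server was started with. Here is a Python example: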

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(result.choices[0].message.content)

response = client.responses.create(
    model="openai/gpt-oss-120b",
    instructions="You are a helpful assistant.",
    input="Explain what MXFP4 quantization is."
)

print(response.output_text)

If you have used the OpenAI SDK before, this will feel familiar, and your existing code should work once you change the base URL.
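
Because the server speaks the same HTTP protocol as the OpenAI API, the SDK is optional. As a minimal sketch, the request behind the Chat Completions example can be built with the standard library alone. The helper name build_chat_request is hypothetical, and the api_key default of "EMPTY" mirrors the placeholder used in the SDK example.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key="EMPTY"):
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "openai/gpt-oss-20b",
    [{"role": "user", "content": "Explain what MXFP4 quantization is."}],
)
print(request.method, request.full_url)

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```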

Using tools (function calling)

vLLM supports function calling and gives the model browsing capabilities.

Function calling works through both the Responses and Chat Completions APIs.

Example of invoking a function via Chat Completions:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

print(response.choices[0].message)

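Because the model can make tool calls as part of its chain of thought (CoT), you need to pass the reasoning returned by the API back in the subsequent request, together with the tool's result, and repeat this until the model reaches a final answer.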
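
The loop described above can be sketched with plain dictionaries in the Chat Completions message format. This is an illustration under stated assumptions, not the definitive way to do it: the helper run_tool_calls, the TOOL_REGISTRY mapping, and the hand-written assistant message are all hypothetical, and the sketch preserves the reasoning simply by appending the assistant turn unchanged, since this guide does not name the field that carries it.

```python
import json


def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would query a weather service.
    return f"The weather in {city} is sunny."


# Maps tool names (as declared in `tools`) to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict, registry: dict) -> list:
    """Execute each tool call in an assistant message and build the 'tool' replies."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = registry[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"] or "{}")
        replies.append(
            {"role": "tool", "tool_call_id": call["id"], "content": str(fn(**args))}
        )
    return replies


# Hand-written stand-in for an assistant turn that requests a tool call.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'},
        }
    ],
}

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]
messages.append(assistant_message)  # keep the assistant turn exactly as returned
messages.extend(run_tool_calls(assistant_message, TOOL_REGISTRY))
print(messages[-1])
```

In a real loop, the extended messages list would be sent again with the same tools argument, repeating until a response arrives with no tool calls.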

Agents SDK integration

Want to use gpt-oss with OpenAI's Agents SDK?

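Both Agents SDKs let you override the OpenAI base client so that it points to vLLM and uses your self-hosted model. Alternatively, for the Python SDK, you can use the LiteLLM integration to proxy to vLLM.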

Here is a Python Agents SDK example:

uv pip install openai-agents

import asyncio
from openai import AsyncOpenAI
from agents import Agent, Runner, function_tool, OpenAIResponsesModel, set_tracing_disabled

set_tracing_disabled(True)

@function_tool
def get_weather(city: str):
    print(f"[debug] getting weather for {city}")
    return f"The weather in {city} is sunny."


async def main():
    agent = Agent(
        name="Assistant",
        instructions="You only respond in haikus.",
        model=OpenAIResponsesModel(
            model="openai/gpt-oss-120b",
            openai_client=AsyncOpenAI(
                base_url="http://localhost:8000/v1",
                api_key="EMPTY",
            ),
        ),
        tools=[get_weather],
    )

    result = await Runner.run(agent, "What's the weather in Tokyo?")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())

Using vLLM for direct sampling

Besides running vLLM as an API server with vllm serve, you can use the vLLM Python library directly to control inference.

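If you sample with vLLM directly, make sure your input prompts follow the harmony response format; otherwise the model will not function correctly. You can use the openai-harmony SDK for this.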

uv pip install openai-harmony

Afterwards, you can use harmony to encode the prompt and to parse the tokens produced by vLLM's generate function.

import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)

from vllm import LLM, SamplingParams

# --- 1) Render the prefill with Harmony ---
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages(
    [
        Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
        Message.from_role_and_content(
            Role.DEVELOPER,
            DeveloperContent.new().with_instructions("Always respond in riddles"),
        ),
        Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
    ]
)

prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)

# Harmony stop tokens (passed to the sampler so they are not included in the output)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()

# --- 2) Run vLLM with the prefill ---
llm = LLM(
    model="openai/gpt-oss-120b",
    trust_remote_code=True,
)

sampling = SamplingParams(
    max_tokens=128,
    temperature=1,
    stop_token_ids=stop_token_ids,
)

outputs = llm.generate(
    prompt_token_ids=[prefill_ids],   # batch of size 1
    sampling_params=sampling,
)

# vLLM returns both the text and the token IDs
gen = outputs[0].outputs[0]
text = gen.text
output_tokens = gen.token_ids  # <-- these are the completion token IDs (prefill not included)

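# --- 3) Parse the completion token IDs back into structured Harmony messages ---
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)

# 'entries' is a sequence of structured conversation entries (assistant messages, tool calls, etc.).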
for message in entries:
    print(f"{json.dumps(message.to_dict())}")