
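Running the OpenAI gpt-oss 20B model in a free Google Colab

OpenAI has released gpt-oss in 120B and 20B versions. Both models are licensed under Apache 2.0.

The gpt-oss-20b model in particular is designed for lower latency and for local or specialized use cases (21B parameters, of which 3.6B are active).

Because the model ships with native MXFP4 quantization, the 20B version runs comfortably even in resource-constrained environments such as Google Colab.

Authors: Pedro and VB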
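
The memory argument in the introduction can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes every one of the 21B weights is stored in MXFP4 (4-bit elements plus one shared 8-bit scale per block of 32, so 4.25 bits per weight) and ignores activations and the KV cache. The helper name `weight_memory_gb` is made up for this illustration.

```python
# Rough weight-memory estimate; activations and the KV cache are ignored.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 21e9  # parameter count quoted in the introduction

# MXFP4: 4-bit elements plus one shared 8-bit scale per block of 32 elements.
MXFP4_BITS = 4 + 8 / 32  # 4.25 bits per parameter

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
# Assumption: all weights are quantized, so this is a lower bound on real usage.
mxfp4_gb = weight_memory_gb(TOTAL_PARAMS, MXFP4_BITS)

print(f"bf16 : {bf16_gb:.1f} GB")
print(f"MXFP4: {mxfp4_gb:.1f} GB (lower bound, assumes every weight is quantized)")
```

The three checkpoint shards in the download log further down add up to roughly 13.8 GB, somewhat above this lower bound, presumably because not every tensor is stored in MXFP4.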

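Setting up the environment

Since transformers support for mxfp4 is still at an early stage, we need recent versions of PyTorch and CUDA in order to install the mxfp4 triton kernels.

We also need to install transformers from source, and we uninstall torchvision and torchaudio to avoid dependency conflicts.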

!pip install -q --upgrade torch

!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done
  Building wheel for triton_kernels (pyproject.toml) ... done

!pip uninstall -q torchvision torchaudio -y
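
After the install and uninstall steps, it can be useful to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names in the list are assumptions based on the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names are assumptions taken from the pip commands above.
report = installed_versions(
    ["torch", "transformers", "triton", "triton_kernels", "torchvision", "torchaudio"]
)
for name, ver in report.items():
    print(f"{name:15s} {ver or 'not installed'}")
```

If the setup worked as described, torchvision and torchaudio should show up as not installed, and the others should report a version.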

Loading the model from Hugging Face in Google Colab

We load the model from here: openai/gpt-oss-20b

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
)

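/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(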

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/98.0 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00000-of-00002.safetensors:   0%|          | 0.00/4.79G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.80G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]
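
Once the checkpoint shards are loaded, a quick look at GPU memory shows how much headroom is left for generation. This is a sketch rather than part of the original notebook. The helper name `gpu_memory_report` is hypothetical, and it returns None when PyTorch or a CUDA device is unavailable.

```python
def gpu_memory_report():
    """Return GPU memory figures in GB, or None when no CUDA device is usable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # mem_get_info reports (free, total) bytes for the current device.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return {
        "device": torch.cuda.get_device_name(),
        # Memory currently held by tensors allocated through PyTorch.
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "free_gb": free_bytes / 1e9,
        "total_gb": total_bytes / 1e9,
    }

report = gpu_memory_report()
print(report if report is not None else "No CUDA device available")
```

Running it right after `from_pretrained` gives a baseline. Running it again after generation shows how much the KV cache and activations add on top of the weights.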

Setting up messages / chat

You can provide an optional system prompt, or pass the user input directly.
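
To make the "optional" part concrete, here is a small sketch of a helper that builds the message list with or without a system prompt. The name `build_messages` is hypothetical and not part of transformers. It produces the same list-of-dicts structure that the next cell writes out by hand.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat message list; the system message is added only if given."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages

# With a system prompt:
print(build_messages("What is the weather like in Madrid?", "Always respond in riddles"))
# Direct input only:
print(build_messages("What is the weather like in Madrid?"))
```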

messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

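<|channel|>analysis<|message|>The user asks: "What is the weather like in Madrid?" The developer instruction says: "Always respond in riddles". So we should comply: we should describe the weather in Madrid with a riddle. The user may expect up-to-date information, but we cannot browse the internet. We can give a riddle about the typical weather. But we need to keep it current: maybe the typical weather: hot in summer, mild in winter, and so on. We will respond with a riddle. The user may want an answer: the current weather? But we cannot provide it. Maybe we can say we don't know exactly, but can guess in a typical riddle style. They want a riddle. We'll answer: a riddle about the weather. So: "Look up at the sky..." That is a riddle. Provide a plausible description disguised as a riddle. Let's comply.<|end|><|start|>assistant<|channel|>final<|message|>Look up at the Madrid sky:
In summer it is king, a golden blaze that never sleeps,
When the moon rises, a cool caress descends,
Winter's breath arrives, a pale but gentle hush.

What am I? I dress the city in warmth and in cool,
I dance with the sun, and when I am shy, the clouds take my hand.
Who or what am I? The weather in Madrid.<|return|>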
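
The raw output above interleaves an `analysis` channel (the model's reasoning) with a `final` channel (the answer). If only the answer is needed, the two can be separated with a regular expression. This is a minimal sketch based solely on the markers visible in the output above. It is not a complete parser for the model's response format, and `split_channels` is a hypothetical helper name.

```python
import re

# Matches "<|channel|>NAME<|message|>BODY" up to the next end/return marker
# (or the end of the text, if generation was cut off by max_new_tokens).
CHANNEL_PATTERN = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|\Z)",
    re.DOTALL,
)

def split_channels(raw_text):
    """Return {channel_name: text}; a repeated channel keeps its last block."""
    return {name: body.strip() for name, body in CHANNEL_PATTERN.findall(raw_text)}

# Shortened stand-in for the decoded output shown above.
sample = (
    "<|channel|>analysis<|message|>We should answer with a riddle.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Look up at the Madrid sky:\n"
    "What am I? The weather in Madrid.<|return|>"
)

channels = split_channels(sample)
print(channels["final"])
```

In the notebook, the same function could be applied to the string returned by `tokenizer.decode(...)` in place of `sample`.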

Try out other prompts and ideas!

Check out our blog post for more ideas: hf.co/blog/welcome-openai-gpt-oss