Evaluating MCP-Based Answers with a Custom Dataset

This notebook evaluates how well a model can answer questions about the tiktoken GitHub repository, using the OpenAI Evals framework with a custom in-memory dataset.

We use a custom in-memory dataset of question-answer pairs and compare two models, gpt-4.1 and o4-mini, which leverage MCP tools to provide repository-aware, contextually accurate answers.

Goals:

  • Show how to set up and run an OpenAI Evals evaluation with a custom dataset.
  • Compare the performance of different models that leverage MCP-based tooling.
  • Provide best practices for professional, reproducible evaluation workflows.

Next, we will set up the environment and import the required libraries.

# Update the OpenAI client
%pip install --upgrade openai --quiet
Note: you may need to restart the kernel to use the updated packages.

Environment Setup

We start by importing the required libraries and configuring the OpenAI client. This step ensures we have access to the OpenAI API and all the utilities needed for the evaluation.

import os
import time

from openai import OpenAI

# Instantiate the OpenAI client (no custom base_url).
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY") or os.getenv("_OPENAI_API_KEY"),
)

Define the Custom Evaluation Dataset

We define a small in-memory dataset of question-answer pairs about the tiktoken repository. This dataset will be used to test each model's ability to provide accurate, relevant answers with the help of MCP tools.

  • Each item contains a query (the user's question) and an answer (the expected ground-truth answer).
  • You can modify or extend this dataset for your own use case or repository.
def get_dataset(limit=None):
    items = [
        {
            "query": "What is tiktoken?",
            "answer": "tiktoken is a fast Byte-Pair Encoding (BPE) tokenizer designed for OpenAI models.",
        },
        {
            "query": "How do I install the open-source version of tiktoken?",
            "answer": "Install it from PyPI with `pip install tiktoken`.",
        },
        {
            "query": "How do I get the tokenizer for a specific OpenAI model?",
            "answer": 'Call tiktoken.encoding_for_model("<model-name>"), e.g. tiktoken.encoding_for_model("gpt-4o").',
        },
        {
            "query": "How does tiktoken perform compared to other tokenizers?",
            "answer": "On a 1 GB GPT-2 benchmark, tiktoken runs about 3-6x faster than GPT2TokenizerFast (tokenizers==0.13.2, transformers==4.24.0).",
        },
        {
            "query": "Why is Byte-Pair Encoding (BPE) useful for language models?",
            "answer": "BPE is reversible and lossless, handles arbitrary text, compresses input (≈4 bytes per token on average), and exposes common subwords like “ing”, which helps models generalize.",
        },
    ]
    return items[:limit] if limit else items

Define the Grading Logic

To evaluate the models' answers, we use two graders:

  • Pass/fail grader (LLM-based): an LLM-based grader that checks whether the model's answer matches the expected answer (the ground truth) or conveys the same meaning.
  • Python MCP grader: a Python function that checks whether the model actually used an MCP tool in its response (to audit tool usage).

Best practice: combining LLM-based and programmatic graders gives a more robust and transparent evaluation.

# LLM-based pass/fail grader: instructs the model to grade the answer as "pass" or "fail".
pass_fail_grader = """
You are a helpful assistant that grades the quality of the answer to a query about a GitHub repo.
You will be given a query, the answer returned by the model, and the expected answer.
You should respond with **pass** if the answer matches the expected answer exactly or conveys the same meaning, otherwise **fail**.
"""

# User prompt template for the grader, providing the context needed for grading.
pass_fail_grader_user_prompt = """
<Query>
{{item.query}}
</Query>

<Web Search Result>
{{sample.output_text}}
</Web Search Result>

<Ground Truth>
{{item.answer}}
</Ground Truth>
"""


# Python grader: checks whether an MCP tool was used by inspecting the output_tools field.
python_mcp_grader = {
    "type": "python",
    "name": "Assert MCP was used",
    "image_tag": "2025-05-08",
    "pass_threshold": 1.0,
    "source": """
def grade(sample: dict, item: dict) -> float:
    output = sample.get('output_tools', [])
    return 1.0 if len(output) > 0 else 0.0
""",
}

Define the Evaluation Configuration

Now we configure the evaluation using the OpenAI Evals client. This step specifies:

  • The evaluation name and dataset.
  • The schema for each item (which fields each question-answer pair contains).
  • The graders to use (LLM-based and/or Python-based).
  • The passing criteria and labels.

Best practice: clearly defining the evaluation schema and grading logic up front ensures reproducibility and transparency.

# Create the evaluation definition using the OpenAI Evals client.
logs_eval = client.evals.create(
    name="MCP Eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "answer": {"type": "string"},
            },
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            "type": "label_model",
            "name": "General Evaluator",
            "model": "o3",
            "input": [
                {"role": "system", "content": pass_fail_grader},
                {"role": "user", "content": pass_fail_grader_user_prompt},
            ],
            "passing_labels": ["pass"],
            "labels": ["pass", "fail"],
        },
        python_mcp_grader
    ],
)

Run the Evaluation for Each Model

Now we run the evaluation for each model (gpt-4.1 and o4-mini). Each run is configured to:

  • Use MCP tools to obtain repository-aware answers.
  • Use the same dataset and evaluation configuration for a fair comparison.
  • Specify model-specific parameters (e.g., maximum completion tokens and allowed tools).

Best practice: keeping the evaluation setup consistent across models ensures that results are comparable and reliable.

# Run 1: gpt-4.1 with MCP
gpt_4one_responses_run = client.evals.runs.create(
    name="gpt-4.1",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset()],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {
                        "type": "input_text",
                        "text": "You are a helpful assistant that searches the web and gives contextually relevant answers. Never use your tools to answer the query.",
                    },
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Search the web for the answer to the query {{item.query}}",
                    },
                },
            ],
        },
        "model": "gpt-4.1",
        "sampling_params": {
            "seed": 42,
            "temperature": 0.7,
            "max_completions_tokens": 10000,
            "top_p": 0.9,
            "tools": [
                {
                    "type": "mcp",
                    "server_label": "gitmcp",
                    "server_url": "https://gitmcp.io/openai/tiktoken",
                    "allowed_tools": [
                        "search_tiktoken_documentation",
                        "fetch_tiktoken_documentation",
                    ],
                    "require_approval": "never",
                }
            ],
        },
    },
)
# Run 2: o4-mini with MCP
gpt_o4_mini_responses_run = client.evals.runs.create(
    name="o4-mini",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset()],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {
                        "type": "input_text",
                        "text": "You are a helpful assistant that searches the web and gives contextually relevant answers.",
                    },
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Search the web for the answer to the query {{item.query}}",
                    },
                },
            ],
        },
        "model": "o4-mini",
        "sampling_params": {
            "seed": 42,
            "max_completions_tokens": 10000,
            "tools": [
                {
                    "type": "mcp",
                    "server_label": "gitmcp",
                    "server_url": "https://gitmcp.io/openai/tiktoken",
                    "allowed_tools": [
                        "search_tiktoken_documentation",
                        "fetch_tiktoken_documentation",
                    ],
                    "require_approval": "never",
                }
            ],
        },
    },
)

Poll for Completion and Retrieve Outputs

After launching the evaluation runs, we poll them until they complete. This step ensures we only analyze results once all model responses have been processed.

Best practice: adding a delay between polls avoids excessive API calls and keeps resource usage efficient.

def poll_runs(eval_id, run_ids):
    while True:
        runs = [client.evals.runs.retrieve(rid, eval_id=eval_id) for rid in run_ids]
        for run in runs:
            print(run.id, run.status, run.result_counts)
        if all(run.status in {"completed", "failed"} for run in runs):
            break
        time.sleep(5)

# Start polling both runs.
poll_runs(logs_eval.id, [gpt_4one_responses_run.id, gpt_o4_mini_responses_run.id])
evalrun_684769b577488191863b5a51cf4db57a completed ResultCounts(errored=0, failed=5, passed=0, total=5)
evalrun_684769c1ad9c8191affea5aa02ef1215 completed ResultCounts(errored=0, failed=3, passed=2, total=5)

Display and Interpret Model Outputs

Finally, we display each model's outputs for manual inspection and further analysis.

  • Each model's answer is printed for every question in the dataset.
  • You can compare the outputs side by side to assess quality, relevance, and correctness.

Below is a screenshot from the OpenAI Evals Dashboard illustrating the evaluation outputs for both models:

Evaluation Output

For a comprehensive breakdown of the evaluation metrics and results, navigate to the Data tab in the dashboard (a sketch for fetching the report link programmatically follows the screenshot):

Evaluation Data Tab
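
If you prefer to grab the dashboard link programmatically instead of navigating by hand, the short sketch below prints a report link for each run. It assumes the run objects returned by the Evals API expose a report_url field; treat it as a convenience, not part of the core workflow.

# Optional: print the dashboard report link for each run (assumes a `report_url` field on the run object).
for run_id in [gpt_4one_responses_run.id, gpt_o4_mini_responses_run.id]:
    run = client.evals.runs.retrieve(run_id, eval_id=logs_eval.id)
    print(run.name, run.report_url)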

Note that the gpt-4.1 model was prompted never to use its tools to answer the query, so it never called the MCP server. The o4-mini model was not explicitly instructed to use its tools, but it was not forbidden from doing so, and it called the MCP server 3 times. We can see that the gpt-4.1 model performed worse than the o4-mini model. It is also worth noting that one of the examples the o4-mini model failed was one in which it did not use the MCP tool.

We can also retrieve each model's detailed per-item outputs for manual inspection and further analysis.

four_one_output = client.evals.runs.output_items.list(
    run_id=gpt_4one_responses_run.id, eval_id=logs_eval.id
)

o4_mini_output = client.evals.runs.output_items.list(
    run_id=gpt_o4_mini_responses_run.id, eval_id=logs_eval.id
)
print('# gpt-4.1 Output')
for item in four_one_output:
    print(item.sample.output[0].content)

print('\n# o4-mini Output')
for item in o4_mini_output:
    print(item.sample.output[0].content)
# gpt-4.1 Output
Byte-Pair Encoding (BPE) is useful for language models because it provides an efficient way to handle large vocabularies and rare words. Here’s why it is valuable:

1. **Efficient Tokenization:**  
   BPE breaks down words into smaller subword units based on the frequency of character pairs in a corpus. This allows language models to represent both common words and rare or unknown words using a manageable set of tokens.

2. **Reduces Out-of-Vocabulary (OOV) Issues:**  
   Since BPE can split any word into known subword units, it greatly reduces the problem of OOV words—words that the model hasn’t seen during training.

3. **Balances Vocabulary Size:**  
   By adjusting the number of merge operations, BPE allows control over the size of the vocabulary. This flexibility helps in balancing between memory efficiency and representational power.

4. **Improves Generalization:**  
   With BPE, language models can better generalize to new words, including misspellings or new terminology, because they can process words as a sequence of subword tokens.

5. **Handles Morphologically Rich Languages:**  
   BPE is especially useful for languages with complex morphology (e.g., agglutinative languages) where words can have many forms. BPE reduces the need to memorize every possible word form.

In summary, Byte-Pair Encoding is effective for language models because it enables efficient, flexible, and robust handling of text, supporting both common and rare words, and improving overall model performance.
**Tiktoken**, developed by OpenAI, is a tokenizer specifically optimized for speed and compatibility with OpenAI's language models. Here’s how it generally compares to other popular tokenizers:

### Performance

- **Speed:** Tiktoken is significantly faster than most other Python-based tokenizers. It is written in Rust and exposed to Python via bindings, making it extremely efficient.
- **Memory Efficiency:** Tiktoken is designed to be memory efficient, especially for large text inputs and batch processing.

### Accuracy and Compatibility

- **Model Alignment:** Tiktoken is tailored to match the tokenization logic used by OpenAI’s GPT-3, GPT-4, and related models. This ensures that token counts and splits are consistent with how these models process text.
- **Unicode Handling:** Like other modern tokenizers (e.g., HuggingFace’s Tokenizers), Tiktoken handles a wide range of Unicode characters robustly.

### Comparison to Other Tokenizers

- **HuggingFace Tokenizers:** HuggingFace’s library is very flexible and supports a wide range of models (BERT, RoBERTa, etc.). However, its Python implementation can be slower for large-scale tasks, though their Rust-backed versions (like `tokenizers`) are competitive.
- **NLTK/SpaCy:** These libraries are not optimized for transformer models and are generally slower and less accurate for tokenization tasks required by models like GPT.
- **SentencePiece:** Used by models like T5 and ALBERT, SentencePiece is also fast and efficient, but its output is not compatible with OpenAI’s models.

### Use Cases

- **Best for OpenAI Models:** If you are working with OpenAI’s APIs or models, Tiktoken is the recommended tokenizer due to its speed and alignment.
- **General Purpose:** For non-OpenAI models, HuggingFace or SentencePiece might be preferable due to their versatility.

### Benchmarks & Community Feedback

- Multiple [community benchmarks](https://github.com/openai/tiktoken#performance) and [blog posts](https://www.philschmid.de/tokenizers-comparison) confirm Tiktoken’s speed advantage, especially for batch processing and large texts.

**Summary:**  
Tiktoken outperforms most tokenizers in speed when used with OpenAI models, with robust Unicode support and memory efficiency. For general NLP tasks across various models, HuggingFace or SentencePiece may be more suitable due to their versatility.

**References:**

- [Tiktoken GitHub - Performance](https://github.com/openai/tiktoken#performance)
- [Tokenizers Comparison Blog](https://www.philschmid.de/tokenizers-comparison)

Let me know if you need an example for a specific model!
To get the tokenizer for a specific OpenAI model, you typically use the Hugging Face Transformers library, which provides easy access to tokenizers for OpenAI models like GPT-3, GPT-4, and others. Here’s how you can do it:
  1. Using Hugging Face Transformers:

    Install the library (if you haven’t already): bash pip install transformers

    Example for GPT-3 (or GPT-4): ```python from transformers import AutoTokenizer

    For GPT-3 (davinci), use the corresponding model name

    tokenizer = AutoTokenizer.from_pretrained("openai-gpt")

    For GPT-4 (if available)

    tokenizer = AutoTokenizer.from_pretrained("gpt-4")

    ```

  2. Using OpenAI’s tiktoken library (for OpenAI API models):

    Install tiktoken: bash pip install tiktoken

    Example for GPT-3.5-turbo or GPT-4: ```python import tiktoken

    For 'gpt-3.5-turbo'

    tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

    For 'gpt-4'

    tokenizer = tiktoken.encoding_for_model("gpt-4")

    ```

    Summary:

    • Use transformers.AutoTokenizer for Hugging Face models.
    • Use tiktoken.encoding_for_model for OpenAI API models.

    References:

    Let me know if you need an example for a specific model!

To install the open-source version of tiktoken, you can use Python’s package manager, pip. The open-source version is available on PyPI, so you can install it easily with the following command:

pip install tiktoken
If you want to install the latest development version directly from the GitHub repository, you can use:

```bash
pip install git+https://github.com/openai/tiktoken.git
```

**Requirements:**

- Python 3.7 or newer
- pip (Python package installer)

**Steps:**

1. Open your terminal or command prompt.
2. Run one of the above commands.
3. Once installed, you can import and use `tiktoken` in your Python scripts.

**Additional Resources:**

- [tiktoken GitHub repository](https://github.com/openai/tiktoken)
- [tiktoken documentation](https://github.com/openai/tiktoken#readme)

Let me know if you need help with a specific operating system or environment!
Tiktoken is a fast and efficient tokenization library developed by OpenAI, primarily used for handling text input and output with language models such as GPT-3 and GPT-4. Tokenization is the process of converting text into smaller units called tokens, which can be words, characters, or subwords. Tiktoken is designed to closely match the tokenization behavior of OpenAI’s models, ensuring accurate counting and compatibility.

Key features of tiktoken:

- **Speed:** It’s written in Rust for performance and has Python bindings.
- **Compatibility:** Matches the exact tokenization used by OpenAI models, which is important for estimating token counts and costs.
- **Functionality:** Allows users to encode (convert text to tokens) and decode (convert tokens back to text).

Tiktoken is commonly used in applications that need to interact with OpenAI’s APIs, for tasks like counting tokens to avoid exceeding API limits or optimizing prompt length. It is available as an open-source library and can be installed via pip (`pip install tiktoken`).

# o4-mini Output
Here’s a high-level comparison of OpenAI’s tiktoken vs. some of the other commonly used tokenizers:

1. Implementation & Language Support  
   • tiktoken  
     – Rust core with Python bindings.  
     – Implements GPT-2/GPT-3/GPT-4 byte-pair-encoding (BPE) vocabularies.  
     – Focused on English-centric BPE; no built-in support for CJK segmentation or languages requiring character-level tokenization.  
   • Hugging Face Tokenizers (“tokenizers” library)  
     – Also Rust core with Python bindings.  
     – Supports BPE, WordPiece, Unigram (SentencePiece), Metaspace, and custom vocabularies.  
     – Broader multilingual and subword model support.  
   • Python-only Tokenizers (e.g. GPT-2 BPE in pure Python)  
     – Much slower, larger memory overhead, not suitable for high-throughput use.

2. Speed & Throughput  
   • tiktoken  
     – Benchmarks (OpenAI-internal) on a single CPU core: ~1–2 million tokens/second.  
     – Roughly 10–20× faster than pure-Python GPT-2 BPE implementations.  
     – Roughly 2–4× faster (or on par) with Hugging Face’s Rust tokenizers when using identical BPE models.  
   • Hugging Face Tokenizers  
     – In the same ballpark as tiktoken for a given BPE vocab (hundreds of thousands to a million tokens/sec).  
     – Slightly higher startup overhead when loading models, but offers more tokenization strategies.  
   • SentencePiece (C++) / Python bindings  
     – Generally slower than Rust-based (tiktoken, tokenizers) – on the order of 100–300 K tokens/sec.

3. Memory & Footprint  
   • tiktoken  
     – Tiny binary (~1–2 MB) plus vocab files (~50 MB).  
     – Low working memory; ideal for lightweight embedding or inference pipelines.  
   • Hugging Face Tokenizers  
     – Slightly larger binary (~3–5 MB) plus model files.  
     – Offers on-disk memory-mapping for very large vocabularies.  
   • Python-only  
     – Larger RAM footprint during init; slower GC pauses.

4. Feature Set & Flexibility  
   • tiktoken  
     – “Batteries included” for OpenAI model vocabularies: GPT-2, Codex, GPT-3.5, GPT-4.  
     – Simple API: encode/decode, count tokens.  
     – No training or custom-vocab routines.  
   • Hugging Face Tokenizers  
     – Train new tokenizers (BPE, WordPiece, Unigram).  
     – Pre- and post-processing pipelines (normalization, special tokens).  
     – Easy integration with Transformers.  
   • Other libraries (NLTK, spaCy, jieba, etc.)  
     – Not directly comparable, since many perform linguistic tokenization, not subword BPE.  
     – Far slower for BPE-style byte-pair encoding.

5. When to Use Which  
   • tiktoken  
     – If you’re targeting OpenAI’s GPT-family models and need maximum raw throughput/count accuracy.  
     – You don’t need to train a new tokenizer or handle exotic language scripts.  
   • Hugging Face Tokenizers  
     – If you need broad language support, multiple subword algorithms, training tools, or tight HF Transformers integration.  
   • Python-only / Other  
     – Only if you have trivial performance needs or are experimenting in pure-Python teaching/demo settings.

Bottom line: for GPT-style BPE tokenization at scale, tiktoken is one of the fastest and most lightweight options—substantially faster than any pure-Python implementation and roughly on par (or a bit faster) than other Rust-backed libraries, at the cost of supporting only OpenAI’s pre-built vocabularies.
Tiktoken is the open-source tokenization library that OpenAI uses to convert between text and the integer “tokens” their models (GPT-3, GPT-4, etc.) actually consume. It implements byte-pair encoding (BPE) in Rust (with Python bindings) for maximum speed and exact compatibility with OpenAI’s APIs.

Key points:

1. Purpose  
   • Language models work on token IDs, not raw text.  
   • Tiktoken maps Unicode text ↔ token IDs using the same vocabularies and BPE merges that OpenAI’s models were trained on.

2. Performance  
   • Typically 3–6× faster than other BPE tokenizers (e.g. Hugging Face’s GPT2TokenizerFast).  
   • Handles gigabytes of text in seconds.

3. Installation  
   pip install tiktoken

4. Basic usage
import tiktoken

# Get a specific encoding (vocabulary + merges)
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, world!")
text   = enc.decode(tokens)
assert text == "Hello, world!"

# Or auto-select by OpenAI model name
enc = tiktoken.encoding_for_model("gpt-4o")  # e.g. returns cl100k_base under the hood
5. Why BPE?  
   • Reversible and lossless  
   • Handles any text (even unseen words) by splitting into subword units  
   • Compresses common substrings (e.g. “ing”, “tion”) so the model sees familiar chunks

6. Extras  
   • Educational module (tiktoken._educational) to visualize or train simple BPEs  
   • Extension mechanism (tiktoken_ext) to register custom encodings

7. Where to learn more  
   • GitHub: https://github.com/openai/tiktoken  
   • PyPI: https://pypi.org/project/tiktoken  
   • OpenAI Cookbook example: How to count tokens with tiktoken

In short, if you’re building or billing on token usage with OpenAI’s models, tiktoken is the official, fast, and exact way to go from text ↔ tokens.
Here are the two easiest ways to get the open-source tiktoken up and running:

1. Install the released package from PyPI  
   • (no Rust toolchain needed—prebuilt wheels for most platforms)
pip install tiktoken
   Then in Python:  
   ```python
   import tiktoken
   enc = tiktoken.get_encoding("cl100k_base")
   print(enc.encode("Hello, world!"))
   ```

2. Install the bleeding-edge version straight from GitHub  
   • (you’ll need a Rust toolchain—on macOS `brew install rust`, on Ubuntu `sudo apt install cargo`)
pip install git+https://github.com/openai/tiktoken.git@main
   Or, if you prefer to clone & develop locally:  
   ```bash
   git clone https://github.com/openai/tiktoken.git
   cd tiktoken
   pip install -e .
   ```

That’s it! Once installed, you can use `tiktoken.get_encoding(...)` to load any of the supported tokenizers.
To get the exact tokenizer (BPE encoding) that an OpenAI model uses, you can use the open-source tiktoken library. It provides a helper that maps model names to their correct tokenizers:

1. Install tiktoken
pip install tiktoken
2. In Python, call encoding_for_model(model_name):
import tiktoken

#—for a gpt-3.5-turbo or gpt-4 style model:
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(enc.name)            # e.g. "cl100k_base"
print(enc.encode("Hello")) # list of token IDs
   If you already know the encoding name (e.g. “cl100k_base” for GPT-3.5/4 or “r50k_base” for GPT-2), you can also do:
   ```python
   enc = tiktoken.get_encoding("cl100k_base")
   ```

3. In Node.js / JavaScript, use the tiktoken npm package the same way:
import { encoding_for_model } from "tiktoken";

const enc = await encoding_for_model("gpt-3.5-turbo");
console.log(enc.name);       // "cl100k_base"
console.log(enc.encode("Hi")); // array of token IDs
Under the hood encoding_for_model knows which BPE schema (“r50k_base”, “cl100k_base”, etc.) each OpenAI model uses and returns the right tokenizer instance.
Byte-Pair Encoding (BPE) has become the de-facto subword tokenization method in modern language models because it strikes a practical balance between fixed, closed vocabularies (word-level tokenizers) and open, but very long sequences (character-level tokenizers).  In particular:

1. Open-vocabulary coverage  
   • Learns subword units from your corpus by iteratively merging the most frequent byte (or character) pairs.  
   • Can represent any new or rare word as a sequence of known subwords—no “unknown token” blowups.

2. Compact vocabulary size  
   • Vocabulary sizes on the order of 20K–100K tokens capture very common words as single tokens and rare or morphologically complex words as a few subwords.  
   • Keeps softmax layers and embedding tables manageable in size.

3. Reduced data sparsity  
   • Shares subwords among many words (e.g. “play,” “playing,” “replay”).  
   • Provides better statistical estimates (fewer zero‐count tokens) and faster convergence in training.

4. Morphological and cross-lingual adaptability  
   • Naturally splits on morpheme or syllable boundaries when those are frequent in the data.  
   • Can be trained on multilingual corpora to share subwords across related languages.

5. Speed and simplicity  
   • Linear-time, greedy encoding of new text (just look up merges).  
   • Deterministic and invertible: you can reconstruct the original byte sequence exactly.

In short, BPE tokenization gives you a small, fixed-size vocabulary that still generalizes to unseen words, reduces training and memory costs, and improves statistical efficiency—key ingredients for high-quality, scalable language models.

How can we improve?

If we add the phrase "Always use your tools since they are the way to get the right answer in this task." to the system message of the o4-mini model, what do you think will happen? (try it out)
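
If you want to try it yourself, here is a minimal sketch of that experiment. It reuses the eval, dataset, tools, and sampling parameters defined above and only changes the system message; the run name o4_mini_tool_mandate_run is illustrative, and the result you get may differ.

# Sketch: re-run o4-mini with a system message that explicitly mandates tool use.
o4_mini_tool_mandate_run = client.evals.runs.create(
    name="o4-mini-always-use-tools",
    eval_id=logs_eval.id,
    data_source={
        "type": "responses",
        "source": {
            "type": "file_content",
            "content": [{"item": item} for item in get_dataset()],
        },
        "input_messages": {
            "type": "template",
            "template": [
                {
                    "type": "message",
                    "role": "system",
                    "content": {
                        "type": "input_text",
                        "text": (
                            "You are a helpful assistant that searches the web and gives "
                            "contextually relevant answers. Always use your tools since they "
                            "are the way to get the right answer in this task."
                        ),
                    },
                },
                {
                    "type": "message",
                    "role": "user",
                    "content": {
                        "type": "input_text",
                        "text": "Search the web for the answer to the query {{item.query}}",
                    },
                },
            ],
        },
        "model": "o4-mini",
        "sampling_params": {
            "seed": 42,
            "max_completions_tokens": 10000,
            "tools": [
                {
                    "type": "mcp",
                    "server_label": "gitmcp",
                    "server_url": "https://gitmcp.io/openai/tiktoken",
                    "allowed_tools": [
                        "search_tiktoken_documentation",
                        "fetch_tiktoken_documentation",
                    ],
                    "require_approval": "never",
                }
            ],
        },
    },
)
# Reuse the polling helper defined earlier to wait for this run to finish.
poll_runs(logs_eval.id, [o4_mini_tool_mandate_run.id])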




If you guessed that the model would now call the MCP tool every time and get every answer correct, you are right!

Evaluation Data Tab Evaluation Data Tab

In this notebook, we demonstrated a sample workflow for evaluating the ability of LLMs to answer technical questions about the tiktoken repository, using the OpenAI Evals framework together with MCP tooling.

Key points covered:

  • Defined a focused, custom dataset for evaluation.
  • Configured LLM-based and Python-based graders for robust assessment.
  • Compared two models (gpt-4.1 and o4-mini) in a reproducible and transparent manner.
  • Retrieved and displayed model outputs for automated/manual inspection.

Next steps:

  • Expand the dataset: Add more diverse and challenging questions to better assess model capabilities.
  • Analyze results: Summarize pass/fail rates, visualize performance, or perform error analysis to identify strengths and weaknesses (a minimal sketch follows this list).
  • Experiment with models/tools: Try additional models, adjust tool configurations, or test on other repositories.
  • Automate reporting: Generate summary tables or plots for easier sharing and decision-making.
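
As a starting point for the "Analyze results" step, the sketch below recomputes simple pass rates for the two runs created in this notebook. It relies only on objects already defined above (the client, the eval, and the two run IDs).

# Sketch: summarize pass/fail rates for the runs created above.
for run_id in [gpt_4one_responses_run.id, gpt_o4_mini_responses_run.id]:
    run = client.evals.runs.retrieve(run_id, eval_id=logs_eval.id)
    counts = run.result_counts
    pass_rate = counts.passed / counts.total if counts.total else 0.0
    print(f"{run.name}: {counts.passed}/{counts.total} passed ({pass_rate:.0%})")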

For more information, check out the OpenAI Evals documentation.