# Evaluation Implementation Guide

This guide explains how to create evaluation tests (`.eval.ts` files) for testing AI model interactions with specific tools or systems, such as Cloudflare Worker bindings or container environments.

## What are Evals?

Evals are automated tests designed to verify whether an AI model correctly understands instructions and uses its available "tools" (functions, API calls, environment interactions) to achieve a desired outcome. They assess the model's ability to follow instructions, select appropriate tools, and provide correct arguments to those tools.

## Core Concepts

Evals are typically built using a testing framework like `vitest` combined with specialized evaluation libraries like `vitest-evals`. The main structure revolves around `describeEval`:

```typescript
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'

import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { eachModel } from '@repo/eval-tools/src/test-models'

import { initializeClient, runTask } from './utils' // Helper functions

eachModel('$modelName', ({ model }) => {
  // Optional: Run tests for multiple models
  describeEval('A descriptive name for the evaluation suite', {
    data: async () => [
      /* Test cases */
    ],
    task: async (input) => {
      /* Test logic */
    },
    scorers: [
      /* Scoring functions */
    ],
    threshold: 1, // Passing score threshold
    timeout: 60000, // Test timeout
  })
})
```

### Key Parts

1. **`describeEval(name, options)`**: Defines a suite of evaluation tests.

   - `name`: A string describing the purpose of the eval suite.
   - `options`: An object containing the configuration for the eval:
     - **`data`**: An async function returning an array of test case objects. Each object typically contains:
       - `input`: (string) The instruction given to the AI model.
       - `expected`: (string) A natural language description of the _expected_ sequence of actions or outcome. This is used by scorers.
     - **`task`**: An async function that executes the actual test logic for a given `input`. It orchestrates the interaction with the AI/system and performs assertions.
     - **`scorers`**: An array of scoring functions (e.g., `checkFactuality`) that evaluate the test outcome based on the `promptOutput` returned by the `task` and the `expected` string from the `data`.
     - **`threshold`**: (number, usually between 0 and 1) The minimum score required from the scorers for a test case to pass. A threshold of `1` means a perfect score is required.
     - **`timeout`**: (number) Maximum time in milliseconds allowed for a single test case.

2. **`task(input)` Function**: The heart of the eval. It typically involves:

   - **Setup**: Initializing a client or test environment (`initializeClient`). This prepares the system for the test, configuring the available tools or connections.
   - **Execution**: Running the actual interaction (`runTask`). This function sends the `input` instruction to the AI model via the client and captures the results, which usually include:
     - `promptOutput`: The textual response from the AI model.
     - `toolCalls`: A structured list of the tools the AI invoked, along with the arguments passed to each tool.
   - **Assertions (`expect`)**: Using the testing framework's assertion library (`vitest`'s `expect` in the examples) to verify that the correct tools were called with the correct arguments, based on the `toolCalls` data. Sometimes this involves inspecting system state directly (e.g., reading a file created by a tool) to confirm the outcome.
   - **Return Value**: The `task` function usually returns the `promptOutput` to be evaluated by the `scorers`.

3. **Scoring (`checkFactuality`, etc.)**: Automated functions that compare the actual outcome (represented by the `promptOutput`, and implicitly by the assertions that passed within the `task`) against the `expected` description.

4. **Helper Utilities (`./utils`)** (a sketch of these helpers follows this list):
   - `initializeClient()`: Sets up the testing environment, connects to the system under test, and configures the available tools for the AI model.
   - `runTask(client, model, input)`: Sends the input prompt to the specified AI model using the configured client, executes the model's reasoning and tool use, and returns the results (`promptOutput`, `toolCalls`).
   - `eachModel()`: (Optional) A utility to run the same evaluation suite against multiple AI models.
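
The exact shape of these helpers depends on the system under test, so the following is only a minimal sketch of the contracts they might expose. The type names (`EvalClient`, `ToolCall`, `TaskResult`) and the `callTool` method are assumptions made for illustration, not part of any existing API:

```typescript
// utils.ts (illustrative sketch only; names and shapes are assumptions)

// One recorded tool invocation captured while the model works on a prompt.
export interface ToolCall {
  toolName: string // e.g. 'kv_write' or 'my_tool'
  args: Record<string, unknown> // arguments the model passed to the tool
}

// Whatever handle the system under test needs: an MCP client, a Miniflare
// instance, a thin fetch wrapper, etc.
export interface EvalClient {
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>
}

// Result of running a single input prompt through the model.
export interface TaskResult {
  promptOutput: string // the model's final textual answer
  toolCalls: ToolCall[] // every tool invocation, in order
}

// Prepares the environment and exposes the tools the model may use.
export declare function initializeClient(): Promise<EvalClient>

// Sends `input` to `model`, lets the model call tools through `client`, and
// returns the captured output and tool calls for assertions and scoring.
export declare function runTask(
  client: EvalClient,
  model: string,
  input: string
): Promise<TaskResult>
```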

## Steps to Implement Evals

1. **Identify Tools:** Define the specific actions or functions (the "tools") that the AI should be able to use within the system you're testing (e.g., `kv_write`, `d1_query`, `container_exec`).
2. **Create Helper Functions:** Implement your `initializeClient` and `runTask` (or similarly named) functions.
   - `initializeClient`: Should set up the necessary context, potentially using test environments like `vitest-environment-miniflare` for Workers. It needs to make the defined tools available to the AI model simulation.
   - `runTask`: Needs to simulate the AI processing: take an input prompt, interact with an LLM (or a mock) configured with the tools, capture which tools are called and with what arguments, and capture the final text output.
3. **Create Eval File (`*.eval.ts`):** Create a new file (e.g., `kv-operations.eval.ts`).
4. **Import Dependencies:** Import `describeEval`, scorers, helpers, `expect`, etc.
5. **Structure with `describeEval`:** Define your evaluation suite.
6. **Define Test Cases (`data`):** Write specific test scenarios:
   - Provide clear, unambiguous `input` prompts that target the tools you want to test.
   - Write concise `expected` descriptions detailing the primary tool calls or outcomes anticipated.
7. **Implement the `task` Function:**
   - Call `initializeClient`.
   - Call `runTask` with the `input`.
   - Write `expect` assertions to rigorously check:
     - Were the correct tools called? (`toolName`)
     - Were they called in the expected order (if applicable)?
     - Were the arguments passed to the tools correct? (`args`)
   - (Optional) Interact with the system state if necessary to verify side effects.
   - Return the `promptOutput`.
8. **Configure Scorers and Threshold:** Choose appropriate scorers (often `checkFactuality`) and set a `threshold`.
9. **Run Tests:** Execute the evals using your test runner (e.g., `vitest run`). A sample configuration sketch follows these steps.
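
Note that `*.eval.ts` files do not match vitest's default test file pattern, so the project configuration usually needs to include them explicitly. The snippet below is a minimal sketch assuming a standard `vitest.config.ts`; the include pattern and timeout are placeholders to adapt to your setup:

```typescript
// vitest.config.ts — minimal sketch; adjust paths and timeouts for your project
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    // Pick up eval suites in addition to (or instead of) regular unit tests.
    include: ['**/*.eval.ts'],
    // Evals call out to models and tools, so allow generous per-test time.
    testTimeout: 60_000,
  },
})
```

With a configuration like this in place, `vitest run` executes every `*.eval.ts` suite it finds.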

## Example Structure (Simplified)

```typescript
// my-feature.eval.ts
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'

import { checkFactuality } from '@repo/eval-tools/src/scorers'

import { initializeClient, runTask } from './utils'

describeEval('Tests My Feature Tool Interactions', {
  data: async () => [
    {
      input: 'Use my_tool to process the data "example"',
      expected: 'The my_tool tool was called with data set to "example"',
    },
    // ... more test cases
  ],
  task: async (input) => {
    const client = await initializeClient() // Sets up environment with my_tool
    const { promptOutput, toolCalls } = await runTask(client, 'your-model', input)

    // Check if my_tool was called
    const myToolCall = toolCalls.find((call) => call.toolName === 'my_tool')
    expect(myToolCall).toBeDefined()

    // Check arguments passed to my_tool
    expect(myToolCall?.args).toEqual(
      expect.objectContaining({
        data: 'example',
        // ... other expected args
      })
    )

    return promptOutput // Return AI output for scoring
  },
  scorers: [checkFactuality],
  threshold: 1,
})
```

## Best Practices

- **Clear Inputs:** Write inputs as clear, actionable instructions.
- **Specific Expected Outcomes:** Make `expected` descriptions precise enough for scorers, but focus on the key actions.
- **Targeted Assertions:** Use `expect` to verify the most critical aspects of tool calls (tool name, key arguments). Don't over-assert on trivial details unless necessary.
- **Isolate Tests:** Ensure each test case in `data` tests a specific interaction or a small sequence of interactions.
- **Helper Functions:** Keep `initializeClient` and `runTask` generic enough to be reused across different eval files for the same system.
- **Use `expect.objectContaining` or `expect.stringContaining`:** Often you only need to verify _parts_ of the arguments, not the entire structure, which makes tests less brittle (see the snippet after this list).
- **Descriptive Names:** Use clear names for `describeEval` blocks and meaningful `input`/`expected` strings.
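
As an illustration of the partial-matching point above, these matchers let an assertion name only the fields that matter; the tool call and `args` shape below are hypothetical:

```typescript
// Assert only the fields we care about; unrelated extra args won't fail the test.
expect(myToolCall?.args).toEqual(
  expect.objectContaining({
    key: 'user:123', // exact match on one argument
    value: expect.stringContaining('example'), // substring match on another
  })
)
```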