# Evaluation Implementation Guide

This guide explains how to create evaluation tests (`.eval.ts` files) for testing AI model interactions with specific tools or systems, such as Cloudflare Worker bindings or container environments.

## What are Evals?

Evals are automated tests designed to verify whether an AI model correctly understands instructions and uses its available "tools" (functions, API calls, environment interactions) to achieve a desired outcome. They assess the model's ability to follow instructions, select appropriate tools, and provide correct arguments to those tools.

## Core Concepts

Evals are typically built using a testing framework like `vitest` combined with specialized evaluation libraries like `vitest-evals`. The main structure revolves around `describeEval`:

```typescript
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'

import { checkFactuality } from '@repo/eval-tools/src/scorers'
import { eachModel } from '@repo/eval-tools/src/test-models'

import { initializeClient, runTask } from './utils' // Helper functions

eachModel('$modelName', ({ model }) => {
  // Optional: Run tests for multiple models
  describeEval('A descriptive name for the evaluation suite', {
    data: async () => [
      /* Test cases */
    ],
    task: async (input) => {
      /* Test logic */
    },
    scorers: [
      /* Scoring functions */
    ],
    threshold: 1, // Passing score threshold
    timeout: 60000, // Test timeout
  })
})
```

### Key Parts

1. **`describeEval(name, options)`**: Defines a suite of evaluation tests.

   - `name`: A string describing the purpose of the eval suite.
   - `options`: An object containing the configuration for the eval:
     - **`data`**: An async function returning an array of test case objects. Each object typically contains:
       - `input`: (string) The instruction given to the AI model.
       - `expected`: (string) A natural language description of the _expected_ sequence of actions or outcome. This is used by scorers.
     - **`task`**: An async function that executes the actual test logic for a given `input`. It orchestrates the interaction with the AI/system and performs assertions.
     - **`scorers`**: An array of scoring functions (e.g., `checkFactuality`) that evaluate the test outcome based on the `promptOutput` returned by the `task` and the `expected` string from the `data`.
     - **`threshold`**: (number, usually between 0 and 1) The minimum score required from the scorers for a test case to pass. A threshold of `1` means a perfect score is required.
     - **`timeout`**: (number) Maximum time in milliseconds allowed for a single test case.

2. **`task(input)` Function**: The heart of the eval. It typically involves:

   - **Setup**: Initializing a client or test environment (`initializeClient`). This prepares the system for the test, configuring the available tools or connections.
   - **Execution**: Running the actual interaction (`runTask`). This function sends the `input` instruction to the AI model via the client and captures the results, which usually include:
     - `promptOutput`: The textual response from the AI model.
     - `toolCalls`: A structured list of the tools the AI invoked, along with the arguments passed to each tool.
   - **Assertions (`expect`)**: Using the testing framework's assertion library (`vitest`'s `expect` in the examples) to verify that the correct tools were called with the correct arguments, based on the `toolCalls` data. Sometimes this involves inspecting system state directly (e.g., reading a file created by a tool) to confirm the outcome.
   - **Return Value**: The `task` function usually returns the `promptOutput` to be evaluated by the `scorers`.

3. **Scoring (`checkFactuality`, etc.)**: Automated functions that compare the actual outcome (represented by the `promptOutput`, and implicitly by the assertions that passed within the `task`) against the `expected` description.

4. **Helper Utilities (`./utils`)** (a sketch of these helpers follows this list):
   - `initializeClient()`: Sets up the testing environment, connects to the system under test, and configures the available tools for the AI model.
   - `runTask(client, model, input)`: Sends the input prompt to the specified AI model using the configured client, executes the model's reasoning and tool use, and returns the results (`promptOutput`, `toolCalls`).
   - `eachModel()`: (Optional) A utility to run the same evaluation suite against multiple AI models.
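
The exact shape of these helpers depends on the system under test, so the following is only a minimal sketch of the contracts they might expose. The type names (`EvalClient`, `ToolCall`, `TaskResult`) and the `callTool` method are assumptions made for illustration, not part of any existing API:

```typescript
// utils.ts (illustrative sketch only; names and shapes are assumptions)

// One recorded tool invocation captured while the model works on a prompt.
export interface ToolCall {
  toolName: string // e.g. 'kv_write' or 'my_tool'
  args: Record<string, unknown> // arguments the model passed to the tool
}

// Whatever handle the system under test needs: an MCP client, a Miniflare
// instance, a thin fetch wrapper, etc.
export interface EvalClient {
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>
}

// Result of running a single input prompt through the model.
export interface TaskResult {
  promptOutput: string // the model's final textual answer
  toolCalls: ToolCall[] // every tool invocation, in order
}

// Prepares the environment and exposes the tools the model may use.
export declare function initializeClient(): Promise<EvalClient>

// Sends `input` to `model`, lets the model call tools through `client`, and
// returns the captured output and tool calls for assertions and scoring.
export declare function runTask(
  client: EvalClient,
  model: string,
  input: string
): Promise<TaskResult>
```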

## Steps to Implement Evals

1. **Identify Tools:** Define the specific actions or functions (the "tools") that the AI should be able to use within the system you're testing (e.g., `kv_write`, `d1_query`, `container_exec`).
2. **Create Helper Functions:** Implement your `initializeClient` and `runTask` (or similarly named) functions.
   - `initializeClient`: Should set up the necessary context, potentially using test environments like `vitest-environment-miniflare` for Workers. It needs to make the defined tools available to the AI model simulation.
   - `runTask`: Needs to simulate the AI processing: take an input prompt, interact with an LLM (or a mock) configured with the tools, capture which tools are called and with what arguments, and capture the final text output.
3. **Create Eval File (`*.eval.ts`):** Create a new file (e.g., `kv-operations.eval.ts`).
4. **Import Dependencies:** Import `describeEval`, scorers, helpers, `expect`, etc.
5. **Structure with `describeEval`:** Define your evaluation suite.
6. **Define Test Cases (`data`):** Write specific test scenarios:
   - Provide clear, unambiguous `input` prompts that target the tools you want to test.
   - Write concise `expected` descriptions detailing the primary tool calls or outcomes anticipated.
7. **Implement the `task` Function:**
   - Call `initializeClient`.
   - Call `runTask` with the `input`.
   - Write `expect` assertions to rigorously check:
     - Were the correct tools called? (`toolName`)
     - Were they called in the expected order (if applicable)?
     - Were the arguments passed to the tools correct? (`args`)
   - (Optional) Interact with the system state if necessary to verify side effects.
   - Return the `promptOutput`.
8. **Configure Scorers and Threshold:** Choose appropriate scorers (often `checkFactuality`) and set a `threshold`.
9. **Run Tests:** Execute the evals using your test runner (e.g., `vitest run`). A sample configuration sketch follows these steps.
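
Note that `*.eval.ts` files do not match vitest's default test file pattern, so the project configuration usually needs to include them explicitly. The snippet below is a minimal sketch assuming a standard `vitest.config.ts`; the include pattern and timeout are placeholders to adapt to your setup:

```typescript
// vitest.config.ts — minimal sketch; adjust paths and timeouts for your project
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    // Pick up eval suites in addition to (or instead of) regular unit tests.
    include: ['**/*.eval.ts'],
    // Evals call out to models and tools, so allow generous per-test time.
    testTimeout: 60_000,
  },
})
```

With a configuration like this in place, `vitest run` executes every `*.eval.ts` suite it finds.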

## Example Structure (Simplified)

```typescript
// my-feature.eval.ts
import { expect } from 'vitest'
import { describeEval } from 'vitest-evals'

import { checkFactuality } from '@repo/eval-tools/src/scorers'

import { initializeClient, runTask } from './utils'

describeEval('Tests My Feature Tool Interactions', {
  data: async () => [
    {
      input: 'Use my_tool to process the data "example"',
      expected: 'The my_tool tool was called with data set to "example"',
    },
    // ... more test cases
  ],
  task: async (input) => {
    const client = await initializeClient() // Sets up environment with my_tool
    const { promptOutput, toolCalls } = await runTask(client, 'your-model', input)

    // Check if my_tool was called
    const myToolCall = toolCalls.find((call) => call.toolName === 'my_tool')
    expect(myToolCall).toBeDefined()

    // Check arguments passed to my_tool
    expect(myToolCall?.args).toEqual(
      expect.objectContaining({
        data: 'example',
        // ... other expected args
      })
    )

    return promptOutput // Return AI output for scoring
  },
  scorers: [checkFactuality],
  threshold: 1,
})
```

## Best Practices

- **Clear Inputs:** Write inputs as clear, actionable instructions.
- **Specific Expected Outcomes:** Make `expected` descriptions precise enough for scorers, but focus on the key actions.
- **Targeted Assertions:** Use `expect` to verify the most critical aspects of tool calls (tool name, key arguments). Don't over-assert on trivial details unless necessary.
- **Isolate Tests:** Ensure each test case in `data` tests a specific interaction or a small sequence of interactions.
- **Helper Functions:** Keep `initializeClient` and `runTask` generic enough to be reused across different eval files for the same system.
- **Use `expect.objectContaining` or `expect.stringContaining`:** Often you only need to verify _parts_ of the arguments, not the entire structure, which makes tests less brittle (see the snippet after this list).
- **Descriptive Names:** Use clear names for `describeEval` blocks and meaningful `input`/`expected` strings.
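
As an illustration of the partial-matching point above, these matchers let an assertion name only the fields that matter; the tool call and `args` shape below are hypothetical:

```typescript
// Assert only the fields we care about; unrelated extra args won't fail the test.
expect(myToolCall?.args).toEqual(
  expect.objectContaining({
    key: 'user:123', // exact match on one argument
    value: expect.stringContaining('example'), // substring match on another
  })
)
```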