Contract structure
A contract has three parts: the tool name, assertions, and golden cases.
```yaml
# Does my AI call the right tool?
tool: "get_weather"

assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    - path: "$.tool_calls[0].arguments"
      exists: true

golden_cases:
  - id: "tool_call_success"
    input_ref: "tool_call.success.json"
    expect_ok: true
  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]
```
tool
The tool name the model should call. ReplayCI checks that the model's response contains a tool call matching this name.
assertions
JSON path checks on the request and the model's response. There are two types:
- output_invariants — checks on the model's response (tool call name, arguments)
- input_invariants — checks on the request sent to the provider (messages array, tools array)
| Field | Description |
|---|---|
| path | JSON path expression (e.g. $.tool_calls[0].name) |
| equals | Exact value match |
| exists | Check that the path exists (true/false) |
| type | Expected type ("array", "object", "string", "number") |
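The example at the top of this page uses only output_invariants. As a sketch of the second type, an input_invariants block could assert that the request carries the expected messages and tools arrays; the exact paths below are illustrative, not taken from a shipped contract:

```yaml
assertions:
  input_invariants:
    - path: "$.messages"          # request must include a messages array
      type: "array"
    - path: "$.tools[0].function.name"
      equals: "get_weather"       # the tool schema must be offered to the model
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
```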
golden_cases
Specific input/output pairs to test. Each golden case references a fixture file.
| Field | Description |
|---|---|
| id | Unique identifier for this test case |
| input_ref | Path to fixture file (relative to golden directory) |
| expect_ok | Whether this case should pass (true) or fail (false) |
| expected_error | For failure cases: the expected failure classification |
| provider_modes | Restrict to specific modes: ["recorded"] for offline-only cases |
Golden fixture files
A golden fixture is a JSON file containing the request and expected response:
```json
{
  "boundary": {
    "model": "gpt-4o-mini",
    "tool_schema_hash": "starter_tool_001",
    "message_hash": "starter_msg_001",
    "system_hash": "starter_sys_001"
  },
  "request": {
    "model": "gpt-4o-mini",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant..." },
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [{ "type": "function", "function": { "name": "get_weather", ... } }]
  },
  "response": {
    "success": true,
    "tool_calls": [
      { "id": "call_001", "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"location\": \"San Francisco\"}" } }
    ],
    "content": null
  }
}
```
- boundary — identifies the test context, used for fingerprinting and determinism proof.
- request — the full request payload sent to the provider (OpenAI chat completions format).
- response — the expected response. For a success case, a response with a populated tool_calls array; for a failure case, a response that violates the assertions.
Negative test cases
Negative cases test that ReplayCI correctly detects failures. Place negative fixtures in a negative/ subdirectory.
Example: the model returns text instead of calling the tool:
```json
{
  "response": {
    "success": true,
    "tool_calls": [],
    "content": "I don't have access to real-time weather data."
  }
}
```
Mark negative cases with expect_ok: false and expected_error in the contract. The provider_modes: ["recorded"] restriction means this case only runs against recorded fixtures.
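For reference, this is the contract entry that pairs with the fixture above, repeated from the example at the top of this page:

```yaml
golden_cases:
  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"   # lives under golden/negative/
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]
```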
Failure classifications
| Classification | Meaning |
|---|---|
| tool_not_invoked | Model returned text instead of calling the tool |
| malformed_arguments | Tool arguments aren't valid JSON |
| schema_violation | Arguments don't match the expected schema |
| wrong_tool | Model called a different tool than expected |
| unexpected_error | Provider returned an error |
Each failure also gets a fingerprint — a short hash that identifies the specific failure pattern.
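As another illustration, a negative fixture for the malformed_arguments classification could record a response whose arguments string is not valid JSON. The fixture below is hypothetical, not part of the starter pack; its contract entry would use expect_ok: false with expected_error: "malformed_arguments":

```json
{
  "response": {
    "success": true,
    "tool_calls": [
      { "id": "call_002", "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"location\": " } }
    ],
    "content": null
  }
}
```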
Pack structure
```
packs/my-pack/
  pack.yaml                        # Pack metadata
  nr-allowlist.json                # Non-reproducible allowlist (can be empty: [])
  contracts/
    tool_call.yaml                 # Contract files
    search.yaml
  golden/
    tool_call.success.json         # Success fixtures
    search.success.json
    negative/
      tool_call.not_invoked.json   # Negative fixtures
```
pack.yaml
```yaml
pack_id: "my-pack"
name: "my-tool-calling-tests"
version: "0.1.0"
schema_version: "v0.1"
provider: "openai"
default_model: "gpt-4o-mini"

paths:
  contracts_dir: "packs/my-pack/contracts"
  golden_dir: "packs/my-pack/golden"
  negative_golden_dir: "packs/my-pack/golden/negative"

contracts:
  - tool_call.yaml
  - search.yaml
```
Optional contract fields
```yaml
timeouts:
  total_ms: 30000

retries:
  max_attempts: 2
  retry_on:
    - "429"
    - "5xx"
    - "timeout"

rate_limits:
  on_429:
    respect_retry_after: true
    max_sleep_seconds: 60
```
| Field | Description |
|---|---|
| timeouts.total_ms | Maximum time for the provider call (default: 30000) |
| retries.max_attempts | Number of retries on transient errors |
| retries.retry_on | Error types that trigger a retry |
| rate_limits.on_429 | How to handle rate limit responses |
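Putting it together, a contract that uses these optional fields might look like the sketch below. It assumes the optional blocks sit at the top level of the contract file alongside tool, assertions, and golden_cases:

```yaml
tool: "get_weather"

timeouts:
  total_ms: 30000
retries:
  max_attempts: 2
  retry_on: ["429", "5xx", "timeout"]

assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"

golden_cases:
  - id: "tool_call_success"
    input_ref: "tool_call.success.json"
    expect_ok: true
```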
Running your tests
```bash
# Using .replayci.yml
npx replayci

# With CLI flag override
npx replayci --pack packs/my-pack

# Against recorded fixtures (offline, deterministic)
npx replayci --pack packs/my-pack --provider recorded

# Against a live provider
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini
```
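Because recorded mode needs no API keys, it fits naturally in CI. A minimal GitHub Actions sketch is shown below; the workflow layout and file name are assumptions, only the final command comes from this page:

```yaml
# .github/workflows/replayci.yml (hypothetical workflow)
name: replayci
on: [pull_request]
jobs:
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Run the pack offline against recorded fixtures
      - run: npx replayci --pack packs/my-pack --provider recorded
```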
See CLI Reference for all flags.