Contract structure

A contract has three parts: the tool name, assertions, and golden cases.

tool_call.yaml
# Does my AI call the right tool?

tool: "get_weather"

assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    - path: "$.tool_calls[0].arguments"
      exists: true

golden_cases:
  - id: "tool_call_success"
    input_ref: "tool_call.success.json"
    expect_ok: true

  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]

tool

The tool name the model should call. ReplayCI checks that the model's response contains a tool call matching this name.

assertions

JSON path checks applied to the provider request and the model's response. Two types:

  • output_invariants — checks on the model's response (tool call name, arguments)
  • input_invariants — checks on the request sent to the provider (messages array, tools array)
Each assertion supports the following fields:

  • path — JSON path expression (e.g. $.tool_calls[0].name)
  • equals — Exact value match
  • exists — Check that the path exists (true/false)
  • type — Expected type ("array", "object", "string", "number")
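To make the semantics of these fields concrete, here is a minimal sketch of how such an assertion could be evaluated against a response. The `Assertion` shape mirrors the fields above; `resolvePath` is a toy resolver for simple `$.a[0].b` expressions, not a full JSONPath engine, and none of this is ReplayCI's actual implementation.

```typescript
// Illustrative assertion check: path, equals, exists, type.
type Assertion = {
  path: string;
  equals?: unknown;
  exists?: boolean;
  type?: "array" | "object" | "string" | "number";
};

// Toy resolver: "$.tool_calls[0].name" -> ["tool_calls", "0", "name"]
function resolvePath(obj: unknown, path: string): unknown {
  const parts = path.replace(/^\$\.?/, "").split(/[.\[\]]+/).filter(Boolean);
  return parts.reduce<any>((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

function checkAssertion(response: unknown, a: Assertion): boolean {
  const value = resolvePath(response, a.path);
  if (a.exists !== undefined && (value !== undefined) !== a.exists) return false;
  if (a.equals !== undefined && value !== a.equals) return false;
  if (a.type !== undefined) {
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== a.type) return false;
  }
  return true;
}

const response = {
  tool_calls: [{ name: "get_weather", arguments: '{"location": "SF"}' }],
};
console.log(checkAssertion(response, { path: "$.tool_calls[0].name", equals: "get_weather" })); // true
console.log(checkAssertion(response, { path: "$.tool_calls[0].arguments", exists: true }));     // true
```

All three field kinds compose on one assertion: a check passes only if every field it declares holds.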

golden_cases

Specific input/output pairs to test. Each golden case references a fixture file.

Each golden case supports the following fields:

  • id — Unique identifier for this test case
  • input_ref — Path to fixture file (relative to golden directory)
  • expect_ok — Whether this case should pass (true) or fail (false)
  • expected_error — For failure cases: the expected failure classification
  • provider_modes — Restrict to specific modes: ["recorded"] for offline-only cases

Golden fixture files

A golden fixture is a JSON file containing the request and expected response:

tool_call.success.json
{
  "boundary": {
    "model": "gpt-4o-mini",
    "tool_schema_hash": "starter_tool_001",
    "message_hash": "starter_msg_001",
    "system_hash": "starter_sys_001"
  },
  "request": {
    "model": "gpt-4o-mini",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant..." },
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [{ "type": "function", "function": { "name": "get_weather", ... } }]
  },
  "response": {
    "success": true,
    "tool_calls": [
      { "id": "call_001", "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"location\": \"San Francisco\"}" } }
    ],
    "content": null
  }
}
  • boundary — identifies the test context, used for fingerprinting and determinism proof.
  • request — the full request payload sent to the provider (OpenAI chat completions format).
  • response — the expected response. For success: tool_calls array. For failure: a response that violates assertions.
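A structural check for these three sections might look like the sketch below. The section names come from the fixture format above; the function name and the idea of validating fixtures eagerly at load time are illustrative assumptions, not ReplayCI's API.

```typescript
// Illustrative structural check: a success fixture should carry
// "boundary", "request", and "response" top-level sections.
function validateFixture(raw: Record<string, unknown>): string[] {
  const missing: string[] = [];
  for (const key of ["boundary", "request", "response"]) {
    if (typeof raw[key] !== "object" || raw[key] === null) missing.push(key);
  }
  return missing; // empty array means the fixture is structurally valid
}

const fixture = {
  boundary: { model: "gpt-4o-mini", tool_schema_hash: "starter_tool_001" },
  request: { model: "gpt-4o-mini", messages: [], tools: [] },
  response: { success: true, tool_calls: [], content: null },
};
console.log(validateFixture(fixture));                      // []
console.log(validateFixture({ response: fixture.response })); // ["boundary", "request"]
```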

Negative test cases

Negative cases test that ReplayCI correctly detects failures. Place negative fixtures in a negative/ subdirectory.

Example: the model returns text instead of calling the tool:

negative/tool_call.not_invoked.json
{
  "response": {
    "success": true,
    "tool_calls": [],
    "content": "I don't have access to real-time weather data."
  }
}

Mark negative cases with expect_ok: false and expected_error in the contract. The provider_modes: ["recorded"] restriction means this case only runs against recorded fixtures.

Failure classifications

  • tool_not_invoked — Model returned text instead of calling the tool
  • malformed_arguments — Tool arguments aren't valid JSON
  • schema_violation — Arguments don't match the expected schema
  • wrong_tool — Model called a different tool than expected
  • unexpected_error — Provider returned an error

Each failure also gets a fingerprint — a short hash that identifies the specific failure pattern.
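To show how a response might map onto these classifications and yield a fingerprint, here is a hedged sketch. The function names (`classifyFailure`, `fingerprint`), the exact hash inputs, and the 12-character truncation are illustrative assumptions, not ReplayCI internals.

```typescript
import { createHash } from "node:crypto";

type ToolCall = { function: { name: string; arguments: string } };
type Response = { tool_calls: ToolCall[]; content: string | null };

// Illustrative classification: empty tool_calls, wrong tool name,
// or unparseable arguments each map to a classification from the table above.
function classifyFailure(resp: Response, expectedTool: string): string | null {
  if (resp.tool_calls.length === 0) return "tool_not_invoked";
  const call = resp.tool_calls[0].function;
  if (call.name !== expectedTool) return "wrong_tool";
  try {
    JSON.parse(call.arguments);
  } catch {
    return "malformed_arguments";
  }
  return null; // no failure detected
}

// Illustrative fingerprint: a short, stable hash over the failure pattern,
// so identical failures map to the same ID across runs.
function fingerprint(classification: string, tool: string): string {
  return createHash("sha256").update(`${classification}:${tool}`).digest("hex").slice(0, 12);
}

const notInvoked: Response = { tool_calls: [], content: "I don't have access..." };
console.log(classifyFailure(notInvoked, "get_weather")); // "tool_not_invoked"
```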

Pack structure

Directory layout
packs/my-pack/
  pack.yaml              # Pack metadata
  nr-allowlist.json      # Non-reproducible allowlist (can be empty: [])
  contracts/
    tool_call.yaml       # Contract files
    search.yaml
  golden/
    tool_call.success.json      # Success fixtures
    search.success.json
    negative/
      tool_call.not_invoked.json   # Negative fixtures

pack.yaml

pack.yaml
pack_id: "my-pack"
name: "my-tool-calling-tests"
version: "0.1.0"
schema_version: "v0.1"

provider: "openai"
default_model: "gpt-4o-mini"

paths:
  contracts_dir: "packs/my-pack/contracts"
  golden_dir: "packs/my-pack/golden"
  negative_golden_dir: "packs/my-pack/golden/negative"

contracts:
  - tool_call.yaml
  - search.yaml

Optional contract fields

Advanced options
timeouts:
  total_ms: 30000

retries:
  max_attempts: 2
  retry_on:
    - "429"
    - "5xx"
    - "timeout"

rate_limits:
  on_429:
    respect_retry_after: true
    max_sleep_seconds: 60
  • timeouts.total_ms — Maximum time for the provider call (default: 30000)
  • retries.max_attempts — Number of retries on transient errors
  • retries.retry_on — Error types that trigger a retry
  • rate_limits.on_429 — How to handle rate limit responses
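The retry_on labels ("429", "5xx", "timeout") can be read as coarse outcome classes. A minimal sketch of mapping a call outcome onto those labels, assuming hypothetical `classifyOutcome` and `shouldRetry` helpers rather than ReplayCI's actual retry logic:

```typescript
type RetryConfig = { max_attempts: number; retry_on: string[] };

// Map an HTTP status (or a timeout) onto the retry_on labels above.
function classifyOutcome(status: number | "timeout"): string {
  if (status === "timeout") return "timeout";
  if (status === 429) return "429";
  if (status >= 500 && status < 600) return "5xx";
  return "ok";
}

function shouldRetry(status: number | "timeout", cfg: RetryConfig): boolean {
  return cfg.retry_on.includes(classifyOutcome(status));
}

const cfg: RetryConfig = { max_attempts: 2, retry_on: ["429", "5xx", "timeout"] };
console.log(shouldRetry(429, cfg)); // true
console.log(shouldRetry(503, cfg)); // true
console.log(shouldRetry(400, cfg)); // false
```

Under this reading, a 400 is a permanent error and is never retried, while max_attempts caps how many times the transient classes are retried.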

Running your tests

Terminal
# Using .replayci.yml
npx replayci

# With CLI flag override
npx replayci --pack packs/my-pack

# Against recorded fixtures (offline, deterministic)
npx replayci --pack packs/my-pack --provider recorded

# Against a live provider
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini

See CLI Reference for all flags.