Contract structure

A contract has three parts: the tool name, assertions, and golden cases.

tool_call.yaml
# Does my AI call the right tool?

tool: "get_weather"

assertions:
  output_invariants:
    - path: "$.tool_calls[0].name"
      equals: "get_weather"
    - path: "$.tool_calls[0].arguments"
      exists: true

golden_cases:
  - id: "tool_call_success"
    input_ref: "tool_call.success.json"
    expect_ok: true

  - id: "tool_not_invoked"
    input_ref: "negative/tool_call.not_invoked.json"
    expect_ok: false
    expected_error: "tool_not_invoked"
    provider_modes: ["recorded"]

tool

The tool name the model should call. ReplayCI checks that the model's response contains a tool call matching this name.

assertions

JSON path checks applied to the provider request and the model's response. Two types:

  • output_invariants — checks on the model's response (tool call name, arguments)
  • input_invariants — checks on the request sent to the provider (messages array, tools array)
Each assertion supports the following fields:

  • path — JSON path expression (e.g. $.tool_calls[0].name)
  • equals — Exact value match
  • exists — Check that the path exists (true/false)
  • type — Expected type ("array", "object", "string", "number")
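To make the semantics of these fields concrete, here is a minimal sketch of how such an assertion could be evaluated against a response. The `Assertion` shape mirrors the fields above; `resolvePath` is a toy resolver for simple `$.a[0].b` expressions, not a full JSONPath engine, and none of this is ReplayCI's actual implementation.

```typescript
// Illustrative assertion check: path, equals, exists, type.
type Assertion = {
  path: string;
  equals?: unknown;
  exists?: boolean;
  type?: "array" | "object" | "string" | "number";
};

// Toy resolver: "$.tool_calls[0].name" -> ["tool_calls", "0", "name"]
function resolvePath(obj: unknown, path: string): unknown {
  const parts = path.replace(/^\$\.?/, "").split(/[.\[\]]+/).filter(Boolean);
  return parts.reduce<any>((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

function checkAssertion(response: unknown, a: Assertion): boolean {
  const value = resolvePath(response, a.path);
  if (a.exists !== undefined && (value !== undefined) !== a.exists) return false;
  if (a.equals !== undefined && value !== a.equals) return false;
  if (a.type !== undefined) {
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== a.type) return false;
  }
  return true;
}

const response = {
  tool_calls: [{ name: "get_weather", arguments: '{"location": "SF"}' }],
};
console.log(checkAssertion(response, { path: "$.tool_calls[0].name", equals: "get_weather" })); // true
console.log(checkAssertion(response, { path: "$.tool_calls[0].arguments", exists: true }));     // true
```

All three field kinds compose on one assertion: a check passes only if every field it declares holds.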

golden_cases

Specific input/output pairs to test. Each golden case references a fixture file.

Each golden case supports the following fields:

  • id — Unique identifier for this test case
  • input_ref — Path to fixture file (relative to golden directory)
  • expect_ok — Whether this case should pass (true) or fail (false)
  • expected_error — For failure cases: the expected failure classification
  • provider_modes — Restrict to specific modes: ["recorded"] for offline-only cases

Golden fixture files

A golden fixture is a JSON file containing the request and expected response:

tool_call.success.json
{
  "boundary": {
    "model": "gpt-4o-mini",
    "tool_schema_hash": "starter_tool_001",
    "message_hash": "starter_msg_001",
    "system_hash": "starter_sys_001"
  },
  "request": {
    "model": "gpt-4o-mini",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant..." },
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [{ "type": "function", "function": { "name": "get_weather", ... } }]
  },
  "response": {
    "success": true,
    "tool_calls": [
      { "id": "call_001", "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"location\": \"San Francisco\"}" } }
    ],
    "content": null
  }
}
  • boundary — identifies the test context, used for fingerprinting and determinism proof.
  • request — the full request payload sent to the provider (OpenAI chat completions format).
  • response — the expected response. For success: tool_calls array. For failure: a response that violates assertions.
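A structural check for these three sections might look like the sketch below. The section names come from the fixture format above; the function name and the idea of validating fixtures eagerly at load time are illustrative assumptions, not ReplayCI's API.

```typescript
// Illustrative structural check: a success fixture should carry
// "boundary", "request", and "response" top-level sections.
function validateFixture(raw: Record<string, unknown>): string[] {
  const missing: string[] = [];
  for (const key of ["boundary", "request", "response"]) {
    if (typeof raw[key] !== "object" || raw[key] === null) missing.push(key);
  }
  return missing; // empty array means the fixture is structurally valid
}

const fixture = {
  boundary: { model: "gpt-4o-mini", tool_schema_hash: "starter_tool_001" },
  request: { model: "gpt-4o-mini", messages: [], tools: [] },
  response: { success: true, tool_calls: [], content: null },
};
console.log(validateFixture(fixture));                      // []
console.log(validateFixture({ response: fixture.response })); // ["boundary", "request"]
```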

Negative test cases

Negative cases test that ReplayCI correctly detects failures. Place negative fixtures in a negative/ subdirectory.

Example: the model returns text instead of calling the tool:

negative/tool_call.not_invoked.json
{
  "response": {
    "success": true,
    "tool_calls": [],
    "content": "I don't have access to real-time weather data."
  }
}

Mark negative cases with expect_ok: false and expected_error in the contract. The provider_modes: ["recorded"] restriction means this case only runs against recorded fixtures.

Failure classifications

  • tool_not_invoked — Model returned text instead of calling the tool
  • malformed_arguments — Tool arguments aren't valid JSON
  • schema_violation — Arguments don't match the expected schema
  • wrong_tool — Model called a different tool than expected
  • unexpected_error — Provider returned an error

Each failure also gets a fingerprint — a short hash that identifies the specific failure pattern.
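To show how a response might map onto these classifications and yield a fingerprint, here is a hedged sketch. The function names (`classifyFailure`, `fingerprint`), the exact hash inputs, and the 12-character truncation are illustrative assumptions, not ReplayCI internals.

```typescript
import { createHash } from "node:crypto";

type ToolCall = { function: { name: string; arguments: string } };
type Response = { tool_calls: ToolCall[]; content: string | null };

// Illustrative classification: empty tool_calls, wrong tool name,
// or unparseable arguments each map to a classification from the table above.
function classifyFailure(resp: Response, expectedTool: string): string | null {
  if (resp.tool_calls.length === 0) return "tool_not_invoked";
  const call = resp.tool_calls[0].function;
  if (call.name !== expectedTool) return "wrong_tool";
  try {
    JSON.parse(call.arguments);
  } catch {
    return "malformed_arguments";
  }
  return null; // no failure detected
}

// Illustrative fingerprint: a short, stable hash over the failure pattern,
// so identical failures map to the same ID across runs.
function fingerprint(classification: string, tool: string): string {
  return createHash("sha256").update(`${classification}:${tool}`).digest("hex").slice(0, 12);
}

const notInvoked: Response = { tool_calls: [], content: "I don't have access..." };
console.log(classifyFailure(notInvoked, "get_weather")); // "tool_not_invoked"
```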

Pack structure

Directory layout
packs/my-pack/
  pack.yaml              # Pack metadata
  nr-allowlist.json      # Non-reproducible allowlist (can be empty: [])
  contracts/
    tool_call.yaml       # Contract files
    search.yaml
  golden/
    tool_call.success.json      # Success fixtures
    search.success.json
    negative/
      tool_call.not_invoked.json   # Negative fixtures

pack.yaml

pack.yaml
pack_id: "my-pack"
name: "my-tool-calling-tests"
version: "0.1.0"
schema_version: "v0.1"

provider: "openai"
default_model: "gpt-4o-mini"

paths:
  contracts_dir: "packs/my-pack/contracts"
  golden_dir: "packs/my-pack/golden"
  negative_golden_dir: "packs/my-pack/golden/negative"

contracts:
  - tool_call.yaml
  - search.yaml

Optional contract fields

Advanced options
timeouts:
  total_ms: 30000

retries:
  max_attempts: 2
  retry_on:
    - "429"
    - "5xx"
    - "timeout"

rate_limits:
  on_429:
    respect_retry_after: true
    max_sleep_seconds: 60
  • timeouts.total_ms — Maximum time for the provider call (default: 30000)
  • retries.max_attempts — Number of retries on transient errors
  • retries.retry_on — Error types that trigger a retry
  • rate_limits.on_429 — How to handle rate limit responses
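The retry_on labels ("429", "5xx", "timeout") can be read as coarse outcome classes. A minimal sketch of mapping a call outcome onto those labels, assuming hypothetical `classifyOutcome` and `shouldRetry` helpers rather than ReplayCI's actual retry logic:

```typescript
type RetryConfig = { max_attempts: number; retry_on: string[] };

// Map an HTTP status (or a timeout) onto the retry_on labels above.
function classifyOutcome(status: number | "timeout"): string {
  if (status === "timeout") return "timeout";
  if (status === 429) return "429";
  if (status >= 500 && status < 600) return "5xx";
  return "ok";
}

function shouldRetry(status: number | "timeout", cfg: RetryConfig): boolean {
  return cfg.retry_on.includes(classifyOutcome(status));
}

const cfg: RetryConfig = { max_attempts: 2, retry_on: ["429", "5xx", "timeout"] };
console.log(shouldRetry(429, cfg)); // true
console.log(shouldRetry(503, cfg)); // true
console.log(shouldRetry(400, cfg)); // false
```

Under this reading, a 400 is a permanent error and is never retried, while max_attempts caps how many times the transient classes are retried.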

Running your tests

Terminal
# Using .replayci.yml
npx replayci

# With CLI flag override
npx replayci --pack packs/my-pack

# Against recorded fixtures (offline, deterministic)
npx replayci --pack packs/my-pack --provider recorded

# Against a live provider
npx replayci --pack packs/my-pack --provider openai --model gpt-4o-mini

See CLI Reference for all flags.