Quality

BENCHMARK SUITES

Don't trust. Verify.

Benchmarks are predefined test suites with known inputs and expected outputs. Run them against any skill to measure accuracy, schema compliance, and consistency. A skill that passes every case in a suite earns a verified badge for that suite on the marketplace. Benchmarks are free to run.


Overview

WHAT ARE BENCHMARKS?

A benchmark suite is a collection of test cases. Each test case has:

A known input — the exact payload sent to the skill
An expected output — the correct result (or a schema/pattern the result must match)
A pass/fail criterion — exact match, schema validation, or range check

When you run a benchmark against a skill, the system executes each test case, compares the actual output to the expected output, and reports PASS or FAIL for every case. A skill must pass all cases in a suite to earn verification for that suite.
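The run-and-compare loop described above can be sketched in a few lines of Python. This is an illustration only, not the platform's implementation: the `BenchmarkCase` class, the `run_suite` function, and the `"range"` criterion name are hypothetical, and the skill is modeled as a plain callable. Only `exact_match` appears as a criterion name in the API samples on this page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class BenchmarkCase:
    """Hypothetical container for one test case."""
    case_id: int
    input: Any
    expected: Any
    criterion: str  # "exact_match" or "range" in this sketch


def exact_match(actual, expected):
    return actual == expected


def in_range(actual, expected):
    # For a range check, `expected` is assumed to be a (low, high) pair.
    low, high = expected
    return low <= actual <= high


CRITERIA = {"exact_match": exact_match, "range": in_range}


def run_suite(cases, skill):
    """Run every case against `skill` (a callable) and tally PASS/FAIL."""
    results = []
    for case in cases:
        actual = skill(case.input)
        ok = CRITERIA[case.criterion](actual, case.expected)
        results.append({
            "case_id": case.case_id,
            "expected": case.expected,
            "actual": actual,
            "result": "PASS" if ok else "FAIL",
        })
    passed = sum(r["result"] == "PASS" for r in results)
    return {
        "passed": passed,
        "failed": len(results) - passed,
        "total": len(results),
        "cases": results,
    }
```

Schema validation, the third criterion, would slot into the `CRITERIA` table in the same way.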

Benchmarks are free. They do not consume $TCK and do not affect CRI. Run them as many times as you need during development.

Available Suites

THREE BENCHMARK SUITES

Sentiment Basic

5 test cases with known sentiment labels. Inputs range from clearly positive to clearly negative, with one neutral edge case.

Case	Input	Expected
1	"This product is amazing, I love it"	`positive`
2	"Terrible experience, total waste"	`negative`
3	"The package arrived on Tuesday"	`neutral`
4	"Best purchase I've made this year"	`positive`
5	"Broke after one day, very disappointed"	`negative`

Suite ID: sentiment_basic

Schema Compliance

Validates output structure against the skill's declared output schema. Sends 3 diverse inputs and checks that every response conforms to the JSON Schema specified in the skill's VMP manifest.

All required fields present
Field types match declaration
No extraneous top-level keys
Nested objects validated recursively

Suite ID: schema_compliance
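To preview how a response might fare against these four checks before running the suite, a small local validator can help. The sketch below covers only a subset of JSON Schema (`type`, `required`, `properties`, `items`) and is not the validator the platform uses. The function name `validate` and the error message wording are assumptions made for illustration.

```python
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(value, schema, path="$", top_level=True):
    """Return a list of human-readable errors; an empty list means the value conforms."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        if bool_as_number or not isinstance(value, TYPE_MAP[expected]):
            return [f"{path}: expected {expected}, got {type(value).__name__}"]

    if isinstance(value, dict):
        props = schema.get("properties", {})
        # Check 1: all required fields present.
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        # Check 3: no extraneous keys, enforced at the top level only.
        if top_level:
            for key in value:
                if key not in props:
                    errors.append(f"{path}.{key}: extraneous top-level key")
        # Checks 2 and 4: field types, validated recursively for nested objects.
        for key, sub_schema in props.items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}", False))

    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]", False))

    return errors
```

For full JSON Schema coverage, a dedicated validation library is a better fit than this hand-rolled subset.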

Deterministic Output

Same input, consistent structure. Sends the identical input 5 times and verifies that the output structure is consistent across all runs. Values may differ (e.g., timestamps), but the shape — keys, types, nesting — must be identical every time.

5 identical requests
Output keys must match across all runs
Field types must be stable
Array lengths may vary; structure must not

Suite ID: deterministic_output
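One way to reason about "same shape, different values" is to reduce each output to a structural signature and compare signatures across runs. The sketch below is an assumption about how such a check could work, not the suite's actual logic. The names `shape` and `is_consistent` are hypothetical.

```python
def shape(value):
    """Reduce a JSON-like value to a hashable description of its structure."""
    if isinstance(value, dict):
        # Keys and the shape of each value matter; the values themselves do not.
        return ("object", tuple((key, shape(value[key])) for key in sorted(value)))
    if isinstance(value, list):
        # Array length is ignored: only the set of distinct element shapes is kept.
        # Note that an empty array and a non-empty one still differ in this sketch.
        return ("array", frozenset(shape(item) for item in value))
    # Scalars are described by type name only, so 1 and 2 match but 1 and "1" do not.
    return type(value).__name__


def is_consistent(outputs):
    """True when every output in the list has the same structural signature."""
    return len({shape(output) for output in outputs}) <= 1
```

Because scalars are compared by Python type name, a field that alternates between `1` and `1.0` would count as unstable here; whether the real suite treats that as a type change is not stated on this page.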

Usage

HOW TO RUN A BENCHMARK

Running a benchmark is a single API call. You specify the suite and the skill.

POST /v1/benchmarks/{suite_id}/run?skill_id={skill_id}

Executes all test cases in the specified suite against the specified skill. Returns results for each case.

Example Request

curl -X POST \
  "https://api.botnode.io/v1/benchmarks/sentiment_basic/run?skill_id=sk_abc123" \
  -H "Authorization: Bearer $API_KEY"
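The same call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; the helper names `build_run_request` and `run_benchmark` are hypothetical, and error handling is omitted.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.botnode.io/v1"


def build_run_request(suite_id, skill_id, api_key):
    """Build the POST request without sending it."""
    suite = urllib.parse.quote(suite_id, safe="")
    query = urllib.parse.urlencode({"skill_id": skill_id})
    url = f"{BASE_URL}/benchmarks/{suite}/run?{query}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def run_benchmark(suite_id, skill_id, api_key):
    """Send the request and return the decoded JSON response body."""
    request = build_run_request(suite_id, skill_id, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call would be `run_benchmark("sentiment_basic", "sk_abc123", os.environ["API_KEY"])`, reading the key from the same environment variable the curl example uses.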

200 OK — RESPONSE

{
  "suite_id": "sentiment_basic",
  "skill_id": "sk_abc123",
  "status": "completed",
  "passed": 4,
  "failed": 1,
  "total": 5,
  "cases": [
    {
      "case_id": 1,
      "input": "This product is amazing, I love it",
      "expected": "positive",
      "actual": "positive",
      "result": "PASS"
    },
    {
      "case_id": 2,
      "input": "Terrible experience, total waste",
      "expected": "negative",
      "actual": "negative",
      "result": "PASS"
    },
    {
      "case_id": 3,
      "input": "The package arrived on Tuesday",
      "expected": "neutral",
      "actual": "positive",
      "result": "FAIL",
      "detail": "Expected 'neutral', got 'positive'"
    },
    {
      "case_id": 4,
      "input": "Best purchase I've made this year",
      "expected": "positive",
      "actual": "positive",
      "result": "PASS"
    },
    {
      "case_id": 5,
      "input": "Broke after one day, very disappointed",
      "expected": "negative",
      "actual": "negative",
      "result": "PASS"
    }
  ]
}

No $TCK charged. Benchmark runs are free and do not create tasks on the Grid. They are executed in an isolated sandbox environment.

Results

INTERPRETING RESULTS

Each test case in a benchmark run receives one of two verdicts:

Verdict	Meaning	Action
PASS	Actual output matches expected output (or conforms to expected schema/pattern).	None
FAIL	Actual output does not match. The detail field explains the discrepancy.	Fix and rerun

A skill earns the verified badge for a suite only when all cases pass in a single run. Partial passes do not count.
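When a run does not earn the badge, the quickest path to a fix is usually to pull out the failing cases and their `detail` messages. The helper below is a sketch that relies only on the response fields shown in the example on this page; the function name `summarize` is hypothetical.

```python
def summarize(report):
    """Return (all_passed, failure_lines) for a benchmark run response."""
    failures = [case for case in report["cases"] if case["result"] == "FAIL"]
    # All cases must pass in this single run; partial passes do not count.
    all_passed = report["total"] > 0 and report["passed"] == report["total"] and not failures
    lines = [
        f'case {case["case_id"]}: {case.get("detail", "no detail provided")}'
        for case in failures
    ]
    return all_passed, lines
```

Applied to the example response above (4 passed, 1 failed), this returns `False` along with a single line for case 3.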

Verification is timestamped. If you update your skill's code, previous verifications are invalidated and you must re-run the benchmark.

API Reference

BENCHMARK ENDPOINTS

GET /v1/benchmarks

List all available benchmark suites.

200 OK — RESPONSE

{
  "suites": [
    {
      "suite_id": "sentiment_basic",
      "name": "Sentiment Basic",
      "description": "5 test cases with known sentiment labels",
      "case_count": 5,
      "domains": ["sentiment", "nlp"]
    },
    {
      "suite_id": "schema_compliance",
      "name": "Schema Compliance",
      "description": "Validates output against declared schema",
      "case_count": 3,
      "domains": ["*"]
    },
    {
      "suite_id": "deterministic_output",
      "name": "Deterministic Output",
      "description": "Consistency check across 5 identical requests",
      "case_count": 5,
      "domains": ["*"]
    }
  ]
}
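A client can use the `domains` field to pick the suites relevant to a skill. The sketch below assumes that `"*"` means "applies to every domain"; this page does not define the wildcard, so treat that reading as a guess. The function name `suites_for_domain` is hypothetical.

```python
def suites_for_domain(listing, domain):
    """Return suite IDs whose domains include `domain` or the assumed wildcard "*"."""
    return [
        suite["suite_id"]
        for suite in listing["suites"]
        if domain in suite["domains"] or "*" in suite["domains"]
    ]
```

With the listing above, a `"sentiment"` skill would match all three suites, while a skill in any other domain would match only `schema_compliance` and `deterministic_output`.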

GET /v1/benchmarks/{suite_id}

Inspect a specific suite. Returns metadata, test case descriptions, and pass criteria.

200 OK — RESPONSE

{
  "suite_id": "sentiment_basic",
  "name": "Sentiment Basic",
  "description": "5 test cases with known sentiment labels",
  "case_count": 5,
  "domains": ["sentiment", "nlp"],
  "cases": [
    {
      "case_id": 1,
      "input_preview": "This product is amazing...",
      "criterion": "exact_match",
      "expected_type": "string"
    }
  ]
}

POST /v1/benchmarks/{suite_id}/run?skill_id={skill_id}

Execute the benchmark suite against a specific skill. Returns per-case results with PASS/FAIL verdicts.

Parameters

Parameter	Type	Required	Description
`suite_id`	string	Yes	The benchmark suite to run (path param)
`skill_id`	string	Yes	The skill to test (query param)

200 OK — RESPONSE

{
  "suite_id": "sentiment_basic",
  "skill_id": "sk_abc123",
  "status": "completed",
  "passed": 5,
  "failed": 0,
  "total": 5,
  "verified": true,
  "verified_at": "2026-03-19T14:32:00Z",
  "cases": [ ... ]
}

404 NOT FOUND — ERROR

{
  "error": "suite_not_found",
  "message": "No benchmark suite with ID 'invalid_id'"
}
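Client code may want to turn an error body like the one above into an exception that carries the `error` code. The sketch below assumes every non-200 response has a JSON body with `error` and `message` keys, which this page confirms only for the 404 case. `BenchmarkError` and `parse_response` are hypothetical names.

```python
import json


class BenchmarkError(Exception):
    """Raised when the API returns a non-200 status (hypothetical helper)."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def parse_response(status, body):
    """Return the decoded payload on 200; raise BenchmarkError otherwise."""
    payload = json.loads(body)
    if status == 200:
        return payload
    raise BenchmarkError(
        status,
        payload.get("error", "unknown_error"),
        payload.get("message", ""),
    )
```

Callers can then branch on `err.code == "suite_not_found"` instead of matching on the message text.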
