Writing Acceptance Criteria for AI Agents
Machine-readable completion criteria that become the Expert Question in disputes. How to write criteria that agents can satisfy and reviewers can verify.
The completionCriteria array in your Execution Manifest is not just documentation - it becomes the Expert Question if a dispute arises. Write vague criteria, get vague disputes. Write specific, verifiable criteria, get binary answers.
The Anatomy of Good Criteria
Every criterion should be:
- Binary - Either satisfied or not. No "mostly satisfied."
- Verifiable - A reviewer can check it against the deliverable.
- Specific - Quantified where possible.
- Complete - Captures all essential requirements.
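One way to screen draft criteria against these properties is a quick lint for subjective wording. The helper below is a hypothetical sketch, not part of any SAISA tooling, and its term list is illustrative rather than exhaustive:

```typescript
// Hypothetical draft-stage lint: flags subjective words that make a
// criterion non-binary. The term list is illustrative, not exhaustive.
const VAGUE_TERMS = ["good", "relevant", "appropriate", "reasonable", "sufficient", "adequate"];

function vagueTerms(criterion: string): string[] {
  const words = criterion.toLowerCase().split(/\W+/);
  return VAGUE_TERMS.filter((term) => words.includes(term));
}
```

Running this over "Provide a good security analysis" flags "good", while a quantified criterion such as "Each finding includes CVSS 3.1 base score" comes back clean.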
Bad vs. Good Criteria
Bad Criteria
{
  "completionCriteria": [
    "Provide a good security analysis",
    "Cover all relevant vulnerabilities",
    "Make recommendations"
  ]
}
What is "good"? What is "relevant"? These are subjective and unverifiable.
Good Criteria
{
  "completionCriteria": [
    "Report identifies all OWASP Top 10 2021 vulnerability categories",
    "Each finding includes CVSS 3.1 base score",
    "Each finding includes reproduction steps",
    "Remediation recommendations include code examples",
    "Executive summary does not exceed 500 words"
  ]
}
Each criterion is binary and verifiable. A reviewer can check: Did it cover all 10? Is there a CVSS score? Are there code examples?
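Criteria like the word limit above can even be checked mechanically. A minimal sketch, assuming the deliverable exposes its executive summary as a string (the parameter name is an assumption, not a SAISA convention):

```typescript
// Checks "Executive summary does not exceed 500 words".
// The `executiveSummary` parameter name is an assumption about
// how the deliverable is structured.
function summaryWithinLimit(executiveSummary: string, maxWords = 500): boolean {
  const words = executiveSummary.trim().split(/\s+/).filter(Boolean);
  return words.length <= maxWords;
}
```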
Quantification Patterns
Whenever possible, quantify your criteria:
{
  "completionCriteria": [
    // Coverage criteria
    "Analysis covers all 14 AWS services listed in Exhibit A",
    "Review includes all 47 API endpoints in the OpenAPI spec",

    // Format criteria
    "Each section includes at least 3 actionable recommendations",
    "Report contains minimum 10 code examples",

    // Threshold criteria
    "Achieves 85%+ accuracy on validation dataset",
    "Response latency under 200ms for 95th percentile",

    // Completeness criteria
    "All fields in output schema are populated",
    "No placeholder text remains in deliverables"
  ]
}
Reference External Standards
Instead of inventing your own standards, reference existing ones:
{
  "completionCriteria": [
    // Reference industry standards
    "Compliant with SOC 2 Type II Trust Services Criteria",
    "Follows OWASP ASVS 4.0 Level 2 requirements",
    "Adheres to NIST 800-53 Rev 5 control families",

    // Reference exhibit documents
    "Satisfies all requirements in Exhibit A (RFP)",
    "Matches format specified in Exhibit B (Template)",

    // Reference previous work
    "Consistent with style guide in Paper #ABC123"
  ]
}
Negative Criteria
Sometimes what you don't want is as important as what you do:
{
  "completionCriteria": [
    // Exclusion criteria
    "No personally identifiable information in output",
    "No external API calls to non-whitelisted domains",
    "No hardcoded credentials in code samples",

    // Quality gates
    "No TypeScript errors (strict mode)",
    "No ESLint warnings on default ruleset",
    "No dependencies with known CVEs"
  ]
}
Tiered Criteria
For complex deliverables, consider tiered criteria with milestones:
{
  "milestoneWeights": [0.3, 0.3, 0.4],
  "completionCriteria": [
    // Milestone 1 (30%)
    "Requirements document covers all user stories",
    "Architecture diagram includes all components",

    // Milestone 2 (30%)
    "Core functionality implemented and tested",
    "Unit test coverage exceeds 80%",

    // Milestone 3 (40%)
    "All edge cases handled",
    "Documentation complete",
    "Performance benchmarks met"
  ],
  "milestoneCriteria": {
    "1": [0, 1], // First two criteria
    "2": [2, 3], // Next two criteria
    "3": [4, 5, 6] // Final three criteria
  }
}
Criteria for Different Agent Types
Different agent categories require different criteria patterns:
Analysis Agents
- Coverage: "Reviews all N items in input"
- Depth: "Each finding includes root cause analysis"
- Format: "Output matches specified JSON schema"
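The schema criterion is the easiest to automate. In practice a full JSON Schema validator such as Ajv would do this job; the required-field check below is only a minimal sketch of the idea:

```typescript
// Returns the required fields missing from an output object;
// an empty array means the check passes. A real pipeline would use
// a full JSON Schema validator (e.g. Ajv) instead of this sketch.
function missingFields(output: Record<string, unknown>, required: string[]): string[] {
  return required.filter((field) => output[field] === undefined || output[field] === null);
}
```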
Generation Agents
- Quality: "Passes automated linting"
- Completeness: "All required sections present"
- Constraints: "Does not exceed N tokens/words"
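"All required sections present" can likewise be verified without human judgment. A sketch for markdown deliverables, with illustrative section names (the heading-matching logic here is an assumption about the deliverable format):

```typescript
// Scans a markdown deliverable for required top-level sections
// and returns the section headings that are missing.
function missingSections(markdown: string, required: string[]): string[] {
  const headings = markdown
    .split("\n")
    .filter((line) => line.startsWith("#"))
    .map((line) => line.replace(/^#+\s*/, "").trim());
  return required.filter((section) => !headings.includes(section));
}
```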
Transformation Agents
- Accuracy: "Output validates against schema"
- Preservation: "No data loss in transformation"
- Idempotency: "Re-running produces identical output"
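The idempotency criterion is directly testable: apply the transformation to its own output and compare. In the sketch below, `transform` is a placeholder for the agent's transformation step, and JSON serialization stands in for a more robust deep-equality check:

```typescript
// True if applying `transform` to its own output changes nothing.
// JSON.stringify comparison is a simplification; structured
// deep-equality would be more robust in production.
function isIdempotent<T>(transform: (input: T) => T, input: T): boolean {
  const once = transform(input);
  const twice = transform(once);
  return JSON.stringify(once) === JSON.stringify(twice);
}
```

For example, whitespace normalization passes this check, while a transform that appends content on every run does not.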
Testing Your Criteria
Before compiling a Paper, test your criteria by asking:
- Can I imagine a deliverable that satisfies all criteria but is still bad?
- Can I imagine a good deliverable that fails one of these criteria?
- Could two reasonable reviewers disagree on whether a criterion is met?
- Is each criterion verifiable without access to the agent's internals?
Key Takeaways
- Criteria become the Expert Question in disputes - make them binary and verifiable
- Quantify wherever possible: coverage, thresholds, counts, formats
- Reference external standards instead of inventing your own
- Test criteria by imagining edge cases before compilation
Ready to standardize your AI agent contracts?
The SAISA framework brings enterprise-grade legal infrastructure to AI agent transactions.