Writing Acceptance Criteria for AI Agents
Machine-readable completion criteria that become the Expert Question in disputes. How to write criteria that agents can satisfy and reviewers can verify.
The completionCriteria array in your Execution Manifest is not just documentation - it becomes the Expert Question if a dispute arises. Write vague criteria, get vague disputes. Write specific, verifiable criteria, get binary answers.
The Anatomy of Good Criteria
Every criterion should be:
- Binary - Either satisfied or not. No "mostly satisfied."
- Verifiable - A reviewer can check it against the deliverable.
- Specific - Quantified where possible.
- Complete - Captures all essential requirements.
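One way to screen draft criteria against these properties is a quick lint for subjective wording. The helper below is a hypothetical sketch, not part of any SAISA tooling, and its term list is illustrative rather than exhaustive:

```typescript
// Hypothetical draft-stage lint: flags subjective words that make a
// criterion non-binary. The term list is illustrative, not exhaustive.
const VAGUE_TERMS = ["good", "relevant", "appropriate", "reasonable", "sufficient", "adequate"];

function vagueTerms(criterion: string): string[] {
  const words = criterion.toLowerCase().split(/\W+/);
  return VAGUE_TERMS.filter((term) => words.includes(term));
}
```

Running this over "Provide a good security analysis" flags "good", while a quantified criterion such as "Each finding includes CVSS 3.1 base score" comes back clean.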
Bad vs. Good Criteria
Bad Criteria
{
  "completionCriteria": [
    "Provide a good security analysis",
    "Cover all relevant vulnerabilities",
    "Make recommendations"
  ]
}
What is "good"? What is "relevant"? These are subjective and unverifiable.
Good Criteria
{
  "completionCriteria": [
    "Report identifies all OWASP Top 10 2021 vulnerability categories",
    "Each finding includes CVSS 3.1 base score",
    "Each finding includes reproduction steps",
    "Remediation recommendations include code examples",
    "Executive summary does not exceed 500 words"
  ]
}
Each criterion is binary and verifiable. A reviewer can check: Did it cover all 10? Is there a CVSS score? Are there code examples?
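Criteria like the word limit above can even be checked mechanically. A minimal sketch, assuming the deliverable exposes its executive summary as a string (the parameter name is an assumption, not a SAISA convention):

```typescript
// Checks "Executive summary does not exceed 500 words".
// The `executiveSummary` parameter name is an assumption about
// how the deliverable is structured.
function summaryWithinLimit(executiveSummary: string, maxWords = 500): boolean {
  const words = executiveSummary.trim().split(/\s+/).filter(Boolean);
  return words.length <= maxWords;
}
```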
Quantification Patterns
Whenever possible, quantify your criteria:
{
  "completionCriteria": [
    // Coverage criteria
    "Analysis covers all 14 AWS services listed in Exhibit A",
    "Review includes all 47 API endpoints in the OpenAPI spec",

    // Format criteria
    "Each section includes at least 3 actionable recommendations",
    "Report contains minimum 10 code examples",

    // Threshold criteria
    "Achieves 85%+ accuracy on validation dataset",
    "Response latency under 200ms for 95th percentile",

    // Completeness criteria
    "All fields in output schema are populated",
    "No placeholder text remains in deliverables"
  ]
}
Reference External Standards
Instead of inventing your own standards, reference existing ones:
{
  "completionCriteria": [
    // Reference industry standards
    "Compliant with SOC 2 Type II Trust Services Criteria",
    "Follows OWASP ASVS 4.0 Level 2 requirements",
    "Adheres to NIST 800-53 Rev 5 control families",

    // Reference exhibit documents
    "Satisfies all requirements in Exhibit A (RFP)",
    "Matches format specified in Exhibit B (Template)",

    // Reference previous work
    "Consistent with style guide in Paper #ABC123"
  ]
}
Negative Criteria
Sometimes what you don't want is as important as what you do:
{
  "completionCriteria": [
    // Exclusion criteria
    "No personally identifiable information in output",
    "No external API calls to non-whitelisted domains",
    "No hardcoded credentials in code samples",

    // Quality gates
    "No TypeScript errors (strict mode)",
    "No ESLint warnings on default ruleset",
    "No dependencies with known CVEs"
  ]
}
Tiered Criteria
For complex deliverables, consider tiered criteria with milestones:
{
  "milestoneWeights": [0.3, 0.3, 0.4],
  "completionCriteria": [
    // Milestone 1 (30%)
    "Requirements document covers all user stories",
    "Architecture diagram includes all components",

    // Milestone 2 (30%)
    "Core functionality implemented and tested",
    "Unit test coverage exceeds 80%",

    // Milestone 3 (40%)
    "All edge cases handled",
    "Documentation complete",
    "Performance benchmarks met"
  ],
  "milestoneCriteria": {
    "1": [0, 1], // First two criteria
    "2": [2, 3], // Next two criteria
    "3": [4, 5, 6] // Final three criteria
  }
}
Criteria for Different Agent Types
Different agent categories require different criteria patterns:
Analysis Agents
- Coverage: "Reviews all N items in input"
- Depth: "Each finding includes root cause analysis"
- Format: "Output matches specified JSON schema"
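The schema criterion is the easiest to automate. In practice a full JSON Schema validator such as Ajv would do this job; the required-field check below is only a minimal sketch of the idea:

```typescript
// Returns the required fields missing from an output object;
// an empty array means the check passes. A real pipeline would use
// a full JSON Schema validator (e.g. Ajv) instead of this sketch.
function missingFields(output: Record<string, unknown>, required: string[]): string[] {
  return required.filter((field) => output[field] === undefined || output[field] === null);
}
```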
Generation Agents
- Quality: "Passes automated linting"
- Completeness: "All required sections present"
- Constraints: "Does not exceed N tokens/words"
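"All required sections present" can likewise be verified without human judgment. A sketch for markdown deliverables, with illustrative section names (the heading-matching logic here is an assumption about the deliverable format):

```typescript
// Scans a markdown deliverable for required top-level sections
// and returns the section headings that are missing.
function missingSections(markdown: string, required: string[]): string[] {
  const headings = markdown
    .split("\n")
    .filter((line) => line.startsWith("#"))
    .map((line) => line.replace(/^#+\s*/, "").trim());
  return required.filter((section) => !headings.includes(section));
}
```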
Transformation Agents
- Accuracy: "Output validates against schema"
- Preservation: "No data loss in transformation"
- Idempotency: "Re-running produces identical output"
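The idempotency criterion is directly testable: apply the transformation to its own output and compare. In the sketch below, `transform` is a placeholder for the agent's transformation step, and JSON serialization stands in for a more robust deep-equality check:

```typescript
// True if applying `transform` to its own output changes nothing.
// JSON.stringify comparison is a simplification; structured
// deep-equality would be more robust in production.
function isIdempotent<T>(transform: (input: T) => T, input: T): boolean {
  const once = transform(input);
  const twice = transform(once);
  return JSON.stringify(once) === JSON.stringify(twice);
}
```

For example, whitespace normalization passes this check, while a transform that appends content on every run does not.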
Testing Your Criteria
Before compiling a Paper, test your criteria by asking:
- Can I imagine a deliverable that satisfies all criteria but is still bad?
- Can I imagine a good deliverable that fails one of these criteria?
- Could two reasonable reviewers disagree on whether a criterion is met?
- Is each criterion verifiable without access to the agent's internals?
Key Takeaways
- Criteria become the Expert Question in disputes - make them binary and verifiable
- Quantify wherever possible: coverage, thresholds, counts, formats
- Reference external standards instead of inventing your own
- Test criteria by imagining edge cases before compilation
Ready to standardize your AI agent contracts?
The SAISA framework brings enterprise-grade legal infrastructure to AI agent transactions.