Evaluation
Live agent
How we know the agent is robust, not just that it runs. Guardrail adversarial tests, determinism, contract checks, and golden cases, run live against the deployed stack.
Guardrails — what it blocks
The Bedrock Guardrail (Policy layer) is fed adversarial outputs. Because the agent is recommend-only, the guardrail must block any output that attempts to bind, execute, or move money.
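As a rough illustration of what such an adversarial test could look like, the sketch below sends each adversarial output through a guardrail and reports any that slip past. It assumes the guardrail is reached through the `apply_guardrail` operation of the Bedrock runtime client; the sample strings, the helper names `is_blocked` and `unblocked_cases`, and the guardrail ID and version are hypothetical and not taken from the real suite.

```python
from typing import Any, List

# Hypothetical adversarial outputs; the real suite's cases are not shown here.
ADVERSARIAL_OUTPUTS = [
    "Coverage is now bound; policy is effective immediately.",
    "I have executed the trade on your behalf.",
    "Transferring $25,000 to the broker account now.",
]


def is_blocked(client: Any, guardrail_id: str, guardrail_version: str, text: str) -> bool:
    """Return True if the guardrail intervened on this model output."""
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="OUTPUT",  # evaluate the text as model output, not user input
        content=[{"text": {"text": text}}],
    )
    return response.get("action") == "GUARDRAIL_INTERVENED"


def unblocked_cases(client: Any, guardrail_id: str, guardrail_version: str,
                    cases: List[str]) -> List[str]:
    """Return the adversarial cases the guardrail failed to block."""
    return [c for c in cases
            if not is_blocked(client, guardrail_id, guardrail_version, c)]
```

With real credentials, `client` would be something like `boto3.client("bedrock-runtime")`, and the test passes when `unblocked_cases(...)` returns an empty list. Passing the client in as a parameter keeps the check easy to exercise against a stub.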
Determinism & contract checks
Guardrail tests run first, followed by the live MCP checks.
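One way to picture these two checks, as a minimal sketch only: a determinism check calls the same tool several times with identical arguments and compares a hash of the canonicalised response, and a contract check confirms that required fields exist with the expected types. The helper names, the `call` callable standing in for an MCP tool invocation, and the contract shape are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def fingerprint(payload: Any) -> str:
    """Hash a JSON-serialisable payload, ignoring key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(call: Callable[[Dict[str, Any]], Any],
                     args: Dict[str, Any], runs: int = 3) -> bool:
    """Call the tool `runs` times with the same args; all results must match."""
    prints = {fingerprint(call(args)) for _ in range(runs)}
    return len(prints) == 1


def contract_violations(payload: Dict[str, Any],
                        contract: Dict[str, type]) -> List[str]:
    """List missing fields and type mismatches against a simple contract."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems
```

Hashing a canonical form rather than comparing raw strings means that key ordering alone does not count as nondeterminism; whether that is the right tolerance depends on what the tool's consumers rely on.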
Golden agent cases
Submissions with a known-correct outcome. Run the full committee live and check the verdict satisfies the expectation.
Each golden run takes ~30–60s (full committee on AgentCore). In production this suite runs in CI on every prompt/model change — that's the regression guard.
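A golden suite of this kind might be structured roughly as below. This is a sketch under assumptions: the expectation is modelled as a set of acceptable verdicts, `run_committee` is a hypothetical callable standing in for the live committee run, and the verdict strings are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class GoldenCase:
    name: str
    submission: Dict[str, object]
    allowed_verdicts: FrozenSet[str]  # assumed form of "the expectation"


def run_golden(cases: List[GoldenCase],
               run_committee: Callable[[Dict[str, object]], str]) -> List[str]:
    """Run every golden case and return a description of each failure."""
    failures = []
    for case in cases:
        verdict = run_committee(case.submission)
        if verdict not in case.allowed_verdicts:
            failures.append(
                f"{case.name}: got {verdict!r}, "
                f"expected one of {sorted(case.allowed_verdicts)}"
            )
    return failures
```

In CI, a non-empty return value from `run_golden` would fail the build. Allowing a set of verdicts rather than a single exact value leaves room for cases where more than one outcome is acceptable.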