Evaluation

Live agent

How we know the agent is robust — not just that it runs. Guardrail adversarial tests, determinism, contract checks, and golden cases, run live against the deployed stack.

Live evaluation suite

streams against the real Bedrock Guardrail + the deployed MCP tools — each test appears as it runs

running 0/…

Guardrails — what it blocks

The Bedrock Guardrail (Policy layer) is fed adversarial outputs. Recommend-only means it must block any attempt to bind, execute, or move money.

starting guardrail tests…

Determinism & contract checks

guardrails first, then live MCP checks…

Golden agent cases

Submissions with a known-correct outcome. Run the full committee live and check the verdict satisfies the expectation.

Verified cedent, in-appetiteexpect ACCEPT / CONDITIONAL

Munich Re verifies in GLEIF; modest FL layer → should NOT refer/decline

Over-appetite concentrationexpect CONDITIONAL / DECLINE / REFER

Large JP-quake layer pushes the zone past appetite → must NOT be a clean ACCEPT

Unverified cedent (KYC gate)expect REFER

Fictional cedent not in GLEIF → eligibility hard gate → REFER

Each golden run takes ~30–60s (full committee on AgentCore). In production this suite runs in CI on every prompt/model change — that's the regression guard.

Why this is robust

· Numbers can't be hallucinated — they come from deterministic tools; the LLM owns only judgment.

· Guardrail is a hard gate — the agent literally cannot bind/execute (no such tool) and the Policy layer blocks any claim it could.

· Every decision is traced (Observability) and persisted (audit trail), and a human approves.

Honest gap: this is a starter eval set. A production suite needs hundreds of labelled cases + drift monitoring — that's a discovery-phase deliverable.