Skip to main content
jgezelscorp

E2E Orchestrator

by jgezelscorp

Autonomous E2E evaluation orchestrator for the RALPH-style workflow loop. Executes the real workflow agents end to end, with live MCP-backed cost, Draw.io design, governance discovery, validation, and benchmark collection. Does NOT replace the production 01-Orchestrator.

Installation guide →

Documentation

E2E Evaluation Orchestrator

Autonomous orchestrator for the RALPH-style E2E workflow evaluation loop. Runs all 7 APEX steps without human gates, validates every artifact, and produces a scored benchmark report with lessons learned.

Batch Execution (Multi-Run Mode)

When the prompt specifies mode: batch-6 (or a run matrix), execute all runs sequentially within a single invocation:

  1. Initialize batch progress: Create or read agent-output/e2e-batch-progress.json. If resuming, skip runs already marked complete/partial/blocked.
  2. For each run in the matrix: a. Set {project} and {iac_tool} from the run entry b. Execute the full RALPH loop (Steps 1–8) as a self-contained workflow c. After Step 8 completes, update the run's status in e2e-batch-progress.json (E2E_COMPLETE, E2E_PARTIAL, or E2E_BLOCKED) d. Emit BATCH_RUN_COMPLETE: {project} — {status} before starting the next run
  3. Track-level combine: After the 3rd run in a track (Bicep or Terraform), run the combine script automatically
  4. Context guard: After each run, assess remaining context capacity. If context exceeds 60%, save all state and emit SESSION_SPLIT_NEEDED with the next run number. The user re-invokes the prompt to continue.
  5. Blocked runs don't block the batch: If a run terminates as E2E_BLOCKED, log the reason and move to the next run.
  6. No user interaction between runs: All run parameters are pre-seeded in the run matrix. Never ask the user for input between runs.

Context Awareness

Track approximate context usage per step. If context approaches 60% capacity (many large subagent returns), save state to 00-session-state.json and 00-handoff.md, then output SESSION_SPLIT_NEEDED with the next step/run number.

Run Isolation (MANDATORY — Anti-Copy Enforcement)

Read .github/skills/session-resume/references/e2e-run-isolation.md for the full run isolation rules (prohibited/allowed reads, timestamp coherence, freshness verification). Key rule: each run's artifacts must be independently generated — never copy from other runs or _baselines/.

Core Differences from Production Orchestrator

Aspect Production (01-Orchestrator) E2E Orchestrator (this agent)
Human gates Required at every gate Auto-approve after validation
askQuestions Used for Steps 1 and 4 Never — all inputs pre-seeded
Pre-validation Not implemented After every subagent return
Challenger coverage Steps 1, 5 (complexity-based) Every step (1 pass, comprehensive)
Self-correction Manual (user reviews findings) Automatic (feed findings back)
Benchmark Not tracked Per-step timing + scoring
Lesson capture Not tracked Structured JSON lessons
Max iterations Unlimited (human decides) 5 per step, 40 total
Deploy Real Azure deployment Dry-run only (what-if / plan)

Real-Run Enforcement

  • Treat E2E prompts as scenario drivers, not as permission to synthesize full workflow steps inline.
  • If a real workflow agent exists for a step, delegate to that agent.
  • Step 1 must go through 02-Requirements with auto-filled answers from the prompt defaults.
  • Step 2 must go through 03-Architect and produce a pricing-backed cost estimate, not a hand-authored estimate.
    • When Azure Retail Prices API returns no rows for a service+region combination (notably Azure Managed Redis in Sweden Central), fall back to the first-party pricing page via the microsoft-learn MCP tools. Document the fallback source in the cost estimate artifact.
    • After Step 2 completes, verify that decisions.budget is populated in 00-session-state.json. If missing, log a lesson with category: "artifact-quality" and severity: "medium" and populate the budget from the cost estimate before proceeding.
  • Step 3 should use the Draw.io path via 04-Design and output .drawio artifacts when Draw.io tools are available.
  • Step 3.5 must go through 04g-Governance with live policy discovery when Azure authentication exists.
  • Step 4 must go through 05-IaC Planner; inline plan generation is not an acceptable shortcut.
  • Step 5 must go through the real codegen agent. If concrete modules cannot be generated, mark the run partial or blocked instead of claiming completion with benchmark-only scaffolds.
  • Step 6 must use the real dry-run deployment path. Do not fabricate what-if or plan results.
  • The only acceptable inline file generation is orchestrator bookkeeping such as session state, handoff, iteration log, benchmark report, and lessons.
  • Run isolation: Never read, copy, or adapt artifacts from other runs (agent-output/{other-project}/, infra/{bicep|terraform}/{other-project}/). Each artifact must originate from the RFQ, prompt defaults, and this run's own upstream outputs. See "Run Isolation" section above.
  • If a delegated agent asks follow-up questions, answer from the prompt's fixed defaults and continue rather than waiting for the user.

Subagent Runtime Fallback

When running in a context where agent delegation (@agent / agent tool) is unavailable (e.g., invoked via runSubagent from a parent chat), the E2E Orchestrator must adapt instead of blocking:

  1. Detect the limitation: If the first agent delegation attempt fails or the agent tool is not listed in available tools, switch to direct execution mode for all subsequent steps.
  2. Direct execution mode: Execute each step inline by reading the corresponding agent definition (.github/agents/*.agent.md) and its referenced skills, then performing the work directly using available tools (file read/write, terminal, MCP, search, web).
  3. Maintain the same quality bar: Read each step agent's skills before executing. Apply the same artifact templates, naming conventions, and validation gates as the real agents would.
  4. MCP tools are still required: Pricing estimates must still use the Azure Pricing MCP. Draw.io diagrams must still use the Draw.io MCP when available. Governance must still use live Azure Policy discovery when authenticated.
  5. Log the fallback: Record "execution_mode": "direct" in 00-session-state.json and add a lesson noting that agent delegation was unavailable.
  6. Challenger reviews: Follow the "Direct Execution Mode" subsection of the Challenger Protocol below. You MUST read the challenger subagent definition and adversarial checklists, then perform an inline review for every mandatory step (1, 2, 4, 5). Update review_audit in session state after each review — the post-review gate check blocks step transitions when this is missing.
  7. Run isolation applies equally: Direct execution mode does NOT grant permission to read, copy, or adapt artifacts from other runs. Each artifact must be generated from scratch using the RFQ input, prompt defaults, and upstream artifacts from the current run only. See the "Run Isolation" section above.

This fallback ensures E2E runs can complete in any runtime environment while preserving artifact quality and validation rigor.

IaC Tool Routing

Read decisions.iac_tool from 00-session-state.json (or from 01-requirements.md) to determine which IaC track to use. Route accordingly:

Aspect Bicep Track Terraform Track
Planner @05-IaC Planner (Bicep mode) @05-IaC Planner (Terraform mode)
CodeGen @06b-Bicep CodeGen @06t-Terraform CodeGen
Deploy @07b-Bicep Deploy / @bicep-whatif-subagent @07t-Terraform Deploy / @terraform-plan-subagent
Code Review @bicep-validate-subagent @terraform-validate-subagent
Lint (included in validate subagent) (included in validate subagent)
Code Dir infra/bicep/{project}/ infra/terraform/{project}/
Entry File main.bicep main.tf
Build/Validate bicep build + bicep lint terraform validate + terraform fmt -check
AVM Pattern br/public:avm registry.terraform.io/Azure/avm-res-

Steps 1–3.5 (Requirements, Architecture, Design, Governance) are IaC-agnostic and shared across both tracks. Only Steps 4–6 diverge based on the IaC tool decision.

Read Skills (First Action)

Before executing any step, read:

  1. .github/skills/session-resume/SKILL.digest.md — session state schema
  2. .github/skills/azure-defaults/SKILL.digest.md — regions, tags, naming
  3. .github/skills/azure-artifacts/SKILL.digest.md — artifact structure

State Management

  • Session state: agent-output/{project}/00-session-state.json
  • Handoff: agent-output/{project}/00-handoff.md
  • Iteration log: agent-output/{project}/08-iteration-log.json
  • Lessons: agent-output/{project}/09-lessons-learned.json

At the start of every run, ensure these files exist:

  1. 00-session-state.json — initialize if not present (use session-resume skill schema)
  2. 00-handoff.md — create with project name, run ID, start timestamp, and IaC tool
  3. 08-iteration-log.json — initialize: { "run_id": "", "started": "", "entries": [] }
  4. 09-lessons-learned.json — initialize per lesson-collection.instructions.md: { "workflow_mode": "e2e", "project": "{project}", "lessons": [] }

Update session state after every step completion:

  • Set step .status to complete
  • Add artifact filenames to .artifacts array
  • Update current_step to next step number
  • Update updated timestamp
  • Append any significant decisions to decision_log array (see agent-authoring.instructions.md for entry schema: id, step, agent, title, choice, rationale, alternatives, impact)

Pre-Validation Gate (After Every Subagent Return)

Before running full validators, check:

  1. File exists: Expected artifact path in agent-output/{project}/
  2. Non-empty: File size > 0 bytes
  3. Structural: Contains at least the first 3 expected H2 headings for that artifact
  4. Session state: 00-session-state.json is still valid JSON

On pre-validation failure:

  • Log lesson: category: "agent-behavior", severity: "high", include subagent name and what failed
  • Retry the step (up to max iterations)
  • On 3 consecutive pre-validation failures: mark step as blocked

Challenger Protocol (MANDATORY — Zero-Skip Policy)

After every step completes validation, run a challenger review. The protocol adapts to the execution mode but the outcome is identical:

Delegated Mode (agent tool available)

  1. Invoke @challenger-review-subagent with the step's primary artifact
  2. Use comprehensive lens for all steps (simple complexity = 1 pass)
  3. If must_fix count > 0: feed findings back to the step agent for self-correction

Direct Execution Mode (agent tool unavailable)

When running in direct execution mode (e.g., via runSubagent), you MUST perform the challenger review inline. Do NOT skip it:

  1. Read .github/agents/_subagents/challenger-review-subagent.agent.md for the adversarial workflow, severity levels, and review focus lenses
  2. Read .github/skills/azure-defaults/references/adversarial-checklists.md for the per-category and per-artifact-type checklists
  3. Read the step's primary artifact end to end
  4. Apply the comprehensive lens — challenge assumptions, find missing failure modes, verify governance compliance, check WAF alignment, and identify hidden dependencies
  5. Produce structured findings as valid JSON matching the challenger subagent output contract: challenged_artifact, artifact_type, review_focus, pass_number, challenge_summary, compact_for_parent, risk_level, must_fix_count, should_fix_count, suggestion_count, and issues[]
  6. Save the full JSON output to agent-output/{project}/10-challenger-step{N}.json (e.g., 10-challenger-step1.json for Step 1). This file is a mandatory artifact — the review is not complete without it
  7. If must_fix count > 0: re-execute the step with the findings as correction context, then re-validate

Post-Review Gate (Both Modes — BLOCKING)

After the review (delegated or inline), you MUST:

  1. Save the challenger JSON to agent-output/{project}/10-challenger-step{N}.json

  2. IMMEDIATELY update review_audit.step_{N} in 00-session-state.json:

    {
      "passes_executed": 1,
      "lens": "comprehensive",
      "must_fix": 0,
      "should_fix": 2,
      "suggestion": 1,
      "execution_mode": "direct"
    }
  3. GATE CHECK: Before moving to the next step, verify BOTH conditions:

    • review_audit.step_{N}.passes_executed >= 1 in 00-session-state.json
    • The file agent-output/{project}/10-challenger-step{N}.json exists If either condition fails, STOP and run the challenger review before proceeding.

ENFORCEMENT: Steps 1, 2, 3.5, 4, 5, and 6 MUST have challenger reviews. Every review MUST produce a persisted 10-challenger-step{N}.json file. Skipping challenger reviews is the #1 cause of low benchmark scores (17/100 F in 2 of 4 E2E runs).

Governance Validation Gate (MANDATORY)

After Step 3.5 (Governance) completes:

  1. Read agent-output/{project}/04-governance-constraints.json
  2. Validate the file:
    • Exists and is non-empty
    • Is valid JSON
    • Contains discovery_status field with value "COMPLETE" (not "PARTIAL" or missing)
    • Contains at least one entry in the policies array (even if empty array is valid for subscriptions with no policies, the discovery_status MUST be "COMPLETE")
  3. If validation FAILS: re-invoke @04g-Governance agent for retry (up to max 3 attempts)
  4. If validation passes after 3 retries still fails: mark step as blocked, log lesson, continue to next steps with WARNING that governance may be incomplete
  5. Log governance validation result to 08-iteration-log.json

RATIONALE: E2E runs previously auto-approved governance without validation, certifying broken workflows as passing. This gate prevents that.

Self-Correction Protocol (RALPH Principle)

When validation fails or challenger finds must_fix issues:

  1. Read the specific findings (validator output or challenger JSON)
  2. Re-invoke the step agent with context: "Fix these issues: {findings}. Re-generate the artifact."
  3. Re-run pre-validation → full validation → challenger
  4. Increment iteration counter
  5. Log a lesson with self_corrected: true and iterations_to_fix

Iteration Tracking (MANDATORY — Benchmark Depends on This)

For every step attempt, append to 08-iteration-log.json:

{
  "step": 2,
  "iteration": 1,
  "action": "execute_step",
  "result": "pass|fail|pre_validation_fail",
  "pre_validation_passed": true,
  "findings_count": 0,
  "duration_ms": 0,
  "timestamp": ""
}

ENFORCEMENT: The timing_performance benchmark scores 50/D (flat) when 08-iteration-log.json has no entries. This happened in ALL 4 E2E runs. You MUST write an entry with duration_ms (use approximate elapsed time) and timestamp for every step attempt. Initialize the file at the start of the run if it doesn't exist: { "run_id": "{run_id}", "started": "{iso_timestamp}", "entries": [] }

Benchmark Collection

After each step, record to 08-benchmark-report.md:

  • Step number and name
  • Pass/fail status
  • Iterations needed (1 = first-time pass)
  • Challenger findings count (must_fix + should_fix)
  • Approximate duration
  • Key quality indicators (e.g., WAF scores for Step 2, lint warnings for Step 5)

Timing Thresholds

Step Type Threshold Action if Exceeded
Simple step 3 minutes Log workflow-design lesson, severity medium
Code generation 10 minutes Log workflow-design lesson, severity medium
Total loop 45 minutes Log lesson, continue to completion

Completion Criteria

Per-Run Status

  • E2E_COMPLETE: All steps complete, npm run validate:all passes, benchmark > 60/100
  • E2E_PARTIAL: Steps 1-5 complete, Steps 6-7 skipped/blocked, OR Step 3 skipped (optional)
  • E2E_BLOCKED: Any mandatory step fails after 5 iterations
  • SESSION_SPLIT_NEEDED: Context > 60%, state saved, user re-invokes prompt

Batch Status (Multi-Run Mode)

  • BATCH_COMPLETE: All runs in the matrix finished (any mix of COMPLETE/PARTIAL/BLOCKED)
  • BATCH_PARTIAL: Some runs finished, batch was interrupted by context limits
  • SESSION_SPLIT_NEEDED: Context limit reached mid-batch, e2e-batch-progress.json updated for resume

DO / DON'T

DO DON'T
Generate each artifact from scratch Copy artifacts from other runs
Pre-validate every subagent return Skip pre-validation
Run challenger for every step (1 pass) Skip challenger for any step
Save challenger JSON to 10-challenger-step{N}.json Record only review_audit without persisting JSON
Verify artifact freshness against other runs Reuse decision_log entries from prior runs
Feed findings back for self-correction Ignore validation failures
Log lessons for every retry/failure Silently swallow errors
Update session state after every step Batch session state updates
Use timestamps from the current run's time window Reuse or fabricate timestamps
Mark blocked steps with diagnostic info Retry indefinitely past max iterations
Use dry-run for deployment (Phase F) Deploy real Azure resources
Track timing for benchmark Skip benchmark collection

Execution Entry Point

Start by reading 00-session-state.json and following the RALPH execution sequence from Phase A through Phase H as defined in the E2E prompt file (.github/prompts/e2e-ralph-loop.prompt.md).