March 24, 2026

Why We Built a Workflow-Governed Agent System for Recruiting Automation

There is now a fairly standard implementation pattern behind many so-called “AI agents”: the model receives context, decides whether to invoke a tool, observes the tool result, and repeats until it decides the task is complete. OpenAI’s Agents SDK describes this explicitly as an agent loop, and Anthropic distinguishes these model-directed systems from workflows, where execution paths are predefined in code.

That distinction is important for production systems, because “agentic” and “autonomous” are not interchangeable. A system can use LLM-based routing, reasoning, decomposition, and synthesis while still keeping control flow, state transitions, and side effects outside model-directed execution.

That is the architecture we chose for recruiting automation.

Instead of pursuing a fully autonomous agent, we chose to build a workflow-governed agent system: a multi-stage architecture in which specialized LLM calls perform bounded cognitive tasks, while orchestration, sequencing, and mutation of system state remain code-controlled. This design is closer to what Berkeley AI Research describes as a compound AI system than to a single autonomous agent.

Architecture overview

Our runtime is organized into three stages:

  1. parallel classification
  2. domain execution
  3. response synthesis

Each stage uses LLMs differently, and each stage has a narrow contract.

1) Parallel classification layer

The first layer performs concurrent inference across several specialized classifiers: safety, jailbreak detection, language identification, retrieval eligibility, intent routing, and workflow classification.

This layer is best understood as a combination of routing and parallelization in Anthropic’s workflow taxonomy. The model helps classify the request, but it does not dynamically construct arbitrary downstream plans. The available execution paths are predefined, typed, and enforced by code.

Two engineering properties matter here:

Latency isolation. Since these classifiers run concurrently, end-to-end latency is closer to the slowest classifier than to the sum of all classifier runtimes.

Error isolation. Each classifier has a narrow prompt surface and a single responsibility, which makes failures easier to detect, evaluate, and retrain than a monolithic orchestration prompt.
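The latency property above can be sketched with `asyncio.gather`. This is a minimal illustration, not our production code: the classifier names are the ones listed earlier, but each call is stubbed with a sleep standing in for a real LLM request.

```python
import asyncio
import random
import time

# Each name stands in for one specialized, narrow-prompt LLM classifier.
CLASSIFIERS = ["safety", "jailbreak", "language", "retrieval", "intent", "workflow"]

async def run_classifier(name: str, message: str) -> tuple[str, str]:
    """Stub for a single classifier call; the sleep simulates model latency."""
    await asyncio.sleep(random.uniform(0.05, 0.2))
    return name, f"label-for-{name}"

async def classify(message: str) -> dict[str, str]:
    # gather() runs all classifiers concurrently, so wall-clock time tracks
    # the slowest call rather than the sum of all classifier runtimes.
    results = await asyncio.gather(*(run_classifier(c, message) for c in CLASSIFIERS))
    return dict(results)

if __name__ == "__main__":
    start = time.perf_counter()
    labels = asyncio.run(classify("Tell me about the open backend role"))
    print(labels)
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Because each classifier returns into its own slot of a typed result, a single failing classifier can be retried or defaulted without disturbing the others.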

2) Domain execution layer

After classification, a code-level dispatcher selects the relevant domain handlers.

This layer reflects our clearest design choice: keeping execution bounded rather than model-directed. The model is not deciding, step by step, which tool to call next in an open loop. Instead, execution follows a constrained graph:

  • read-only operations may fan out in parallel
  • state-mutating operations execute sequentially
  • side effects are performed through code-governed handlers
  • outputs are written into typed data structures rather than appended to an unconstrained reasoning transcript

This makes execution semantics more explicit and easier to inspect. The system does not rely on the model to remember hidden state, infer permissible transitions, or decide whether certain checks are required.
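The constrained graph can be sketched as follows. The handler names (`fetch_profile`, `advance_stage`, and so on) are hypothetical stand-ins for real domain logic; what matters is the shape: parallel reads, sequential mutations, typed outputs.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class ExecutionResult:
    """Typed output record, not a free-form reasoning transcript."""
    reads: dict[str, object] = field(default_factory=dict)
    mutations: list[str] = field(default_factory=list)

# Hypothetical handlers; real ones would call services or a database.
async def fetch_profile(candidate_id: str) -> dict:
    return {"id": candidate_id, "stage": "screening"}

async def fetch_job_details(job_id: str) -> dict:
    return {"id": job_id, "title": "Backend Engineer"}

async def advance_stage(result: ExecutionResult) -> None:
    result.mutations.append("stage: screening -> interview")

async def execute(candidate_id: str, job_id: str) -> ExecutionResult:
    result = ExecutionResult()
    # Read-only operations fan out in parallel: no ordering constraints.
    profile, job = await asyncio.gather(
        fetch_profile(candidate_id), fetch_job_details(job_id)
    )
    result.reads = {"profile": profile, "job": job}
    # State-mutating operations run sequentially, via code-governed handlers.
    await advance_stage(result)
    return result
```

The dispatcher, not the model, decides which of these coroutines run and in what order; the model's outputs only populate the typed fields.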

For recruiting workflows, that matters because the runtime is not just answering questions. It is also participating in a stateful business process: collecting required information, validating completion criteria, applying stage logic, and triggering downstream actions.

3) Response synthesis layer

The final stage takes structured outputs from prior stages and renders a user-facing response.

This stage is intentionally separated from execution. Its role is linguistic, not operational: translate workflow state into a clear conversational response; adapt tone and phrasing; preserve multilingual quality; and explain next steps.

It does not:

  • choose new execution paths
  • mutate workflow state
  • reinterpret transition rules
  • bypass required process steps

That separation of concerns is one of the main advantages of the architecture. It allows the system to benefit from LLM fluency without giving the response model authority over control flow.

Why we did not use a fully autonomous agent loop

The main reason is that for the kinds of recruiting workflows we care about, unconstrained model-directed execution introduces the wrong tradeoffs.

1) Process fidelity is more important than conversational initiative

In a general assistant, initiative can be a feature. In recruiting, it can become a defect.

A screening flow often contains required questions, specific evaluation steps, mandatory disclosures, and deterministic transition conditions. A fully autonomous agent may infer that a question is redundant because the candidate already mentioned something adjacent to it. That may be conversationally efficient, but it can violate standardization requirements or downstream scoring assumptions.

A workflow-governed system is designed to reduce that class of error: the model may adapt how a step is communicated, but not whether the step exists.

2) Stateful execution is easier to reason about in code than in prompt state

State-heavy processes degrade quickly when too much execution logic is delegated to conversational context. In a long-running agent loop, the system must continually preserve and reinterpret latent state across turns and tool results.

By contrast, a typed workflow architecture externalizes state:

  • progress is explicit
  • transition conditions are explicit
  • side effects are explicit
  • failure recovery is explicit

That makes the system easier to test, easier to audit, and easier to modify.
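Externalized state in this sense can be as simple as a transition table plus a typed state object. A minimal sketch, with hypothetical stage names and completion criteria:

```python
from dataclasses import dataclass
from enum import Enum

class Stage(str, Enum):
    INTAKE = "intake"
    SCREENING = "screening"
    SCHEDULING = "scheduling"
    DONE = "done"

# Permissible transitions are data, not something inferred from chat history.
ALLOWED = {
    Stage.INTAKE: {Stage.SCREENING},
    Stage.SCREENING: {Stage.SCHEDULING},
    Stage.SCHEDULING: {Stage.DONE},
    Stage.DONE: set(),
}

@dataclass
class WorkflowState:
    stage: Stage
    answered: set[str]
    required: set[str]

    def can_advance(self) -> bool:
        # Completion criteria are explicit and unit-testable.
        return self.required <= self.answered

    def advance(self, target: Stage) -> None:
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"illegal transition {self.stage} -> {target}")
        if not self.can_advance():
            raise ValueError("required fields still missing")
        self.stage = target
```

Every property the bullet list names is visible here: progress (`answered`), transition conditions (`ALLOWED`, `can_advance`), and failure modes (`ValueError`) all live in code, not in a conversational context window.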

3) Reliability still drops sharply in realistic enterprise tasks

Recent benchmark evidence suggests that realistic multi-step enterprise execution remains difficult for frontier systems. In EnterpriseOps-Gym, a March 2026 benchmark covering 1,150 expert-curated tasks across eight enterprise domains, the best reported model reached 37.4% success. That result highlights the gap between impressive local agent behaviors and dependable end-to-end task completion in production-style environments.

The lesson is not that agentic techniques are useless. It is that long-horizon model-directed execution introduces many failure surfaces at once: decomposition, action choice, parameter selection, result interpretation, policy adherence, and state consistency.

Our architecture narrows that error surface by assigning different responsibilities to different components and keeping the most sensitive parts of execution outside autonomous control.

Why this matters specifically in recruiting

Recruiting automation differs from general-purpose assistance in one important way: the conversational participant is not the only stakeholder.

The recruiter, employer, or process owner defines:

  • required steps
  • evaluation criteria
  • transition rules
  • compliance boundaries
  • acceptable automation behavior

The candidate interacts with the system, but does not define its operating semantics.

That makes recruiting an example of process-governed automation, not pure user-driven assistance.

In that setting, giving the model broad autonomy can create a mismatch between conversational optimization and process correctness. A model may optimize for brevity, empathy, or local coherence while violating requirements that matter to the actual system owner.

A workflow-governed agent system resolves that by separating:

  • what must happen — encoded in workflow logic
  • how it is communicated — handled by LLMs

Safety and compliance are easier to enforce structurally

Another reason we preferred workflow control is that safety checks can be mandatory pipeline steps rather than optional model decisions.

Moderation, jailbreak detection, policy classification, and scope validation all run because the runtime executes them — not because the model chooses to. That can make prompt injection less effective, since the model does not own the decision of whether protective checks run.
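Structurally, this means the checks sit in code before any model routing. The check implementations below are trivial placeholders; in practice each would call a moderation model or policy service.

```python
# Placeholder checks; real ones would call a moderation model or policy service.
def moderation_check(message: str) -> bool:
    return "forbidden" not in message.lower()

def scope_check(message: str) -> bool:
    return len(message) < 2000

MANDATORY_CHECKS = [moderation_check, scope_check]

def route_to_model(message: str) -> str:
    return f"model handles: {message}"

def handle_request(message: str) -> str:
    # The runtime, not the model, decides that these run. A prompt-injection
    # payload cannot talk its way past a check it never gets to vote on.
    for check in MANDATORY_CHECKS:
        if not check(message):
            return "request rejected by pipeline policy"
    return route_to_model(message)
```

Adding a new compliance requirement means appending to `MANDATORY_CHECKS`, which is reviewable in a diff; it does not depend on a prompt edit the model might ignore.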

This is particularly relevant in hiring. Under the EU AI Act, AI systems used for recruitment or selection of natural persons are explicitly classified as high-risk in Annex III, and the main obligations for those systems apply from 2 August 2026. In a high-risk context, system properties like traceability, human oversight, and technical documentation become architectural concerns, not just policy aspirations.

A workflow-based architecture helps because it produces explicit intermediate artifacts:

  • classifier outputs
  • selected execution path
  • collected structured data
  • triggered rules
  • resulting state transition

That is closer to an auditable system record than a free-form interleaving of reasoning and tool traces.
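Those artifacts map naturally onto a per-turn record. A sketch, with illustrative field values; the point is that every item in the list above serializes into one inspectable object.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TurnRecord:
    """One auditable record per turn, covering each intermediate artifact."""
    classifier_outputs: dict[str, str]
    execution_path: str
    collected_data: dict[str, str]
    triggered_rules: list[str]
    state_transition: str

record = TurnRecord(
    classifier_outputs={"intent": "answer_screening_question"},
    execution_path="screening.collect_answer",
    collected_data={"notice_period": "2 weeks"},
    triggered_rules=["required_field_completed"],
    state_transition="screening -> screening",
)
# asdict() makes the whole trace JSON-serializable for an audit log.
print(json.dumps(asdict(record), indent=2))
```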

Tradeoffs

This architecture is intentionally less flexible than a fully autonomous agent.

It cannot improvise arbitrary new capabilities outside the defined execution graph. If a capability is not represented in the workflow, the system will not invent it. That is a limitation, but in a regulated, stateful, business-process domain, it is often the right limitation.

The upside is stronger guarantees around:

  • process fidelity
  • state consistency
  • safety enforcement
  • auditability
  • operational predictability

For recruiting automation, we have found those properties more valuable than unconstrained model autonomy.

Conclusion

We did not reject agentic techniques. We use them extensively — for routing, classification, synthesis, and bounded decision-making. What we rejected was fully autonomous control over execution.

The result is an architecture where LLMs contribute intelligence inside narrow interfaces, while code retains authority over workflow structure, state mutation, and side effects. For recruiting automation, that has been a better engineering fit for us than a single autonomous agent loop.