A surprisingly hard challenge in building AI agents is putting a human in the loop.
If you want your agent to perform critical tasks in production, it probably needs to wait for human approval. Because real people are involved, approval doesn't always happen instantly, so agents need to be able to wait hours or days for human intervention, then resume quickly when it arrives.
This creates a new reliability problem: agents now run for hours or days instead of seconds or minutes. The longer an agent runs, the more likely it is to be interrupted (server maintenance, code upgrade, process crash) while waiting. For agents to be usable in production, they need to recover automatically from these interruptions and resume where they left off.
Durable workflows can help make long-running, human-in-the-loop agents resilient to failure. The idea is to checkpoint an agent’s progress in a database so that if the agent is interrupted, it can recover and resume from its last checkpoint.
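To make the checkpointing idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration under stated assumptions, not how any particular durable workflow library is implemented: the `checkpoints` table, `run_step`, and `agent_workflow` are hypothetical names, and step outputs are assumed to be JSON-serializable.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) checkpoints table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " workflow_id TEXT, step INTEGER, output TEXT,"
        " PRIMARY KEY (workflow_id, step))"
    )
    return conn


def run_step(conn, workflow_id, step, fn):
    """Run fn at most once per (workflow_id, step).

    If a checkpoint already exists, return the saved output instead of
    executing fn again.
    """
    row = conn.execute(
        "SELECT output FROM checkpoints WHERE workflow_id = ? AND step = ?",
        (workflow_id, step),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # step finished before the interruption
    result = fn()
    conn.execute(
        "INSERT INTO checkpoints VALUES (?, ?, ?)",
        (workflow_id, step, json.dumps(result)),
    )
    conn.commit()
    return result


def agent_workflow(conn, workflow_id, log):
    """A toy two-step agent; log records which steps really executed."""

    def plan():
        log.append("plan")
        return "refund order 123"

    def execute():
        log.append("execute")
        return "done"

    action = run_step(conn, workflow_id, 0, plan)
    status = run_step(conn, workflow_id, 1, execute)
    return action, status
```

Calling `agent_workflow` a second time with the same `workflow_id` simulates recovery: both steps return their saved outputs and neither function runs again. A step that crashes after running but before its checkpoint commits would be re-executed in this sketch, so steps should be safe to retry.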
To handle human-in-the-loop specifically, we can use a database-backed messaging system where an agent awaits a notification delivered through the database. When the agent first starts waiting, it checkpoints a timeout. If the agent is interrupted, it recovers from its checkpoints and continues waiting until the original timeout. When a human approves the agent's action, the approval message is written to a database table, so that once the agent has recovered and is ready, it can read the message and continue execution. That way, an agent can run for days waiting for human approval and be ready to go as soon as it arrives.
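The waiting pattern described above can be sketched the same way. This is a simplified illustration, again using `sqlite3`: the `send` and `recv` functions and the `deadlines` and `messages` tables are hypothetical, the agent polls rather than being notified, and messages are read without being consumed so that a recovered agent sees the same approval again.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the (hypothetical) tables for deadlines and messages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deadlines ("
        " workflow_id TEXT, topic TEXT, deadline REAL,"
        " PRIMARY KEY (workflow_id, topic))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " workflow_id TEXT, topic TEXT, body TEXT)"
    )
    return conn


def send(conn, workflow_id, topic, body):
    """Called on behalf of the human: persist the approval message."""
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?)", (workflow_id, topic, body)
    )
    conn.commit()


def recv(conn, workflow_id, topic, timeout_s, poll_s=1.0, now=time.time):
    """Wait for a message, returning its body, or None once the deadline passes."""
    # Checkpoint an absolute deadline on the first wait only. After a
    # restart, INSERT OR IGNORE leaves the original deadline untouched,
    # so the agent keeps waiting toward the same point in time.
    conn.execute(
        "INSERT OR IGNORE INTO deadlines VALUES (?, ?, ?)",
        (workflow_id, topic, now() + timeout_s),
    )
    conn.commit()
    deadline = conn.execute(
        "SELECT deadline FROM deadlines WHERE workflow_id = ? AND topic = ?",
        (workflow_id, topic),
    ).fetchone()[0]

    while True:
        row = conn.execute(
            "SELECT body FROM messages WHERE workflow_id = ? AND topic = ?"
            " ORDER BY rowid LIMIT 1",
            (workflow_id, topic),
        ).fetchone()
        if row is not None:
            return row[0]  # the message arrived, possibly while we were down
        if now() >= deadline:
            return None  # timed out without a decision
        time.sleep(poll_s)
```

In use, the agent would call something like `recv(conn, "wf-1", "approval", timeout_s=3 * 24 * 3600)` while an approval endpoint calls `send(conn, "wf-1", "approval", "approved")`. Because both the deadline and the message live in the database, it does not matter whether the approval is written before, during, or after an agent restart.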