Reflexion (Verbal Reinforcement Learning)


Reflexion is a framework for improving language-model agents without retraining their weights. Instead of gradient updates, the agent receives a feedback signal after a failed attempt, writes a short natural-language reflection on what went wrong, and stores that reflection in an episodic memory buffer; on the next trial the reflection is appended to the prompt, steering the agent toward better decisions. The authors — Shinn and colleagues, in work presented at NeurIPS 2023 — called the approach "verbal reinforcement learning": the reinforcement signal is real, but the policy update is a sentence, not a weight change.

How the loop works

A Reflexion agent runs three roles in a cycle. An actor produces actions — code, tool calls, navigation steps — attempting the task. An evaluator scores the resulting trajectory; the signal can be a scalar reward, a hard check like a unit-test pass/fail, or free-form critique. A self-reflection model then converts that outcome into a concise lesson in plain language: "I assumed the file existed without checking," or "I looped on the same failing query instead of reformulating." The lesson lands in an episodic memory buffer, and future attempts see it in their context. Because the buffer competes for a finite context window, it is bounded — only the most recent and relevant reflections carry forward, a design constraint as much as a choice.
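The cycle above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three role signatures, the function name `reflexion_loop`, and the toy stand-ins are all hypothetical, and the bounded buffer here keeps only the most recent lessons (it does not model relevance).

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Hypothetical role signatures for this sketch (not taken from the paper's code).
Actor = Callable[[str, List[str]], str]        # (task, reflections) -> attempt
Evaluator = Callable[[str], Tuple[bool, str]]  # attempt -> (passed, feedback)
Reflector = Callable[[str, str, str], str]     # (task, attempt, feedback) -> lesson


def reflexion_loop(
    task: str,
    actor: Actor,
    evaluator: Evaluator,
    reflector: Reflector,
    max_trials: int = 3,
    memory_size: int = 3,
) -> Tuple[str, int, List[str]]:
    """Run up to max_trials attempts, carrying written lessons between them."""
    # Bounded episodic buffer: once full, the oldest lesson is dropped.
    memory: Deque[str] = deque(maxlen=memory_size)
    attempt = ""
    for trial in range(1, max_trials + 1):
        attempt = actor(task, list(memory))       # act with lessons in context
        passed, feedback = evaluator(attempt)     # score the attempt
        if passed:
            return attempt, trial, list(memory)
        memory.append(reflector(task, attempt, feedback))  # write a lesson
    return attempt, max_trials, list(memory)


# Toy stand-ins so the sketch runs without a model.
def toy_actor(task: str, reflections: List[str]) -> str:
    # Pretend the model only gets it right after seeing a lesson.
    return "4" if reflections else "5"


def toy_evaluator(attempt: str) -> Tuple[bool, str]:
    return attempt == "4", f"expected 4, got {attempt}"


def toy_reflector(task: str, attempt: str, feedback: str) -> str:
    return f"My answer {attempt!r} was wrong ({feedback}); recompute before answering."


result = reflexion_loop("What is 2 + 2?", toy_actor, toy_evaluator, toy_reflector)
print(result[0], result[1])  # answer and the trial on which it passed
```

In a real setup each stand-in would wrap a model call or a test harness; the loop itself stays this small.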

Why text instead of gradients

Storing lessons as language has unusual virtues. It needs no training pipeline, no GPU hours spent on training, and no access to the model's weights: the whole loop runs against a plain inference endpoint. The lessons are human-readable, so you can audit exactly what your agent has "learned" and delete a bad lesson with a keystroke, a transparency no weight update offers. And the feedback is dense: a reflection can express why an attempt failed and what to do differently, where a scalar reward says only "bad." The paper's headline result was on coding, 91% pass@1 on HumanEval against 80% for the GPT-4 baseline it compared to, where unit tests provide exactly the kind of crisp evaluator the method thrives on. It also reported gains on sequential decision-making (ALFWorld) and reasoning (HotpotQA) tasks.

The evaluator is the whole game

The trade-off is that gains depend entirely on the quality of the feedback signal. A noisy, vague, or gameable evaluator produces confident but wrong reflections, and the agent then diligently learns the wrong lesson — self-reinforcing error, delivered with perfect grammar. Reliable checks are what make the loop honest: real test suites, verifiable outcomes, or a separate verifier model grading trajectories. This is the same lesson reinforcement learning keeps teaching in every costume: optimization pressure flows toward whatever the evaluator actually measures, not what you meant it to measure.
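As an illustration of a "reliable check", here is a rough sketch of an evaluator built from unit-test cases. The helper name `make_test_evaluator` and the case format are assumptions made for this example. It returns specific failure messages rather than a bare pass/fail, since specific feedback is what the reflection step has to work with. Note that it runs candidate code with `exec` and no sandbox, which is acceptable only for trusted toy input.

```python
from typing import Callable, List, Tuple


def make_test_evaluator(
    func_name: str, cases: List[Tuple[tuple, object]]
) -> Callable[[str], Tuple[bool, str]]:
    """Build an evaluator that runs candidate source against (args, expected) cases."""

    def evaluate(source: str) -> Tuple[bool, str]:
        namespace: dict = {}
        try:
            # WARNING: unsandboxed execution; isolate this in any real deployment.
            exec(source, namespace)
        except Exception as exc:
            return False, f"code failed to load: {exc!r}"

        func = namespace.get(func_name)
        if not callable(func):
            return False, f"no callable named {func_name!r} was defined"

        failures = []
        for args, expected in cases:
            try:
                got = func(*args)
            except Exception as exc:
                failures.append(f"{func_name}{args} raised {exc!r}")
                continue
            if got != expected:
                failures.append(
                    f"{func_name}{args} returned {got!r}, expected {expected!r}"
                )

        if failures:
            return False, "; ".join(failures)
        return True, "all tests passed"

    return evaluate


evaluate_add = make_test_evaluator("add", [((1, 2), 3), ((-1, 1), 0)])
ok_result = evaluate_add("def add(a, b):\n    return a + b\n")
bad_result = evaluate_add("def add(a, b):\n    return a - b\n")
print(ok_result)
print(bad_result)
```

The evaluator is only as good as its cases: a candidate that hard-codes the two expected answers would pass here, which is the gameable-evaluator problem in miniature.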

Why it matters for sovereign builders

Reflexion is one of the most practical agent-improvement techniques for self-hosted setups precisely because it demands so little: a local model served by Ollama or llama.cpp, a task with a checkable outcome, and a text file's worth of memory. No fine-tuning rig, no data pipeline — trial, error, and written lessons, on hardware you control. It pairs naturally with iterative output polishing in Self-Refine and with the broader deliberation budget described under test-time compute: spend more inference, not more training, to get better behavior. The craftsman's summary: a bench notebook for your agent — cheap, legible, and exactly as trustworthy as the tests behind it.
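To make the "plain inference endpoint" concrete, the sketch below wires stored lessons into a prompt and sends it to a local Ollama server over its HTTP `/api/generate` endpoint. Several things here are assumptions: the helper names, the prompt layout, the model name `llama3` (substitute whatever model you have pulled), and a server listening on the default port 11434. The network call is left commented out so the block runs without a server.

```python
import json
import urllib.request
from typing import List


def build_prompt(task: str, reflections: List[str]) -> str:
    """Append stored lessons to the task to form the next trial's prompt."""
    if not reflections:
        return task
    lessons = "\n".join(f"- {r}" for r in reflections)
    return f"{task}\n\nLessons from earlier attempts:\n{lessons}"


def ollama_generate(
    prompt: str,
    model: str = "llama3",                  # assumed model name
    host: str = "http://localhost:11434",   # Ollama's default address
) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


prompt = build_prompt(
    "Write a function that reverses a string.",
    ["Check for empty input first."],
)
print(prompt)
# answer = ollama_generate(prompt)  # requires a running Ollama server
```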

One operational habit makes the technique markedly safer: review the reflection buffer periodically, exactly as you would review any configuration an automated process writes. Reflections are plain text, so pruning stale lessons, correcting a wrong one, or seeding a few hand-written rules of your own takes minutes and directly steers future behavior. An agent memory nobody reads drifts; an agent memory the operator curates becomes a genuinely useful, compounding asset — institutional knowledge for a workshop of one.
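One way to keep the buffer reviewable is to store it as a JSON Lines file that both the agent and the operator can edit. The following is a sketch under assumed conventions: the file name, the `trial`/`source`/`text` fields, and the age-based pruning rule are all hypothetical choices for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_reflections(path: Path) -> List[Dict]:
    """Read one JSON object per non-empty line; a missing file means no memory."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def save_reflections(path: Path, entries: List[Dict]) -> None:
    """Write entries back as JSON Lines, one reflection per line."""
    text = "".join(json.dumps(entry) + "\n" for entry in entries)
    path.write_text(text, encoding="utf-8")


def prune(entries: List[Dict], current_trial: int, max_age: int) -> List[Dict]:
    """Keep operator-written rules; drop agent lessons older than max_age trials."""
    return [
        e for e in entries
        if e["source"] == "operator" or current_trial - e["trial"] <= max_age
    ]


with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "reflections.jsonl"
    save_reflections(memory_file, [
        {"trial": 1, "source": "agent",
         "text": "Check that the file exists before reading it."},
        {"trial": 9, "source": "agent",
         "text": "Reformulate a failing query instead of retrying it."},
        # A hand-written rule seeded by the operator.
        {"trial": 0, "source": "operator",
         "text": "Never delete files outside the workspace."},
    ])
    kept = prune(load_reflections(memory_file), current_trial=10, max_age=5)
    save_reflections(memory_file, kept)
    kept_texts = [e["text"] for e in load_reflections(memory_file)]

print(kept_texts)
```

Because each reflection is a single line of text, ordinary tools such as a text editor, `grep`, or version control are enough to review and correct what the agent has written.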
