Grill-Me and Diagnose: Two AI Agent Skills That Changed How I Work

The Problem: AI Agents That Rush to Execute

We have all been there. You tell your AI assistant to build something, and within seconds it produces a plan, starts writing code, and confidently marches in the wrong direction. Thirty minutes later, you realize the misunderstanding happened in the first ten seconds.

The root cause is not stupidity. It is misalignment. The agent heard your words but did not understand your intent.

After months of debugging these failures, I discovered two skills from Matt Pocock that fundamentally changed how my AI agents work: grill-me and diagnose.

Skill 1: Grill-Me - The Art of Asking Before Doing

The Core Insight

Matt Pocock describes grill-me in just three sentences:

Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the decision tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.

Three sentences. But they invert the entire agent workflow.

Default flow: Human specifies, then Agent executes. Grill-me flow: Human describes, Agent interrogates, reach Shared understanding, then Execute.

The concept of a “decision tree” comes from Frederick Brooks’ The Design of Design. Every design is a tree of choices. You must walk every branch until all leaf nodes are resolved.

How I Enhanced It for My System

The original skill is beautifully minimal. But for my setup, I needed more guardrails. Here is what I added:

Three Iron Rules:

One question at a time. Multiple questions lead to shallow answers. The human will answer the last one and ignore the rest.
Every question must include a recommended answer. This prevents “interrogation fatigue.” The agent proposes, the human confirms or corrects. Efficient.
Never ask what you can find yourself. Search memory, read config files, check past decisions first. The human’s time is 100x more valuable than the agent’s.

Strict Exit Criteria:

Before declaring “alignment complete,” seven conditions must be met:

All decision points resolved
No dangling dependencies
Success criteria defined
Edge cases covered
Risks identified
Deliverable format confirmed
No contradictory decisions

If any condition fails, keep asking. No exceptions.

Durable Artifact:

The session output must be written to a file. Without a persistent artifact, agent statelessness erases everything. Next session starts cold.

Real Example

Human: "I want to build a project tracking dashboard"

Agent: Grill-Me activated.
       I am now in interrogator mode.
       I will not execute anything until we reach shared understanding.

       Q1: Target Users
       Who are the primary users of this system?

       My suggestion: Engineering team leads
       Reason: They need visibility across multiple projects
       Alternative: Entire company (but scope too broad)

Human: "Engineering team leads"

Agent: Q2: Success Criteria
       What does "done" look like for this system?
       ...

After 10-30 questions (depending on complexity), the agent generates a consensus document. The human confirms. Only then does execution begin.

Skill 2: Diagnose - Systematic Debugging for AI Agents

The Core Insight

Build the right feedback loop, and the bug is 90% fixed.

Most AI agents debug by guessing. They see an error, form a hypothesis, change something, and hope. This is not debugging. This is praying.

Matt Pocock’s diagnose skill enforces a six-phase loop with strict gates between each phase.

The Six Phases

Phase 1: Build a Feedback Loop

This is the skill. Everything else is mechanical. If you have a fast, deterministic pass/fail signal for the bug, you will find the cause.

Methods in priority order:

Direct retry (command fails, check exit code)
Unit test
HTTP/curl test
Log analysis
Minimal reproduction
Differential comparison
Stress loop (for non-deterministic bugs)

Phase 2: Reproduce

Run the loop. Watch the bug appear. Confirm it matches what the user described, not a different bug that happens to be nearby.

Phase 3: Hypothesize

Generate 3-5 ranked hypotheses before testing any of them. Each must be falsifiable:

“If X is the cause, then changing Y will make the bug disappear.”

If you cannot state the prediction, the hypothesis is a vibe. Discard it.

Phase 4: Instrument

Each probe must map to a specific prediction. Change one variable at a time. Tag every debug log with a unique prefix for cleanup.

Phase 5: Fix + Regression Test

Apply the fix. Re-run the feedback loop. Confirm the bug is gone and no side effects appeared.

Phase 6: Cleanup + Post-Mortem

Remove all debug instrumentation. Answer three questions:

What was the root cause?
What would have prevented this bug?
What did we learn?

Write it all down. Future you will thank present you.

Real Example: Memory Search Failure

Here is a real case I debugged recently:

Diagnose activated

Problem: Memory search occasionally fails
Method: 6-phase disciplined loop

Phase 1: Build feedback loop
  Run: openclaw memory status
  Found: "index metadata is missing"
  Vector search: paused

Phase 2: Reproduce
  Run: openclaw memory search "test"
  Confirmed: "Vector memory search is paused"

Phase 3: Hypotheses
  #1 Index metadata lost -> Predict: rebuild fixes it
  #2 Embedding API unstable -> Predict: curl test fails (ruled out)
  #3 Database corruption -> Predict: integrity check fails

Phase 4: Verify
  Test hypothesis #1: rebuild index
  Result: Matches prediction

Phase 5: Fix
  openclaw memory index --force
  Verify: search works

Phase 6: Post-mortem
  Root cause: Index metadata lost after update
  Prevention: Add health check to monitoring
  Learning: memory search depends on index metadata

Total time: 3 minutes. Zero guessing.

How These Skills Work Together

The two skills form a complete lifecycle:

Grill-Me (before) prevents bugs caused by misalignment
Execute the agreed plan
Diagnose (when things break) systematically resolves issues
Learn from post-mortems, compound over time
Grill-Me again next time, smarter than before

Both enforce discipline. Both produce durable artifacts. Both learn from history.

Implementation Notes

These skills are designed for OpenClaw, but the principles apply to any AI agent framework:

Role inversion is powerful. Make the agent challenge plans, not just execute them.
Feedback loops are everything. Without a pass/fail signal, you are guessing.
Gate each phase. Do not let the agent skip to “fix” before “reproduce.”
Persist everything. Agent sessions are stateless. Files are not.
Search history first. The answer might already exist.

The Results

Since implementing these skills:

Zero rework from misalignment (grill-me catches it upfront)
Faster debugging (diagnose eliminates guessing)
Better learning (post-mortems compound over time)
Higher trust (the agent asks before acting on risky tasks)

The investment is real. Grill-me sessions can take 10-30 minutes. But that is 10-30 minutes saved from rework, misunderstanding, and debugging.

What disciplined processes have you implemented to keep your AI agents aligned and effective?

The Problem: AI Agents That Rush to Execute#

Skill 1: Grill-Me - The Art of Asking Before Doing#

The Core Insight#

How I Enhanced It for My System#

Real Example#

Skill 2: Diagnose - Systematic Debugging for AI Agents#

The Core Insight#

The Six Phases#

Real Example: Memory Search Failure#

How These Skills Work Together#

Implementation Notes#

The Results#

The Problem: AI Agents That Rush to Execute

Skill 1: Grill-Me - The Art of Asking Before Doing

The Core Insight

How I Enhanced It for My System

Real Example

Skill 2: Diagnose - Systematic Debugging for AI Agents

The Core Insight

The Six Phases

Real Example: Memory Search Failure

How These Skills Work Together

Implementation Notes

The Results