Agentic Patterns Snippets

A reference for AI agent design patterns · Original: agentic-patterns.com

Context-Minimization Pattern

Context & Memory
Flow
User Input ──▶ [Transform] ──▶ Safe Output
    │              │
    │         ┌────┴────┐
    └────────▶│ REMOVE  │
              │ tainted │
              └─────────┘

Context:  [████ tainted ████] → [██ clean ██]
Example
sql = LLM("to SQL", user_prompt)
remove(user_prompt)  # tainted tokens gone
rows = db.query(sql)
answer = LLM("summarize", rows)  # clean context
Problem

User-supplied text lingers in context, enabling it to influence later generations and potentially inject malicious instructions

Solution

Purge untrusted segments after transforming into safe intermediate. Later reasoning sees only trusted data

When to use
  • Customer service chat
  • Medical Q&A systems
  • Multi-turn flows where input shouldn't steer later steps
Trade-offs

Pros

  • Simple, no extra models
  • Prevents prompt injection

Cons

  • Loses conversational nuance
  • May hurt UX if too aggressive
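
Sketch

A minimal runnable version of the purge step; llm() and run_query() are hypothetical stand-ins for a real model call and database client.

def llm(instruction: str, data: str) -> str:
    return f"[{instruction}] {data[:40]}"      # stub model call

def run_query(sql: str) -> list:
    return [{"total": 42}]                     # stub database

def answer(user_prompt: str) -> str:
    sql = llm("translate to SQL", user_prompt) # transform tainted input
    user_prompt = None                         # purge: raw text goes no further
    rows = run_query(sql)
    return llm("summarize", str(rows))         # later call sees only trusted rows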

Context Window Anxiety Management

Context & Memory
Flow
Context Window: [████████████░░░░░░░░] 60%

Without Management:          With Management:
├─ "Running low..."         ├─ "Plenty of space"
├─ Summarize early          ├─ Continue working
└─ Rush completion          └─ Thorough output

Buffer: Enable 1M ──▶ Cap at 200k ──▶ Psychological runway
Example
prompt = """
CONTEXT GUIDANCE: You have 200k+ tokens.
Do NOT rush or summarize prematurely.
""" + user_input + """
Remember: Context is NOT a constraint.
"""
Problem

Models exhibit 'context anxiety' near window limits, prematurely summarizing or rushing to complete tasks.

Solution

Provide buffer headroom and explicit reassurance that context is abundant to override anxiety behaviors.

When to use
  • Long coding or research sessions
  • Tasks requiring sustained attention
  • Model mentions 'running out of space'
Trade-offs

Pros

  • Prevents premature task abandonment
  • Enables more thorough work
  • Overcomes model behavioral quirks

Cons

  • Requires model-specific tuning
  • May increase token usage
  • Aggressive prompting overhead

Dynamic Context Injection

Context & Memory
Flow
User ──▶ "@Button.tsx" ──▶ [Read File] ──▶ Inject
                                              │
User ──▶ "/user:deploy" ──▶ [Load Cmd] ──────┤
                                              ▼
                         Agent [Enriched Context] ──▶ Continue
Example
# File injection via @mention
@src/components/Button.tsx

# Slash command injection
/user:deployment
# Loads ~/.claude/commands/deployment.md

# Both inject into agent context
Problem

Agents often need specific context on-demand, but constantly editing static files or pasting text is inefficient.

Solution

Use @mentions for files and /slash commands for reusable prompts to dynamically inject context during sessions.

When to use
  • Need specific file contents mid-task
  • Frequently reuse complex instructions
  • Want fluid context management
Trade-offs

Pros

  • Targeted context injection
  • Reusable slash commands
  • Efficient lazy loading

Cons

  • Must learn special syntax
  • May inject too much context
  • Command setup overhead
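
Sketch

A rough sketch of the expansion step, assuming the ~/.claude/commands layout from the example; the regexes are illustrative, not the actual syntax parser.

import re
from pathlib import Path

COMMANDS = Path.home() / ".claude" / "commands"

def expand(message: str) -> str:
    # Replace @path mentions with the file's contents
    message = re.sub(r"@(\S+)",
                     lambda m: Path(m.group(1)).read_text(), message)
    # Replace /user:<name> with the stored command prompt
    return re.sub(r"/user:(\w+)",
                  lambda m: (COMMANDS / (m.group(1) + ".md")).read_text(),
                  message)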

Filesystem-Based Agent State

Context & Memory
Flow
workspace/
├── state/
│   ├── step1_results.json  ◄── Checkpoint
│   ├── step2_results.json  ◄── Checkpoint
│   └── progress.txt
└── logs/
    └── execution.log

Step1 ──▶ [Save] ──▶ Step2 ──▶ [Save] ──▶ Step3
              │                    │
              ▼                    ▼
         [Interrupt]          [Resume]
Example
import json, os

if os.path.exists("state/step1.json"):
    with open("state/step1.json") as f:
        data = json.load(f)
else:
    data = perform_step1()
    with open("state/step1.json", "w") as f:
        json.dump(data, f)
# Resume from checkpoint if interrupted
Problem

Long-running agent workflows lose all progress when interrupted, as state in context window does not persist across sessions.

Solution

Persist intermediate results to files, creating durable checkpoints that enable workflow resumption and failure recovery.

When to use
  • Multi-step workflows with expensive operations
  • Long-running tasks exceeding session limits
  • Workflows needing recovery from failures
Trade-offs

Pros

  • Enables workflow resumption
  • Protects against data loss

Cons

  • Requires checkpoint/recovery logic
  • File I/O adds overhead
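
Sketch

The checkpoint logic generalizes to a small decorator; a sketch assuming steps return JSON-serializable results.

import functools, json, os

def checkpointed(path):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if os.path.exists(path):
                with open(path) as f:
                    return json.load(f)         # resume from checkpoint
            result = fn(*args, **kwargs)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            with open(path, "w") as f:
                json.dump(result, f)            # durable checkpoint
            return result
        return inner
    return wrap

@checkpointed("state/step1_results.json")
def perform_step1():
    return {"rows": 128}                        # expensive work happens here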

Layered Configuration Context

Context & Memory
Flow
Enterprise ──┐
  ~/.claude ─┼──▶ Merge ──▶ Agent Context
   project/ ─┤
    .local ──┘

Priority: local > project > user > enterprise
Example
# Auto-discovered hierarchy
/enterprise/CLAUDE.md    # Org policies
~/.claude/CLAUDE.md      # User prefs
./CLAUDE.md              # Project rules
./CLAUDE.local.md        # Personal overrides
Problem

Manually providing context in every prompt is cumbersome; a single global context file is either too broad or too narrow.

Solution

Auto-discover and merge layered config files (enterprise, user, project, local) by filesystem hierarchy.

When to use
  • Multi-project environments
  • Team-wide policies
  • Personal customization
Trade-offs

Pros

  • Zero manual context
  • Scoped customization

Cons

  • Merge conflicts
  • Discovery complexity
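
Sketch

A minimal merge in priority order, assuming the four locations from the example; higher-priority layers are appended last so their rules take precedence.

from pathlib import Path

LAYERS = [                                  # lowest to highest priority
    Path("/enterprise/CLAUDE.md"),
    Path.home() / ".claude" / "CLAUDE.md",
    Path("CLAUDE.md"),
    Path("CLAUDE.local.md"),
]

def merged_context() -> str:
    parts = [p.read_text() for p in LAYERS if p.is_file()]
    return "\n\n".join(parts)               # local overrides appear last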

Curated Code Context Window

Context & Memory
Flow
MainAgent: "Find UserModel definitions"
     │
     ▼
SearchSubagent ──▶ [Index] ──▶ Top 3 snippets
     │
     ▼
Context: [user_service.py] [models/user.py] [auth.py]
     │
     ▼
MainAgent: edit_file(UserService) ✓
Example
search = SearchSubagent(index="code_index")
snippets = search.find("UserModel", top_k=3)

context.inject(snippets)  # Only 3 files
agent.edit("UserService")  # Focused work
Problem

Dumping entire repositories into context overwhelms the model with noise and slows inference.

Solution

Use a search subagent to inject only top-K relevant code snippets into the main agent's context.

When to use
  • Working in large codebases
  • Need focused reasoning on specific modules
  • Token efficiency is critical
Trade-offs

Pros

  • Noise reduction improves clarity
  • Dramatically reduces token usage
  • Mitigates context anxiety

Cons

  • Index must stay fresh
  • Adds search subagent complexity
  • May miss edge-case dependencies
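
Sketch

A toy search subagent, sketched as keyword-hit ranking over a source tree; a real implementation would query an embedding or symbol index instead.

from pathlib import Path

def top_snippets(root: str, query: str, top_k: int = 3):
    scored = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        hits = text.lower().count(query.lower())
        if hits:
            scored.append((hits, str(path), text[:400]))  # head as snippet
    scored.sort(key=lambda t: -t[0])
    return scored[:top_k]                   # only these enter the context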

Episodic Memory Retrieval

Context & Memory
Flow
Episode Done ──▶ Write Memory Blob ──▶ [Vector DB]
                                           │
New Task ──▶ Embed Prompt ──▶ Query ───────┘
                               │
                               ▼
                         top-k Memories
                               │
                               ▼
               Agent [Context + Hints] ──▶ Execute
Example
# After episode completion
memory_db.write({
    "event": "refactored auth module",
    "outcome": "broke session handling",
    "rationale": "missed dependency"
})

# On new task
hints = memory_db.retrieve(task, top_k=3)
agent.execute(task, context_hints=hints)
Problem

Stateless calls make agents forget prior decisions, causing repetition and shallow reasoning.

Solution

Add vector-backed episodic memory: store event/outcome/rationale blobs, retrieve top-k similar memories, inject as hints.

When to use
  • Long-running agent sessions
  • Need continuity across tasks
  • Avoiding repeated mistakes
Trade-offs

Pros

  • Richer continuity
  • Fewer repeated mistakes
  • Learns from past

Cons

  • Retrieval noise if uncurated
  • Storage cost
  • Stale memory issues
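
Sketch

A self-contained toy version; embed() stands in for a real embedding model and a plain list stands in for the vector DB.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicMemory:
    def __init__(self, embed):
        self.embed, self.rows = embed, []

    def write(self, blob: dict):
        self.rows.append((self.embed(str(blob)), blob))

    def retrieve(self, task: str, top_k: int = 3):
        q = self.embed(task)
        ranked = sorted(self.rows, key=lambda r: -cosine(r[0], q))
        return [blob for _, blob in ranked[:top_k]]  # inject as hints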

Memory Synthesis from Execution Logs

Context & Memory
Flow
Task Diaries              Synthesis
────────────              ─────────
[Diary 1] ──┐
[Diary 2] ──┼──▶ Synthesis Agent ──▶ Patterns
[Diary 3] ──┘              ↓
                    CLAUDE.md / Commands
Example
# Synthesis prompt
synthesis_agent("""
Review 50 task diaries.
Find patterns appearing 3+ times.
Output: rules, commands, tests
""")
Problem

Task logs contain valuable learnings but are too specific; hard to know which patterns generalize.

Solution

Write structured task diaries, then periodically synthesize across logs to extract reusable patterns.

When to use
  • Recurring task types
  • Learning from failures
  • Building organizational memory
Trade-offs

Pros

  • Pattern discovery
  • Evidence-backed rules

Cons

  • Storage overhead
  • False pattern risk

Proactive Agent State Externalization

Context & Memory
Flow
Agent ──▶ [Work] ──▶ Write Notes ──▶ [state.md]
                          │
                          ▼
               ┌─────────────────────┐
               │  Template Schema    │
               │  • Objective        │
               │  • Progress         │
               │  • Knowledge Gaps   │
               └─────────────────────┘
                          │
                          ▼
                Validate ──▶ External Memory
Example
class StateManager:
    def capture_state(self, agent_notes):
        structured = self.parse_notes(agent_notes)
        missing = self.validate(structured)
        if missing:
            return self.prompt_clarification(missing)
        return self.merge_with_memory(structured)
Problem

Models proactively write notes to preserve state, but self-generated summaries are often incomplete and may consume tokens better spent on task execution.

Solution

Provide structured templates and validation for agent self-documentation, combining agent notes with external memory systems as fallback.

When to use
  • Long-running development sessions
  • Multi-session research tasks
  • Subagent coordination requiring state communication
Trade-offs

Pros

  • Leverages natural model behavior
  • Enables session continuity
  • Creates audit trails

Cons

  • May consume tokens on documentation over progress
  • Risk of incomplete self-assessment
  • Requires validation overhead

Curated File Context Window

Context & Memory
Flow
Task: "Add validation to signup()"
           │
           ▼
  ┌─────────────────────────────────────┐
  │ PRIMARY: UserController.java (full) │
  ├─────────────────────────────────────┤
  │ CONTEXT SNIPPETS:                   │
  │  - UserService.validateUser()       │
  │  - SignupDTO: fields + annotations  │
  └─────────────────────────────────────┘
           │
           ▼
      Agent: Focused edit ✓
Example
primary = load_full("UserController.java")
secondary = search.find("signup", top_n=5)
snippets = [extract_methods(f) for f in secondary]

context = f"""
### PRIMARY: {primary}
### SNIPPETS: {snippets}
"""
Problem

Loading all files into prompt exceeds token limits and introduces noise from unrelated code.

Solution

Load only primary files plus summarized secondary files from a search subagent's ranked results.

When to use
  • Multi-file refactoring tasks
  • Feature implementation in large repos
  • Need to minimize hallucinations
Trade-offs

Pros

  • Minimal prompt size, on-target
  • Faster responses, fewer hallucinations
  • Scales to large repositories

Cons

  • Requires file-search service
  • May miss critical files if ranking is off
  • Index needs to stay current

Agent-Powered Codebase Q&A / Onboarding

Context & Memory
Flow
Developer                  Agent                  Codebase
    │                        │                        │
    │  "Where is DB config?" │                        │
    ├───────────────────────▶│                        │
    │                        │   Search/Index/Analyze │
    │                        ├───────────────────────▶│
    │                        │◀───────────────────────┤
    │  "config/database.js"  │                        │
    │◀───────────────────────┤                        │
Example
# Natural language codebase Q&A
agent.ask("How does user auth work?")
# Agent searches, analyzes, responds:
# "Auth is in auth/service.py, uses JWT,
#  called by UserController.login()"
Problem

Understanding and onboarding onto a large codebase is hard, and manually tracing code paths is time-consuming

Solution

An AI agent with search, indexing, and Q&A capabilities answers natural-language questions and explains how the code is structured

When to use
  • Onboarding onto a new project
  • Debugging complex systems
  • Understanding how pieces of code interact
  • Quickly grasping the purpose of a specific module or file
Trade-offs

Pros

  • Drastically shortens onboarding time
  • Explore the codebase in natural language
  • Accurate, context-grounded answers

Cons

  • Answer quality depends on indexing quality
  • Indexing cost for large codebases

Background Agent with CI Feedback

Feedback Loops
Flow
Dev ──▶ Agent: "Upgrade to React 19"
              │
              ▼
        [Push Branch] ──▶ CI Tests
              │               │
              │◀──── 12 fails ┘
              ▼
        [Patch imports]
              │
              ▼
        [Re-run CI] ──▶ All Green
              │
              ▼
        Agent ──▶ Dev: "PR ready!"
Example
# Agent runs in background
git checkout -b react19-upgrade
# Make changes...
git push origin react19-upgrade

# CI runs automatically
# Agent polls for results
# On failure: patch and retry
# On success: notify developer
Problem

Long-running tasks tie up the editor and require developers to babysit the agent throughout execution.

Solution

Run agent asynchronously: push branch, wait for CI, ingest pass/fail output, iterate automatically, and notify when green.

When to use
  • Long-running upgrade or refactoring tasks
  • Developer wants to work on other things
  • Mobile kick-offs (e.g., fix tests while away)
Trade-offs

Pros

  • Developer freed from babysitting
  • Uses existing CI as feedback loop
  • Fully autonomous iteration

Cons

  • Requires CI integration and permissions
  • May iterate on wrong direction without oversight
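
Sketch

A sketch of the background loop; ci_status() and patch_failures() are hypothetical hooks onto your CI provider and agent.

import time

def ci_status(branch: str) -> str:
    # Hypothetical: query the CI API; returns "pending" | "failed" | "green"
    ...

def background_loop(branch: str, patch_failures, max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        while (status := ci_status(branch)) == "pending":
            time.sleep(60)                  # poll; developer works elsewhere
        if status == "green":
            return True                     # notify: PR ready
        patch_failures(branch)              # ingest failures, push fixes
    return False                            # escalate to a human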

Dogfooding with Rapid Iteration

Feedback Loops
Flow
┌─────────────────────────────────────────────┐
│              DOGFOODING LOOP                │
└─────────────────────────────────────────────┘

Build ──▶ Use Daily ──▶ Find Issues ──▶ Fix Fast
  ▲                                        │
  └────────────────────────────────────────┘

Team(70-80%) ──▶ Feedback(5min) ──▶ Iterate
Example
# Anthropic "ant-fooding" approach
team_adoption = 0.80  # 80% daily usage
feedback_interval = "5min"

# Feature validation flow
feature = push_to_internal()
feedback = collect_team_feedback(feature)
if not feedback.positive:
    unship(feature)  # Fast pivot
Problem

External feedback loops are slow and simulated environments miss real-world nuances for agent improvement.

Solution

Development team uses their own AI agent daily, creating tight feedback loops for rapid iteration and honest assessment.

When to use
  • Building AI-assisted dev tools
  • Need rapid feature validation
  • Team wants unfiltered feedback
Trade-offs

Pros

  • Direct, immediate feedback
  • Real-world problem testing
  • Fast iteration cycles

Cons

  • May bias toward developer needs
  • Internal users != all users
  • Requires team adoption commitment

Graph of Thoughts (GoT)

Feedback Loops
Flow
        ┌── T1 ──┐
        │        ▼
Problem ┼── T2 ──┼──▶ Aggregate ──▶ Refine ──▶ Solution
        │        ▲         ▲
        └── T3 ──┘         │
             │             │
             └── Loop ─────┘

Branch → Aggregate → Refine → Loop back if needed
Example
got = GraphOfThoughts(llm, max_thoughts=50)
got.add_thought(root)
for thought in thoughts_to_expand:
    got.branch_thought(thought)   # Generate alternatives
    got.aggregate_related()       # Combine insights
    got.refine_thought(thought)   # Improve based on context
return got.extract_best_solution()
Problem

Linear Chain-of-Thought reasoning cannot handle problems with complex interdependencies requiring paths that merge, split, and recombine.

Solution

Represent reasoning as a directed graph where thoughts can branch, aggregate, refine, and loop back for iterative improvement.

When to use
  • Complex problems with interdependent reasoning
  • Tasks requiring insight aggregation
  • Problems needing iterative refinement
Trade-offs

Pros

  • Handles non-linear reasoning
  • Combines insights from multiple paths

Cons

  • Higher computational cost
  • Complex to implement

Reflection Loop

Feedback Loops
Flow
          ┌─────────── Feedback ──────────┐
          │                               │ No
          ▼                               │
Input ──▶ Generate ──▶ Evaluate ──▶ [score ≥ θ?] ──▶ Done
                                                 Yes

Quality:  ░░░░ → ▒▒▒▒ → ▓▓▓▓ → ████
Pseudo
for attempt in range(max_iters):
    draft = generate(prompt)
    score, critique = evaluate(draft, metric)

    if score >= threshold:
        return draft

    prompt = incorporate(critique, prompt)

return draft  # best effort
Problem

If an LLM never reviews its own output, quality may be low or requirements may go unmet

Solution

After generating a draft, loop through self-evaluation → feedback incorporation → regeneration until the criteria are met

When to use
  • Quality or compliance with explicit criteria matters
  • Code, writing, and reasoning tasks
  • A clear evaluation metric exists
Trade-offs

Pros

  • Automatic quality improvement
  • Enforces explicit criteria

Cons

  • Extra compute cost
  • Can loop forever if the metric is ambiguous

Spec-As-Test Feedback Loop

Feedback Loops
Flow
Spec Change ──▶ Generate Tests ──▶ Run Tests ──┐
                                                │
    ┌───────────────────────────────────────────┘
    │  Fail?
    ▼
Agent PR: Fix Code or Flag Spec ──▶ Review
Example
def on_commit(spec):  # fires on every spec or code commit
    tests = generate_tests(spec.latest)
    result = run_tests(tests)
    if result.failed:
        create_pr(fix_or_flag(result))
Problem

Implementations can drift from specs as code evolves, causing silent divergence

Solution

Auto-generate executable tests from specs and run on every commit, with agent-authored PRs for fixes

When to use
  • Spec-first development workflows
  • Critical systems requiring spec-impl sync
  • Continuous integration environments
Trade-offs

Pros

  • Catches drift early
  • Keeps spec and impl in lock-step

Cons

  • Heavy CI usage
  • False positives with loose spec wording

Tool Use Incentivization via Reward Shaping

Feedback Loops
Flow
Agent ──▶ compile() ──▶ [+1.0] ──▶ lint() ──▶ [+0.5]
                │                       │
                └───────────────────────┴──▶ test() ──▶ [+2.0]
                                                         │
                                              ∑ rewards ──▶ Policy Update
Example
# RL step: shaped rewards for tool calls
if action == "compile":
    local_reward = 1 if compile_success else -0.5
elif action == "run_tests":
    local_reward = 2 if new_tests_passed else 0
trajectory.append((state, action, local_reward))
Problem

Agents underutilize tools (compilers, linters, tests) and default to internal thinking tokens instead of invoking external tools.

Solution

Provide dense shaped rewards for each intermediate tool invocation (+1 compile, +2 test pass) to guide policy toward tool usage.

When to use
  • RL-based agent training
  • Tool use is underutilized
  • Sparse final rewards are insufficient
Trade-offs

Pros

  • Denser feedback guides step-by-step
  • Encourages tool adoption

Cons

  • Reward engineering overhead
  • May game intermediate rewards
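
Sketch

The shaping table and episode return, sketched with the reward values from the example above; treat the numbers as tunable assumptions.

TOOL_REWARDS = {
    ("compile", True): 1.0,   ("compile", False): -0.5,
    ("lint", True): 0.5,      ("lint", False): 0.0,
    ("run_tests", True): 2.0, ("run_tests", False): 0.0,
}

def shaped_return(trajectory):
    # trajectory: [(action, success), ...] from one episode
    return sum(TOOL_REWARDS.get(step, 0.0) for step in trajectory)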

Coding Agent CI Feedback Loop

Feedback Loops
Flow
Agent ──▶ [Create Branch] ──▶ CI
              │                  │
              │◀── Partial Fails ┘
              ▼
        [Patch Specific Files]
              │
              ▼
        [Re-run Failed Tests] ──▶ CI
              │                    │
              │◀──── Still Fails? ─┘
              │        │
              │        └──▶ Patch Again
              ▼
        All Green ──▶ User: "PR Ready"
Example
# Error parsing from CI logs
def parse_ci_failures(logs):
    return [
        {"file": "auth.py", "line": 42,
         "error": "Expected status 200"}
    ]

# Prioritized re-run
agent.run_tests(only_files=patched_files)
# Notify on completion
if all_green: notify_user("PR ready")
Problem

Synchronous test runs block the agent from parallel work, creating idle compute and inflated training times as the agent babysits builds.

Solution

Run agent async against CI: push branch, poll for partial failures, patch iteratively, notify on final green.

When to use
  • Multi-file refactors or feature additions
  • Long test suites that block iteration
  • Need to maximize compute utilization
Trade-offs

Pros

  • Compute efficiency - overlaps generation and testing
  • Faster iteration with less waiting
  • Autonomous until final green

Cons

  • CI flakiness can mislead patches
  • Security - agent needs CI push/read permissions
View Original →

Inference-Healed Code Review Reward

Feedback Loops
Flow
             ┌── Correctness ──▶ 1.0 ──┐
             │                         │
Patch ──▶ Critic ─── Style ──────▶ 0.8 ──┼──▶ Weighted ──▶ 0.7
             │                         │      Sum
             ├── Performance ──▶ 0.4 ──┤
             │                         │
             └── Security ────▶ 0.6 ──┘

+ CoT: "O(n^2) loop caused perf regression"
Example
# Example weights (assumed); tune per project
w = {"correctness": 0.5, "style": 0.1,
     "performance": 0.2, "security": 0.2}
subscores = {
    "correctness": test_critic.score(patch),
    "style": linter_critic.score(patch),
    "performance": perf_critic.score(patch),
    "security": security_critic.score(patch),
}
final = sum(w[k]*subscores[k] for k in subscores)
return final, subscores, "O(n^2) loop detected"
Problem

Binary 'tests passed' rewards miss nuanced code quality issues like performance regressions, style violations, and security problems.

Solution

Use a multi-criteria code review critic that decomposes quality into subcriteria (correctness, style, perf, security) with explainable subscores.

When to use
  • RL-based code generation agents
  • When code quality beyond correctness matters
  • Continuous improvement of code agents
Trade-offs

Pros

  • Explainable feedback for targeted fixes
  • Higher code quality via non-functional criteria

Cons

  • Compute overhead
  • Critic model maintenance

Rich Feedback Loops

Feedback Loops
Flow
Agent ──▶ [Action] ──▶ [Tool] ──▶ Feedback ──┐
  ▲                                          │
  │       ┌──────────────────────────────────┘
  │       ▼
  └─── [Parse] ◀── errors, test fails, lint, screenshots

Loop: action → feedback → fix → action → ...
Concept
# Rich feedback > perfect prompts
result = agent.run(task)
feedback = get_diagnostics(result)  # errors, lint
if feedback.has_issues:
    agent.fix(feedback)  # self-debugging loop
Problem

Polishing a single prompt can't cover every edge case. Agents need ground truth to self-correct

Solution

Expose iterative, machine-readable feedback (compiler errors, test failures, lint) after every tool call. Agent uses diagnostics to self-debug

When to use
  • Code generation tasks
  • Any task with verifiable output
  • When tests/linters are available
Trade-offs

Pros

  • Emergent self-debugging
  • Better than bigger prompts

Cons

  • Requires feedback infrastructure
  • More iterations = more tokens

Self-Critique Evaluator Loop

Feedback Loops
Flow
Instruction ──▶ Generate Candidates ──▶ [A, B, C]
                                            │
                                            ▼
                                   Judge: "B > A > C"
                                   (with reasoning)
                                            │
                                            ▼
                                  Fine-tune on Judgments
                                            │
                     ┌──────────────────────┴──────────────────────┐
                     ▼                                             ▼
              Improved Evaluator                          Use as Reward Model
Example
def self_taught_evaluator_loop(model, instructions):
    candidates = [model.generate(i) for i in instructions]
    judgments = model.judge_and_explain(candidates)
    model.finetune(judgments)  # Train on own traces
    return model  # Now better at evaluation
Problem

Human preference labels are costly and quickly become outdated as base models improve, creating a bottleneck for reward model training.

Solution

Train a self-taught evaluator that bootstraps from synthetic data: generate candidates, judge with reasoning traces, fine-tune on its own judgments, and iterate.

When to use
  • Human labels too expensive to scale
  • Base models evolving rapidly
  • Need automated quality gates
Trade-offs

Pros

  • Near-human eval accuracy without labels
  • Scales with compute
  • Self-improving over time

Cons

  • Risk of evaluator-model collusion
  • Needs adversarial testing
  • May amplify systematic errors

Self-Discover Reasoning Structures

Feedback Loops
Flow
Task ──▶ Analyze ──▶ Select Modules ──▶ Adapt ──▶ Compose
                         │
         ┌───────────────┴───────────────┐
         │  Module Library               │
         │  • Break into steps           │
         │  • Work backwards             │
         │  • Find patterns              │
         └───────────────────────────────┘
                                              │
                                              ▼
                              Task-Specific Reasoning Structure
                                              │
                                              ▼
                                         Execute ──▶ Solution
Example
def self_discover_solve(task, modules):
    # Select relevant modules for this task
    selected = llm.select(task, modules)
    # Adapt to specific problem
    adapted = llm.adapt(task, selected)
    # Compose reasoning structure
    structure = llm.compose(adapted)
    return llm.solve_with(task, structure)
Problem

Different reasoning tasks require different thinking strategies. Fixed reasoning patterns like Chain-of-Thought may be suboptimal for diverse problems.

Solution

Enable LLMs to automatically discover and compose task-specific reasoning structures by selecting and adapting atomic reasoning modules to match the problem's characteristics.

When to use
  • Diverse reasoning tasks
  • Standard CoT underperforming
  • Novel problem types
Trade-offs

Pros

  • Up to 32% improvement over CoT
  • Creates reusable reasoning templates
  • Adapts to novel problems

Cons

  • Overhead for structure discovery
  • May over-engineer simple problems
  • Depends on task analysis accuracy

Agent Reinforcement Fine-Tuning

Learning & Adaptation
Flow
Sample ──▶ Model ──▶ Tool Call? ──▶ Your Endpoint
              ▲          │              │
              │          ▼              ▼
              │     Final Answer   Add to Context
              │          │              │
              │          ▼              │
              │      Grader ◀───────────┘
              │          │
              └── Reward ┘

[End-to-end training with real tools]
Example
client.fine_tuning.jobs.create(
    model="gpt-4o",
    method="rft",
    rft={
        "tools": [{"url": "https://api/search"}],
        "grader": {"type": "model", "model": "gpt-4o"},
        "hyperparameters": {"compute_multiplier": 1}
    }
)
Problem

Base models underperform on domain-specific tasks due to distribution shift and inefficient tool use

Solution

Train the model end-to-end on agentic tasks with real tool calls, custom graders, and multi-step reinforcement learning

When to use
  • Distribution shift from the base model
  • Inefficient tool-use patterns
  • 100-1000 quality samples are available
Trade-offs

Pros

  • End-to-end optimization
  • Sample-efficient (100-1000 samples)

Cons

  • Infrastructure complexity
  • Requires careful reward engineering

AI-Assisted Code Review

Feedback Loops
Flow
Code ──▶ AI Analyzer ──▶ Issues/Summary ──▶ Human Review
  │                            │                   │
  │   "Explain this change?"   │                   │
  └────────────────────────────┴───── Q&A ◀───────┘

Intent Alignment: [Mind's Eye] ◀─── Verify ─── [Generated Code]
Example
# PR Review workflow
ai_review = agent.analyze_pr(diff)
print(ai_review.summary)
print(ai_review.issues)

# Interactive Q&A
answer = agent.explain("Why was this refactored?")
if aligned_with_intent(answer):
    pr.approve()
Problem

As the volume of AI-generated code grows, verification becomes the bottleneck; checking alignment with intent matters as much as syntactic correctness

Solution

Use AI to analyze code changes, surface issues and summaries, and verify intent alignment through interactive Q&A

When to use
  • PR reviews of AI-generated code
  • Reviewing large-scale codebase changes
  • Verifying the output of vaguely specified tasks
  • Improving review efficiency across the team
Trade-offs

Pros

  • Faster, more efficient reviews
  • AI explanations improve understanding
  • Enables intent-alignment verification

Cons

  • Depends on the accuracy of AI analysis
  • Over-trust risks overlooking errors

Compounding Engineering Pattern

Learning & Adaptation
Flow
┌─────────────────────────────────────────────────────┐
│               COMPOUNDING LOOP                      │
│                                                     │
│  Build ──▶ Learn ──▶ Codify ──▶ Easier Build       │
│    ▲                               │                │
│    └───────────────────────────────┘                │
│                                                     │
│  Outputs: CLAUDE.md, /commands, hooks, subagents   │
└─────────────────────────────────────────────────────┘
Example
# After completing feature:
1. Update CLAUDE.md with patterns
2. Create /test-with-validation command
3. Add pre-commit hook for edge cases
4. Build security-review subagent
Problem

Traditional engineering has diminishing returns; AI agents repeat mistakes because learnings aren't codified.

Solution

Codify all learnings from each feature into prompts, commands, and hooks to make subsequent features easier.

When to use
  • Building features with AI agents
  • Onboarding new team members
  • Agents repeating the same mistakes
Trade-offs

Pros

  • Accelerating productivity over time
  • Knowledge preserved beyond individuals
  • Better agent and human onboarding

Cons

  • Upfront documentation time
  • Risk of prompt bloat
  • Requires extensible agent system

Skill Library Evolution

Learning & Adaptation
Flow
Ad-hoc ──▶ Save ──▶ Reusable ──▶ Documented ──▶ Capability
  │         │          │            │              │
  └─────────┴──────────┴────────────┴──────────────┘
                    skills/
              ├── sentiment.py
              ├── pdf_convert.py
              └── api_summary.py
Concept
# Session 1: Save working solution
with open("skills/sentiment.py", "w") as f:
    f.write(working_code)

# Session N: Reuse existing skill
from skills.sentiment import analyze
result = analyze(text)  # no rediscovery
Problem

Agents solve similar problems across sessions but must rediscover solutions each time, wasting tokens

Solution

Persist working code as reusable functions in skills/ directory. Over time, evolve into documented, tested capabilities

When to use
  • Repetitive problem-solving across sessions
  • Organization wants agents to build capability over time
  • Code reuse is valuable
Trade-offs

Pros

  • Builds agent capability over time
  • Reduces token consumption

Cons

  • Requires discipline to organize
  • Skills can become stale

Variance-Based RL Sample Selection

Learning & Adaptation
Flow
Score
1.0 ●━━━━●━━━━●━━━━●     ← Always correct (no learning)
    ┃
0.5 ┃  ●━━●━━●           ← HIGH VARIANCE (train here!)
    ┃  ┃  ▼
0.0 ●━━●━━━━●━━━━●━━━●   ← Always wrong (no learning)
    └──┴──┴──┴──┴──┴──▶
       Sample Index

Legend: ● best │ ━ mean │ ▼ variance range
Example
import numpy as np

# Run multiple evals per sample
high_variance_samples = []
for sample in dataset:
    scores = [agent.eval(sample) for _ in range(3)]
    variance = np.var(scores)
    if variance > 0.01 and 0 < np.mean(scores) < 1:
        high_variance_samples.append(sample)
Problem

Zero-variance samples (always correct or always wrong) provide no learning signal, wasting compute in RL training.

Solution

Run multiple baseline evals per sample, plot variance, prioritize high-variance samples where model sometimes succeeds.

When to use
  • RL training with limited budget
  • Dataset may contain many solved/unsolvable samples
  • Need to estimate improvement potential
Trade-offs

Pros

  • Data efficiency - focus on learnable samples
  • Predictive - estimate potential before training

Cons

  • Upfront eval cost (3-5x baselines)
  • Variance changes during training

Sub-Agent Spawning

Orchestration
Flow
                       ┌─── Sub1 ──▶ [████] ───┐
                       │                       │
Main ──▶ Split(n) ────┼─── Sub2 ──▶ [████] ───┼──▶ Merge ──▶ Done
                       │                       │
                       └─── Sub3 ──▶ [████] ───┘

Context:  Main[████████████]  →  Sub[██] Sub[██] Sub[██]
Example
main_agent.spawn_subagents(
    task="Refactor YAML front-matter",
    files=glob("*.md"),  # 36 files
    agents=3,
    per_agent=12
)
# Each subagent gets fresh context
# Main agent merges results
Problem

On large multi-file tasks, the main agent's context window balloons past its reasoning budget

Solution

Spawn independent subagents, distribute the work in parallel, then merge the results

When to use
  • Multi-file edits (10+ files)
  • Work that splits into independent chunks
  • Sequential processing is too slow
Trade-offs

Pros

  • Speedup from parallel processing
  • Context isolation

Cons

  • Inter-agent coordination complexity
  • Cost of merging results

Action-Selector Pattern

Orchestration & Control
Flow
User Prompt ──▶ LLM ──▶ Action ID ──▶ Execute
                 │                        │
                 │    ┌───────────────────┘
                 │    │
                 ▼    ▼
            Allowlist Check    Tool Output
                              (NOT fed back!)

[LLM as decoder only - no feedback loop]
Example
allowlist = ["check_balance", "transfer", "history"]
action = llm.translate(user_prompt, allowlist)
if action not in allowlist:
    raise ValueError("action not permitted")  # reject anything off-list
result = execute(action)
# CRITICAL: result NOT returned to LLM
return result  # Direct to user
Problem

When tool feedback re-enters the context window, untrusted input can hijack the agent's reasoning

Solution

Use the LLM only to map requests onto pre-approved actions; tool output is never fed back to the model

When to use
  • High-security environments
  • Customer service bots
  • Constrained action sets (kiosks, routers)
Trade-offs

Pros

  • Nearly immune to prompt injection
  • Very easy to audit

Cons

  • Limited flexibility
  • New capabilities require code changes

Autonomous Workflow Agent Architecture

Orchestration
Flow
Workflow ──▶ Container ──▶ tmux Sessions
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
       [Session 1]      [Session 2]      [Session 3]
            │                 │                 │
            └─────────┬───────┘                 │
                      ▼                         │
               Monitor & Wait ◀─────────────────┘
                      │
              Error? ─┼─▶ Retry/Recover
                      │
                      ▼
               Checkpoint ──▶ Next Step
Example
class WorkflowAgent:
    def execute_workflow(self, steps):
        for step in steps:
            session = self.create_session(step.name)
            try:
                result = self.execute_step(step, session)
                self.create_checkpoint(step, result)
            except Exception as e:
                if self.can_retry(e):
                    self.retry_with_backoff(step)
                else:
                    self.escalate_to_human(step, e)
Problem

Complex engineering workflows require extensive human oversight, with manual coordination, monitoring, and intervention at each step.

Solution

Containerized agents with tmux session management, intelligent monitoring, and context-aware error recovery for autonomous multi-step execution.

When to use
  • Model training and evaluation pipelines
  • Infrastructure provisioning and configuration
  • Multi-stage deployment workflows
Trade-offs

Pros

  • 1.2-1.4x speedup in workflow execution
  • Reduced human intervention for routine steps
  • Comprehensive automatic logging

Cons

  • Limited handling of novel failure scenarios
  • Context window constraints for long workflows
  • Setup complexity for containers and monitoring

Continuous Autonomous Task Loop

Orchestration & Control
Flow
┌──────────────────────────────────────────────┐
│              AUTONOMOUS LOOP                 │
│                                              │
│  Pick Task ──▶ Execute ──▶ Commit ──▶ Next  │
│      ▲                           │           │
│      └───────────────────────────┘           │
│                                              │
│  Rate Limit? ──▶ Exponential Backoff ──▶ ↺  │
└──────────────────────────────────────────────┘
Example
MAX_ITERATIONS=50
BACKOFF=300  # seconds to wait when rate limited
i=0

while [ $i -lt $MAX_ITERATIONS ]; do
  task=$(claude "Pick next from TODO.md")
  claude --auto-accept "$task" || sleep $BACKOFF  # back off on failure
  git add -A && git commit -m "$task"
  i=$((i + 1))
done
Problem

Manual task orchestration—selection, commits, rate-limit handling—interrupts developer flow.

Solution

Implement continuous loop with fresh context per task, auto-commits, and intelligent backoff.

When to use
  • Batch processing TODO lists
  • Overnight autonomous development
  • Discrete, well-defined tasks
Trade-offs

Pros

  • Complete autonomy
  • Fresh context per task
  • Handles rate limits gracefully

Cons

  • Reduced human oversight
  • Elevated permission requirements
  • Risk of runaway execution

Distributed Execution with Cloud Workers

Orchestration
Flow
                    ┌──▶ Worker1 [Worktree-A] ──┐
                    │                           │
Coordinator ──▶ ────┼──▶ Worker2 [Worktree-B] ──┼──▶ Merge ──▶ main
                    │                           │
                    └──▶ Worker3 [Worktree-C] ──┘

Tasks:  [████████] → [██] [██] [██] → [████████]
Example
coordinator.deploy_workers(
    tasks=["refactor auth", "update API", "fix tests"],
    workers=3,
    git_worktrees=True
)
# Each worker: isolated worktree + Claude session
# Coordinator: monitors, merges PRs
Problem

Single-session AI agent execution cannot scale to meet enterprise team demands with multiple simultaneous code changes.

Solution

Run multiple Claude sessions in parallel using git worktrees and cloud workers with centralized coordination.

When to use
  • Team-wide code migrations
  • Parallel feature development
  • Large-scale infrastructure changes
Trade-offs

Pros

  • 10x-100x parallelization speedup
  • Scales to enterprise teams
  • Centralized monitoring

Cons

  • Significant infrastructure complexity
  • Merge conflict overhead
  • Higher parallel model costs

Feature List as Immutable Contract

Orchestration & Control
Flow
feature-list.json (IMMUTABLE)
┌─────────────────────────────────┐
│ auth-001: [░] New chat button   │
│ auth-002: [░] Logout function   │◄── Agent can ONLY
│ ui-001:   [█] Dark mode         │    set passes=true
│ ui-002:   [░] Responsive nav    │
│ api-001:  [░] Rate limiting     │◄── Cannot DELETE
└─────────────────────────────────┘    or MODIFY

Agent ──▶ Implement ──▶ Test ──▶ [█] Mark Done
Example
{
  "features": [{
    "id": "auth-001",
    "description": "New chat creates conversation",
    "steps": ["Click button", "Verify URL"],
    "passes": false  // Agent can ONLY flip to true
  }]
}
Problem

Long-running agents declare premature victory, delete tests to pass, or lose track of requirements across sessions.

Solution

Define all features upfront in an immutable JSON that agents can mark complete but cannot modify or delete.

When to use
  • Building complete applications with known requirements
  • Projects spanning many agent sessions
  • When agent accountability is critical
Trade-offs

Pros

  • Prevents premature completion claims
  • Eliminates 'pass by deletion' attacks

Cons

  • Requires upfront feature specification
  • Rigid for changing requirements
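
Sketch

The contract can be enforced outside the agent; a sketch that rejects any update other than flipping passes to true.

def valid_update(original: dict, proposed: dict) -> bool:
    orig = {f["id"]: f for f in original["features"]}
    prop = {f["id"]: f for f in proposed["features"]}
    if orig.keys() != prop.keys():
        return False                        # no additions or deletions
    for fid, new in prop.items():
        old = orig[fid]
        strip = lambda f: {k: v for k, v in f.items() if k != "passes"}
        if strip(new) != strip(old):
            return False                    # all other fields are immutable
        if old["passes"] and not new["passes"]:
            return False                    # cannot un-complete a feature
    return True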

Iterative Multi-Agent Brainstorming

Orchestration & Control
Flow
              ┌─── Agent A ──▶ Ideas 1 ───┐
              │                           │
Problem ──▶   ├─── Agent B ──▶ Ideas 2 ───┼──▶ Synthesize
              │                           │
              └─── Agent C ──▶ Ideas 3 ───┘

Perspectives: [Performance] [Security] [UX]
Example
# Parallel brainstorming agents
ideas = await asyncio.gather(
    agent("Refactor for performance"),
    agent("Refactor for readability"),
    agent("Refactor for testability")
)
best = synthesize(ideas)
Problem

Single agent gets stuck in local optimum or fails to explore diverse solutions for complex problems.

Solution

Spawn multiple agents with different perspectives to brainstorm in parallel, then synthesize best ideas.

When to use
  • Creative ideation tasks
  • Complex refactoring decisions
  • Design exploration
Trade-offs

Pros

  • Diverse perspectives
  • Avoids local optima

Cons

  • Higher token cost
  • Synthesis complexity

Multi-Model Orchestration for Complex Edits

Orchestration & Control
Flow
User Request ──▶ Retrieval Model ──▶ Generation Model ──▶ Apply Model ──▶ Done
                     │                    │                   │
                 (context)            (edits)            (patches)
Example
# Multi-model pipeline
context = retrieval_model.gather(codebase)
edits = generation_model.generate(
    context, user_request
)  # Claude Sonnet
apply_model.patch(edits)  # Custom
Problem

A single model may not be optimal for all sub-tasks in complex operations like multi-file code editing.

Solution

Pipeline multiple specialized models: retrieval model for context, generation model for edits, and application model for changes.

When to use
  • Multi-file code editing tasks
  • Complex operations requiring different skills
  • Tasks with distinct retrieval, generation, and application phases
Trade-offs

Pros

  • Leverages each model's strengths
  • More robust outcomes than single model

Cons

  • Orchestration complexity
  • Latency from multiple model calls

Plan-Then-Execute Pattern

Orchestration & Control
Flow
         PLAN PHASE              EXECUTE PHASE
             │                        │
Prompt ──▶ [LLM] ──▶ Plan ──▶ [Controller] ──▶ Results
             │        │              │
             │    ┌───┴───┐     ┌────┴────┐
             │    │ call1 │     │ run(1)  │
             │    │ call2 │────▶│ run(2)  │
             │    │ call3 │     │ run(3)  │
             │    └───────┘     └─────────┘
             │                       │
          frozen               no plan changes
Example
plan = LLM.make_plan(prompt)  # frozen
for call in plan:
    result = tools.run(call)
    stash(result)  # outputs isolated
# Tool outputs can't change which tools run
Problem

If tool outputs alter the choice of later actions, injected instructions can redirect the agent toward malicious steps

Solution

Split into Plan phase (fixed sequence before seeing untrusted data) and Execute phase (controller runs exact sequence)

When to use
  • Email & calendar bots
  • SQL assistants
  • Tasks where action set is known but params vary
Trade-offs

Pros

  • Strong control-flow integrity
  • 2-3x success rates for complex tasks

Cons

  • Output content can still be poisoned
  • Less flexible for dynamic tasks
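
Sketch

The frozen-plan controller in miniature; ToolCall and the tools mapping are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    params: dict

def run(prompt, make_plan, tools):
    plan = tuple(make_plan(prompt))     # plan fixed before any tool output
    results = []
    for call in plan:                   # controller follows the plan exactly
        results.append(tools[call.name](**call.params))
    return results                      # outputs never alter which tools run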

Self-Rewriting Meta-Prompt Loop

Orchestration
Flow
Episode ──▶ Reflect ──▶ Draft Delta ──▶ Validate ──┐
                                                    │
               ┌────────────────────────────────────┘
              ▼
        [System Prompt v1] ──▶ [System Prompt v2] ──▶ Next Episode
Example
dialogue = run_episode()
delta = LLM("Propose prompt edits", dialogue)
if passes_guardrails(delta):
    system_prompt += delta
    save(system_prompt)
Problem

Static system prompts become stale as agents encounter new tasks and edge cases

Solution

Let the agent rewrite its own system prompt after each interaction through reflection and validation

When to use
  • Agent encounters recurring failures
  • Prompt needs frequent minor tweaks
  • Continuous learning without human intervention
Trade-offs

Pros

  • Rapid adaptation to new scenarios
  • No human in the loop for minor tweaks

Cons

  • Risk of drift or jailbreak
  • Requires strong guardrails

Three-Stage Perception Architecture

Orchestration
Flow
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  PERCEPTION   │   │  PROCESSING   │   │    ACTION     │
│───────────────│   │───────────────│   │───────────────│
│ Text/Image/   │   │  Reasoning    │   │ API Calls     │
│ Audio Input   │──▶│  Decision     │──▶│ File Ops      │
│ Normalization │   │  Validation   │   │ Notifications │
└───────────────┘   └───────────────┘   └───────────────┘
                            │
                            ▼
                    [ Feedback Loop ]
Example
class ThreeStageAgent:
    async def run(self, raw_input):
        # Stage 1: Perception
        data = await self.perception.process(raw_input)
        # Stage 2: Processing
        decisions = await self.processor.analyze(data)
        # Stage 3: Action
        return await self.action.execute(decisions)
Problem

Monolithic agents mixing perception, reasoning, and action are hard to debug, extend, or scale independently.

Solution

Separate workflow into three stages: Perception (input normalization), Processing (reasoning), and Action (execution).

When to use
  • Complex multi-modal inputs
  • Need independent component scaling
  • Team collaboration per stage
Trade-offs

Pros

  • Clean separation of concerns
  • Better error isolation

Cons

  • Additional complexity for simple tasks
  • Latency from stage transitions

Agent-Driven Research

Orchestration & Control
Flow
Question ──▶ Formulate Query ──▶ Search ──▶ Analyze
                    ▲                          │
                    │    Sufficient? ◀─────────┘
                    │         │
                    │    No   │ Yes
                    └─────────┘   │
                                  ▼
                            Synthesize & Present
Example
while not agent.satisfied():
    query = agent.formulate_query(question)
    results = search_tool.execute(query)
    agent.analyze(results)
    agent.refine_strategy()
return agent.synthesize_findings()
Problem

Traditional research lacks the ability to dynamically adjust the search strategy based on new findings

Solution

Let the agent independently formulate queries, execute searches, analyze results, and iterate until satisfied

When to use
  • Complex research questions
  • Gathering information from multiple sources
  • Adaptive searching is needed
Trade-offs

Pros

  • Adaptive search strategy
  • Autonomous iteration

Cons

  • May miss obvious sources
  • Token cost of repeated iterations

Discrete Phase Separation

Orchestration & Control
Flow
┌──────────┐    Findings    ┌──────────┐    Plan    ┌──────────┐
│ Research │ ────────────▶  │ Planning │ ────────▶  │ Execute  │
│  (Opus)  │                │  (Opus)  │            │ (Sonnet) │
└──────────┘                └──────────┘            └──────────┘
     ↓                           ↓                       ↓
[Fresh ctx]               [Fresh ctx]              [Fresh ctx]

Key: Pass distilled conclusions, NOT full history
Example
# Phase 1: Research (new conversation)
findings = opus.research("OAuth flows in codebase")

# Phase 2: Plan (new conversation)
plan = opus.plan(findings, "Add Google OAuth")

# Phase 3: Execute (new conversation)
sonnet.implement(plan, step=1)
Problem

Simultaneous research, planning, and implementation causes context contamination and degraded output.

Solution

Break workflow into isolated phases (Research, Plan, Execute) with clean handoffs of distilled conclusions.

When to use
  • Complex features needing background research
  • Refactoring projects with architectural decisions
  • Mixing research and implementation hurts quality
Trade-offs

Pros

  • Higher quality per phase
  • Prevents context contamination
  • Leverages model-specific strengths

Cons

  • Requires explicit phase management
  • Feels slower for simple tasks
  • Higher total token usage

Dual LLM Pattern

Orchestration
Flow
┌─────────────────┐      ┌─────────────────┐
│  Quarantined    │      │   Privileged    │
│     LLM         │      │      LLM        │
│                 │      │                 │
│ [Reads Data]    │ $VAR │ [Plans + Tools] │
│ [No Tools]      │ ───▶ │ [No Raw Data]   │
└─────────────────┘      └────────┬────────┘
        ▲                         │
        │                         ▼
   Untrusted              execute(plan, $VAR)
     Input
Example
var1 = QuarantineLLM("extract email", text)
# Returns symbolic: $VAR1

plan = PrivLLM.plan("send $VAR1 to boss")
# No raw text exposure

execute(plan, subst={"$VAR1": var1})
Problem

A privileged agent that sees untrusted text AND wields tools can be coerced into dangerous calls.

Solution

Split into Privileged LLM (plans/tools, no raw data) and Quarantined LLM (reads data, no tools), passing data as symbolic variables.

When to use
  • Email/calendar assistants
  • Booking agents handling user data
  • API-powered chatbots
Trade-offs

Pros

  • Clear trust boundary
  • Compatible with static analysis
  • Prevents injection attacks

Cons

  • Increased complexity
  • Debugging across two models
  • Variable mapping overhead
View Original →

Inference-Time Scaling

Orchestration & Control
Flow
Problem ──▶ Assess Difficulty
                    │
      ┌─────────────┼─────────────┐
      ▼             ▼             ▼
    [Low]       [Medium]       [High]
      │             │             │
  Standard    N Attempts    Deep Search
      │             │             │
      └─────────────┴─────────────┘
                    │
               Select Best ──▶ Answer
flowchart TD A[Problem] --> B{Difficulty?} B -->|Low| C[Standard] B -->|Medium| D[N Attempts] B -->|High| E[Deep Reasoning] C & D & E --> F[Select Best] F --> G[Answer]
Example
def solve_with_scaling(problem, budget=100):
    difficulty = estimate_difficulty(problem)
    if difficulty < 0.3:
        return standard_inference(problem)
    elif difficulty < 0.7:
        return best_of_n(problem, n=5)
    else:
        return deep_reasoning_with_search(problem)
Problem

Once a model is trained, its performance is fixed; it cannot 'think harder' by allocating more compute to challenging problems.

Solution

Allocate additional inference compute dynamically: generate multiple candidates, perform extended reasoning, iterate and refine outputs.
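The best_of_n() call in the example above could be as simple as this sketch, assuming generate() and score() helpers:
def best_of_n(problem, generate, score, n=5):
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=score)  # keep the highest-scoring attempt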

When to use
  • Complex reasoning tasks
  • Problems with verifiable solutions
  • When latency is acceptable
Trade-offs

Pros

  • Dramatically improves hard tasks
  • More cost-effective than larger models

Cons

  • Increased latency
  • Higher inference costs
View Original →

Language Agent Tree Search (LATS)

Orchestration & Control
Flow
        [Root]
       /   |   \
     [A]  [B]  [C]    ← Expand candidates
      |    |
    [A1] [B1]         ← UCB select best
      |
    [A1a]             ← Evaluate & backprop

Select ──▶ Expand ──▶ Evaluate ──▶ Backprop
flowchart TB R[Root] --> A[Path A] R --> B[Path B] A --> A1[A.1] A1 --> A2[A.2 Best] B --> B1[B.1] style A2 fill:#90EE90
Example
def search(root, iterations=50):
    for _ in range(iterations):
        node = select(root)  # UCB
        children = expand(node)  # LLM
        value = evaluate(node)  # LLM
        backpropagate(node, value)
    return best_path(root)
Problem

Linear reasoning (ReAct) gets stuck in local optima on complex tasks requiring strategic exploration.

Solution

Apply Monte Carlo Tree Search (MCTS) with LLM for action generation, evaluation, and backpropagation.
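The select() step from the example above might look like this sketch; the node fields (visits, value, children) are assumptions about your tree structure:
import math

def ucb_select(node, c=1.4):
    while node.children:  # descend to the most promising leaf
        node = max(node.children, key=lambda ch:
                   ch.value / (ch.visits + 1e-9) +
                   c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
    return node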

When to use
  • Multi-step reasoning
  • Strategic planning
  • Multiple valid approaches
Trade-offs

Pros

  • Systematic exploration
  • Outperforms ReAct

Cons

  • High LLM call cost
  • Parameter tuning
View Original →

Opponent Processor / Multi-Agent Debate

Orchestration & Control
Flow
Task ──┬──▶ [Advocate] ──▶ Propose ──┐
       │                             │
       └──▶ [Critic]   ──▶ Challenge ┼──▶ Debate ──▶ Synthesis
                                     │
                              (Iterate)
flowchart LR A[Task] --> B[Advocate] A --> C[Critic] B --> D[Debate] C --> D D --> E[Synthesized Decision]
Example
# Debate pattern for expenses
advocate = Agent(role="user_advocate")
auditor = Agent(role="company_auditor")

proposal = advocate.classify(expense)
challenge = auditor.review(proposal)
final = synthesize(proposal, challenge)
Problem

Single-agent decisions suffer from confirmation bias, limited perspectives, and unexamined assumptions.

Solution

Spawn opposing agents with different goals to debate each other, surfacing blind spots and unconsidered alternatives.

When to use
  • Decisions requiring balanced perspectives
  • High-stakes choices needing scrutiny
  • Tasks prone to confirmation bias
Trade-offs

Pros

  • Reduces bias through adversarial pressure
  • Surfaces blind spots and trade-offs

Cons

  • 2x+ token cost
  • May deadlock without resolution mechanism
View Original →

Progressive Autonomy with Model Evolution

Orchestration & Control
Flow
Model v1 ──▶ Heavy Scaffolding ──▶ [2000 tokens prompt]
    │
    ▼ (model upgrade)
Model v2 ──▶ Audit & Remove ──▶ [500 tokens prompt]
    │
    ▼ (model upgrade)
Model v3 ──▶ Minimal Prompt ──▶ [100 tokens prompt]
flowchart LR A[Model v1] --> B[Heavy Scaffolding] B --> C[Model v2 Released] C --> D[Remove Unnecessary] D --> E[Model v3 Released] E --> F[Minimal Prompt]
Example
# Before: 2000 tokens of instructions
system_prompt_v1 = """Check file exists, read contents,
plan changes, make minimal edits, verify syntax..."""

# After: Model internalized the steps
system_prompt_v2 = "Write clean, tested code."
Problem

Agent scaffolding built for older models becomes unnecessary overhead as models improve, creating prompt bloat, wasted tokens, and maintenance burden.

Solution

Actively remove scaffolding as models become more capable. Regularly audit system prompts and orchestration logic to eliminate what newer models have internalized.

When to use
  • New model releases available
  • System prompts exceeding necessary length
  • Complex orchestration for simple tasks
Trade-offs

Pros

  • Reduced token costs
  • Faster execution
  • Simpler maintenance

Cons

  • Requires testing for quality validation
  • May need different configs per model version
  • Loss of explicit control
View Original →

Specification-Driven Agent Development

Orchestration
Flow
Spec File ──▶ Parse ──▶ Task Graph ──▶ Scaffold
   (MD/JSON)                            │
                                        ▼
                        Generated Code ◀── [links to spec clause]
flowchart LR A[Spec File] --> B[Parse Spec] B --> C[Build Task Graph] C --> D[Scaffold Code] D --> E[Link to Spec Clauses] E --> F[Iterate via Spec Edits]
Example
if new_feature_requested:
    write_spec(update)
    agent.sync_with(spec)
# All artifacts link back to spec clauses
Problem

Hand-crafted prompts leave room for ambiguity; agents can over-interpret or conflict with stakeholder intent

Solution

Use a formal spec file as the agent's primary input and source of truth, iterating only by editing the spec
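One way the spec-to-tasks step could look, assuming a JSON spec with id/requirement fields per clause (field names are illustrative):
import json

def tasks_from_spec(path):
    with open(path) as f:
        spec = json.load(f)
    # Every generated task carries its clause id for traceability
    return [{"clause": c["id"], "prompt": f"Implement: {c['requirement']}"}
            for c in spec["clauses"]]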

When to use
  • Complex multi-step feature development
  • Audit-friendly, repeatable workflows
  • Team collaboration requiring clear contracts
Trade-offs

Pros

  • Repeatable, audit-friendly, easy diffing
  • Clear artifact traceability to spec

Cons

  • Up-front spec writing effort
  • Ramp-up for teams new to spec formats
View Original →

Tool Capability Compartmentalization

Orchestration
Flow
        Monolithic Tool (RISKY)          Compartmentalized (SAFE)
        ┌─────────────────────┐           ┌─────────┐
        │  read + fetch +     │           │ READER  │ fs:read-only
        │  write ALL-IN-ONE   │    ──▶    └────┬────┘
        │  🔓 max surface     │                │  consent
        └─────────────────────┘           ┌────▼────┐
                                          │PROCESSOR│ net:allowlist
                                          └────┬────┘
                                               │  consent
                                          ┌────▼────┐
                                          │ WRITER  │ scoped perms
                                          └─────────┘
flowchart LR subgraph Old[Monolithic] A[Read+Fetch+Write] end subgraph New[Compartmentalized] R[Reader] -->|consent| P[Processor] P -->|consent| W[Writer] end Old --> New
Example
# tool-manifest.yml
email_reader:
  capabilities: [private_data, untrusted_input]
  permissions:
    fs: read-only:/mail
    net: none
issue_creator:
  capabilities: [external_comm]
  permissions:
    net: allowlist:github.com
Problem

Mix-and-match tools combining data readers, web fetchers, and writers amplify prompt-injection attack chains.

Solution

Split tools into reader/processor/writer micro-tools with isolated permissions and explicit per-call consent.

When to use
  • Tools handle private data
  • Multiple capability types combined
  • Security is high priority
Trade-offs

Pros

  • Fine-grained security control
  • Plays well with modular architectures

Cons

  • More tooling overhead
  • Permission creep over time
View Original →

Disposable Scaffolding Over Durable Features

Orchestration & Control
Flow
┌─────────────────────────────────────────────────┐
│           THE BITTER LESSON CYCLE               │
│                                                 │
│  New Model ──▶ Eval Scaffolding ──▶ Obsolete?  │
│       ▲               │                │        │
│       │          Still needed?     Discard      │
│       │               │                │        │
│       │          ──▶ Keep minimal ◀────┘        │
│       └─────────────────────────────────────────│
│                                                 │
│  Mindset: Build simple, throw away, repeat     │
└─────────────────────────────────────────────────┘
flowchart TD A[New Model Release] --> B{Eval Scaffolding} B -->|Obsolete| C[Discard] B -->|Still needed| D[Keep minimal] C --> E[Rebuild lightweight] D --> E E --> F[Focus on core value] F --> A
Example
# Instead of 3-month robust solution:
def quick_context_compressor(text):
    """Expect this to be obsolete by Q2"""
    return simple_summarize(text)

# Focus engineering on unique value
# that won't "fall into the model"
Problem

Complex features built around models become obsolete when next-gen models perform those tasks natively.

Solution

Treat tooling as temporary scaffolding; build simplest solutions expecting they will be discarded.

When to use
  • Building model-compensating features
  • Rapid model improvement cycles
  • Need product agility over long-term investment
Trade-offs

Pros

  • Fast adaptation to new models
  • Lower engineering investment risk
  • Focus on unique value, not workarounds

Cons

  • May feel wasteful
  • Requires discipline to avoid over-engineering
  • Tech debt accumulates if not cleaned
View Original →

Explicit Posterior-Sampling Planner

Orchestration
Flow
┌─────────────────────────────────────┐
│         PSRL Loop                   │
└─────────────────────────────────────┘

Posterior ──▶ Sample Model ──▶ Compute Plan
    ▲                              │
    │                              ▼
    └──── Update ◀── Observe ◀── Execute

P(model|data) → sample → plan → act → reward
flowchart LR P[Posterior P(M|D)] --> S[Sample Model] S --> C[Compute Plan] C --> E[Execute] E --> O[Observe Reward] O --> U[Update Posterior] U --> P
Example
# PSRL-based agent loop
posterior = init_prior(task_models)

while not done:
    model = posterior.sample()
    plan = compute_optimal_plan(model)
    reward = execute(plan)
    posterior.update(observation, reward)
    # Natural language: LLM fills each step
Problem

Agents relying on ad-hoc heuristics explore poorly, wasting tokens and API calls on dead ends.

Solution

Embed PSRL algorithm: maintain Bayesian posterior over task models, sample model, compute optimal plan, execute, update posterior.
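A concrete PSRL instance for a Bernoulli bandit, purely illustrative of the sample-plan-act-update cycle in the example above:
import random

def psrl_bandit(pull_arm, n_arms=3, steps=100):
    a, b = [1] * n_arms, [1] * n_arms           # Beta(1,1) posteriors per arm
    for _ in range(steps):
        sampled = [random.betavariate(a[i], b[i]) for i in range(n_arms)]
        arm = sampled.index(max(sampled))        # plan = greedy on sampled model
        reward = pull_arm(arm)                   # execute, observe 0/1 reward
        a[arm] += reward
        b[arm] += 1 - reward                     # update posterior
    return a, b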

When to use
  • Exploration-heavy tasks
  • Multi-step decision making
  • Need principled exploration
Trade-offs

Pros

  • Principled exploration
  • Efficient token usage
  • Theoretical guarantees

Cons

  • Implementation complexity
  • Posterior computation cost
  • Requires RL expertise
View Original →

Initializer-Maintainer Dual Agent

Orchestration & Control
Flow
┌─── ONCE ─────────────────────────────────┐
│ Initializer ──▶ features.json            │
│              ──▶ init.sh                 │
│              ──▶ progress.txt            │
│              ──▶ First Commit            │
└──────────────────────────────────────────┘
                    ▼
┌─── EACH SESSION ─────────────────────────┐
│ Maintainer ──▶ Read git/progress         │
│            ──▶ Select next feature       │
│            ──▶ Implement + Test          │
│            ──▶ Commit + Update progress  │
└──────────────────────────────────────────┘
sequenceDiagram participant Init as Initializer participant FS as Filesystem participant Code as Coding Agent Note over Init: Runs ONCE Init->>FS: feature-list.json Init->>FS: init.sh, progress.txt Note over Code: Runs EACH session loop Session N Code->>FS: Read progress Code->>Code: Implement feature Code->>FS: Update & commit end
Example
# Initializer creates foundation
project/
  feature-list.json  # All features, passes=false
  progress.txt       # Running log
  init.sh            # One-command bootstrap

# Maintainer session ritual
$ ./init.sh && read_progress && implement_next
Problem

Single-agent approaches either over-engineer each session (wasting setup time) or under-invest in foundations (causing drift and confusion).

Solution

Use two specialized agents: Initializer creates foundations once (features, env, tracking); Maintainer handles incremental development across sessions.

When to use
  • Projects requiring many sessions
  • Complex applications with 50+ features
  • When context loss is costly
Trade-offs

Pros

  • Clear separation of setup vs execution
  • Prevents context loss across sessions

Cons

  • Requires upfront specification
  • Two configs to maintain
View Original →

LLM Map-Reduce Pattern

Orchestration & Control
Flow
        MAP                    REDUCE
[Doc1] ──▶ Sandbox LLM ──┐
[Doc2] ──▶ Sandbox LLM ──┼──▶ Aggregate ──▶ Result
[Doc3] ──▶ Sandbox LLM ──┘
           (isolated)    (safe summaries only)
flowchart LR D1[Doc 1] --> S1[Sandbox] D2[Doc 2] --> S2[Sandbox] D3[Doc 3] --> S3[Sandbox] S1 --> R[Reduce] S2 --> R S3 --> R R --> O[Output]
Example
results = []
for doc in untrusted_docs:
    # Sandboxed: constrained output
    ok = sandbox_llm("Invoice? yes/no", doc)
    results.append(ok)
# Reduce: no raw docs here
final = reduce(results)
Problem

A single poisoned document can manipulate global reasoning if all data is processed in one context.

Solution

Map: sandboxed LLMs process each doc independently. Reduce: aggregate only sanitized outputs.
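A parallel sketch of the map step; sandbox_llm is the same hypothetical constrained-output call as in the example above:
from concurrent.futures import ThreadPoolExecutor

def map_reduce(docs, sandbox_llm, reduce_fn):
    with ThreadPoolExecutor() as pool:  # each doc in its own sandboxed call
        verdicts = list(pool.map(lambda d: sandbox_llm("Invoice? yes/no", d), docs))
    return reduce_fn(verdicts)          # reducer never sees raw documents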

When to use
  • Processing untrusted docs
  • File triage/classification
  • N-to-1 decisions
Trade-offs

Pros

  • Poisoned item isolated
  • Scalable parallelism

Cons

  • Output validation needed
  • Orchestration overhead
View Original →

Oracle and Worker Multi-Model

Orchestration & Control
Flow
Request ──▶ Worker(Sonnet) ──┬──▶ Execute ──▶ Done
                             │
                        [Stuck?]
                             │
                             └──▶ Oracle(o3) ──▶ Strategy ──▶ Worker
flowchart LR A[Request] --> B[Worker] B --> C{Stuck?} C -->|No| D[Execute] C -->|Yes| E[Oracle] E --> F[Strategy] F --> B D --> G[Done]
Example
worker = Agent(model="sonnet-4")
oracle = Agent(model="o3")

result = worker.execute(task)
if worker.is_stuck():
    strategy = oracle.consult(context)
    result = worker.execute(strategy)
Problem

A single model creates a trade-off between capability and cost. Powerful models are expensive for routine tasks.

Solution

Two-tier system: fast Worker (Sonnet) handles bulk tasks, expensive Oracle (o3/Gemini) reserved for high-level reasoning and debugging.

When to use
  • Mix of routine and complex tasks
  • Cost optimization is important
  • Tasks where worker may get stuck
Trade-offs

Pros

  • Cost-efficient use of frontier models
  • Specialized AI team approach

Cons

  • Orchestration complexity
  • Latency from model switching
View Original →

Progressive Complexity Escalation

Orchestration & Control
Flow
Tier 1: Research ──▶ Present to Human
           │ (proven reliable)
           ▼
Tier 2: Research ──▶ Draft ──▶ Human Approves
           │ (proven reliable)
           ▼
Tier 3: Research ──▶ Draft ──▶ Auto-Send (if conf > 0.8)
flowchart TD A[Tier 1: Info Gathering] --> B{Reliable?} B -->|Yes| C[Tier 2: Draft + Approval] C --> D{Reliable?} D -->|Yes| E[Tier 3: Autonomous]
Example
class AgentCapabilities:
    def process(self, data):
        research = self.research(data)  # Tier 1
        if self.tier >= 2:
            draft = self.generate(research)
            if self.tier >= 3 and self.confidence > 0.8:
                return self.auto_execute(draft)
            return self.request_approval(draft)
        return self.present_findings(research)
Problem

Deploying agents with overly ambitious capabilities from day one leads to unreliable outputs, failed implementations, and safety risks from autonomous high-stakes operations.

Solution

Start with low-complexity, high-reliability tasks and progressively unlock more complex capabilities as models improve and trust is established through capability tiers.

When to use
  • Deploying agents into production
  • High-stakes or regulated domains
  • Building internal automation tools
Trade-offs

Pros

  • Risk mitigation via limited blast radius
  • Builds stakeholder confidence
  • Graceful degradation

Cons

  • Delayed full automation benefits
  • Tier management complexity
  • Promotion friction
View Original →

Stop Hook Auto-Continue Pattern

Orchestration
Flow
Agent Turn ──▶ Stop Hook ──▶ Run Tests ──┐
                                         │
          ┌──────────────────────────────┘
          │ Fail?
          ▼
   Auto-Continue ──▶ Agent Fixes ──▶ [loop until pass]
flowchart LR A[Agent Turn End] --> B[Stop Hook] B --> C{Tests Pass?} C -->|Yes| D[Return to User] C -->|No| E[Continue Agent] E --> A
Example
// hooks config
{
  "on_stop": {
    "command": "./check_tests.sh",
    "auto_continue_on_failure": true
  }
}
Problem

Agents complete turns even when tasks are not truly done (tests fail, checks incomplete)

Solution

Use stop hooks to check success criteria after each turn; auto-continue agent until criteria pass
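A host-side sketch of the loop with an iteration cap against the infinite-loop risk noted under Cons; agent.turn() is a hypothetical single-turn call:
import subprocess

def run_until_green(agent, check_cmd="./check_tests.sh", max_turns=10):
    prompt = "Complete the task."
    for _ in range(max_turns):
        agent.turn(prompt)
        if subprocess.run([check_cmd]).returncode == 0:
            return True                       # criteria met, hand back to user
        prompt = "Tests still failing; continue fixing."
    return False                              # cap reached, stop burning tokens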

When to use
  • Test-driven development workflows
  • Autonomous task completion required
  • Sandboxed/containerized environments
Trade-offs

Pros

  • True task completion guaranteed
  • No manual re-prompting needed

Cons

  • Risk of infinite loops
  • Runaway costs without timeout
View Original →

Tree-of-Thought Reasoning

Orchestration
Flow
                    [Problem]
                        │
            ┌───────────┼───────────┐
            ▼           ▼           ▼
        [Step A]    [Step B]    [Step C]
           │           │           │
       ┌───┴───┐   ┌───┴───┐       ✗
       ▼       ▼   ▼       ▼
    [A1]    [A2] [B1]    [B2]  ← evaluate
       │       ✗   ✓       ✗
       ▼           │
    [A1.1]    [best path]
flowchart TD P[Problem] --> A[Step A] P --> B[Step B] P --> C[Step C] A --> A1[A1] --> A11[A1.1] A --> A2[A2] B --> B1[B1 Best] B --> B2[B2] C --> X[Pruned]
Example
import heapq, itertools

def tree_of_thought(root, expand, evaluate, limit=50):
    tie = itertools.count()          # tiebreaker: thoughts may not be comparable
    heap, best = [(0.0, next(tie), root)], (float("-inf"), root)
    for _ in range(limit):
        if not heap:
            break
        _, _, thought = heapq.heappop(heap)
        for step in expand(thought):
            score = evaluate(step)
            best = max(best, (score, step), key=lambda t: t[0])
            heapq.heappush(heap, (-score, next(tie), step))
    return best[1]
Problem

Linear chain-of-thought reasoning gets stuck on complex problems, missing alternatives or failing to backtrack.

Solution

Explore a search tree of intermediate thoughts, expand multiple steps, evaluate partial solutions before committing.

When to use
  • Complex puzzles or planning tasks
  • Multiple valid approaches exist
  • Backtracking may be needed
Trade-offs

Pros

  • Covers more possibilities
  • Improves reliability on hard tasks

Cons

  • Higher compute cost
  • Needs good scoring method
View Original →

Inversion of Control

Orchestration & Control
Flow
Human: "Refactor UploadService to async"
       │
       ▼
┌─────────────────────────────────────────┐
│          Agent Decides HOW              │
│  ┌─────┐   ┌─────┐   ┌─────┐           │
│  │grep │──▶│edit │──▶│test │──▶ ...    │
│  └─────┘   └─────┘   └─────┘           │
│      (Agent orchestrates 87%)           │
└─────────────────────────────────────────┘
       │
       ▼
Human: Review PR (3%)
sequenceDiagram Dev->>Agent: "Refactor UploadService to async" Agent->>Repo: git grep "UploadService" Agent->>Tools: edit_file Agent->>Tools: run_tests Agent-->>Dev: PR with green CI
Example
# Human provides goal, not steps
agent.run(
    goal="Refactor UploadService to async",
    tools=[grep, edit_file, run_tests, git],
    guardrails=["no prod DB access", "tests must pass"]
)
# Agent decides: grep -> analyze -> edit -> test -> PR
Problem

Traditional 'prompt-as-puppeteer' workflows force humans to spell out every step, limiting scale and creativity.

Solution

Give the agent tools plus a high-level goal and let it decide orchestration. Humans supply guard-rails (10%) while the agent handles execution (87%).

When to use
  • Complex tasks with unclear sequence
  • Agent has necessary tools
  • Human oversight available
Trade-offs

Pros

  • Scales without step-by-step prompting
  • Unleashes agent creativity

Cons

  • Requires trust in agent judgment
  • May take unexpected paths
View Original →

Parallel Tool Call Learning

Orchestration & Control
Flow
BEFORE (Sequential):  T1 ──▶ T2 ──▶ T3 ──▶ T4  (7s)

AFTER (RL-Learned Parallel):
    ┌── T1 ──┐
    ├── T2 ──┼──▶ Results ──▶ T4 ──▶ Done  (3.5s)
    └── T3 ──┘
flowchart LR A[Agent] --> B[Batch 1: Parallel] B --> C[T1] B --> D[T2] B --> E[T3] C & D & E --> F[Aggregate] F --> G[Batch 2] G --> H[Done]
Example
# RL learns parallel patterns naturally
# No explicit parallelization code needed
job = client.fine_tuning.jobs.create(
    model="gpt-4o",
    method="rft",
    rft={"tools": tools, "grader": grader}
)
# Model discovers: batch independent calls
Problem

Agents execute tool calls sequentially even when they could run in parallel, causing unnecessary latency.

Solution

Use Agent RFT to teach models to parallelize independent tool calls, reducing latency by 40-50% when tool execution is fast.

When to use
  • Tool execution faster than inference
  • Independent information gathering
  • Broad exploration phases
Trade-offs

Pros

  • 40-50% latency reduction
  • Emerges naturally from RL training

Cons

  • Requires concurrent tool infrastructure
  • Higher peak resource usage
View Original →

Swarm Migration Pattern

Orchestration
Flow
Main ──▶ Scan (100 files) ──▶ Todo List ──▶ Spawn 10 Agents
                                              │
     ┌──────────────────────────────────────────┘
     │  [10 files each, parallel]
     ▼
 Verify All ──▶ Consolidated PR
flowchart TD A[Main Agent] --> B[Scan Codebase] B --> C[Create Todo: 100 files] C --> D[Spawn 10 Subagents] D --> E1[Agent 1: 1-10] D --> E2[Agent 2: 11-20] D --> E3[...] E1 --> F[Verify] E2 --> F F --> G[Merged PR]
Example
files = find("*.test.js", old_framework)
for batch in chunk(files, 10):
    spawn_agent(
        task=f"Migrate {batch} to new framework",
        auto_commit=True
    )
Problem

Large-scale code migrations (framework upgrades, lint fixes) are slow when done sequentially

Solution

Main agent orchestrates 10+ parallel subagents, each migrating a batch of files in map-reduce fashion

When to use
  • Framework migrations (Jest to Vitest, etc.)
  • Lint rule rollouts across many files
  • API updates or code modernization
Trade-offs

Pros

  • 10x+ speedup via parallelization
  • Fault isolation per batch

Cons

  • High token cost for parallel agents
  • Potential merge conflicts
View Original →

Conditional Parallel Tool Execution

Orchestration & Control
Flow
Tools ──▶ Classify ──┬── Read-Only? ── Parallel ──┐
                     │                            │
                     └── Has-Write? ── Sequential ┴──▶ Results
flowchart LR A[Tool Batch] --> B{All Read-Only?} B -->|Yes| C[Execute Parallel] B -->|No| D[Execute Sequential] C --> E[Results] D --> E
Example
def execute_batch(tools):
    if all(t.is_read_only for t in tools):
        return parallel_execute(tools)
    else:
        return sequential_execute(tools)

# FileRead, Grep: parallel
# FileWrite: sequential
Problem

Sequential tool execution causes delays for read-only operations, but parallel execution risks race conditions for state-modifying tools.

Solution

Classify tools as read-only or state-modifying. Execute read-only batches in parallel, serialize state-modifying operations.
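A runnable variant of the example above, assuming each tool object exposes is_read_only and run() (names are assumptions):
from concurrent.futures import ThreadPoolExecutor

def execute_batch(tools):
    if all(t.is_read_only for t in tools):
        with ThreadPoolExecutor() as pool:   # reads are safe to parallelize
            return list(pool.map(lambda t: t.run(), tools))
    return [t.run() for t in tools]          # any write serializes the batch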

When to use
  • Mix of read and write operations
  • Performance-sensitive applications
  • Batch tool calls in single reasoning step
Trade-offs

Pros

  • Fast parallel reads, safe sequential writes
  • Prevents race conditions

Cons

  • Single write in batch serializes everything
  • Relies on accurate tool classification
View Original →

Anti-Reward-Hacking Grader Design

Reliability & Eval
Flow
Answer ──▶ ┌──────────────────┐
           │  Multi-Criteria  │
           │     Grader       │
           └────────┬─────────┘
                    │
     ┌──────────────┼──────────────┐
     ▼              ▼              ▼
Correctness    Reasoning     Citations
  (0.50)        (0.20)        (0.10)
     │              │              │
     └──────────────┴──────────────┘
                    │
              Weighted Sum ──▶ Final Score
flowchart TD A[Answer + Trace] --> B{Gaming Pattern?} B -->|Yes| C[Score: 0.0] B -->|No| D[Multi-Criteria Eval] D --> E[Correctness 0.50] D --> F[Reasoning 0.20] D --> G[Citations 0.10] E & F & G --> H[Weighted Sum] H --> I[Final Score]
Example
def grade(answer, trace, subscores):
    weights = {
        'correctness': 0.50,
        'reasoning': 0.20,
        'completeness': 0.15,
        'citations': 0.10,
        'formatting': 0.05
    }
    # Check gaming patterns first
    if gaming_pattern_detected(trace):
        return {"score": 0.0, "violation": True}
    # Multi-criteria weighted score
    return {"score": sum(w * subscores[k] for k, w in weights.items())}
Problem

RL models exploit grader loopholes to maximize reward without actually solving tasks, leading to 100% validation scores but poor real-world performance.

Solution

Design multi-criteria graders with iterative hardening, weighted subscores, and explicit gaming pattern detection.

When to use
  • Training agents with reinforcement learning
  • Initial grader shows suspiciously high scores
  • Production performance doesn't match validation metrics
Trade-offs

Pros

  • Robust learning - models solve tasks, not game metrics
  • Better generalization via multi-criteria evaluation
  • Debuggable subscores for identifying struggles

Cons

  • Engineering effort for careful design and iteration
  • Slower convergence due to harder grading
  • Computational cost for multi-criteria evaluation
View Original →

CLI-First Skill Design

Tool Use
Flow
                      ┌── Human: $ skill.sh list
                      │
Skill Logic ──▶ CLI ──┼── Agent: Bash("skill.sh list")
                      │
                      └── Cron: */5 * * * * skill.sh sync

Unix Philosophy: One tool ──▶ One task ──▶ Compose with pipes
flowchart LR A[Skill Logic] --> B[CLI Interface] B --> C[Human: Terminal] B --> D[Agent: Bash Tool] B --> E[Scripts: Automation] B --> F[Cron: Scheduled]
Example
#!/bin/bash
# trello.sh - CLI-first skill
case "$1" in
  boards) curl -s "$API/boards" ;;
  cards)  curl -s "$API/boards/$2/cards" ;;
  create) curl -X POST "$API/cards" -d "name=$3" ;;
esac

# Human: trello.sh boards | jq '.name'
# Agent:  Bash("trello.sh cards abc123")
Problem

API-first is hard to debug and GUI-first is unusable by agents, forcing teams to build and maintain two separate interfaces

Solution

Design every skill as a CLI tool, so humans use it from the terminal and agents use it through the Bash tool, identically

When to use
  • Building skills used by both humans and agents
  • Easy debugging and testing matters
  • Composition with Unix tools and pipes is needed
  • Runtime dependencies must stay minimal
Trade-offs

Pros

  • Same tool serves humans and agents
  • Easy to debug and test
  • Composes with Unix tools

Cons

  • Awkward for complex data structures
  • Process-spawn overhead
  • Limited Windows compatibility
View Original →

CriticGPT-Style Evaluation

Reliability & Eval
Flow
Generator ──▶ Code ──▶ CriticGPT ──▶ Issues?
                              │
                    ┌─────────┴─────────┐
                    │                   │
                   Yes                  No
                    │                   │
              ◀── Refine           ──▶ Human Review
sequenceDiagram Generator->>Critic: Submit code loop Until passes Critic->>Critic: Bug + Security scan Critic->>Generator: Issues found Generator->>Critic: Refined code end Critic->>Human: Present for review
Example
critic = CriticGPT(severity_threshold=0.7)
review = critic.review_code(code)

if review['issues']:
    code = generator.refine(code, review)
else:
    submit_to_human(code, review)
Problem

Human reviewers struggle to catch subtle bugs in sophisticated AI-generated code at scale.

Solution

Deploy specialized critic models trained for code review to identify bugs, security issues, and quality problems.

When to use
  • High volume of AI-generated code
  • Security-critical applications
  • Need consistent quality standards
Trade-offs

Pros

  • Catches subtle bugs humans miss
  • Consistent 24/7 reviews
  • Scalable code review process

Cons

  • False positives need human verification
  • Cannot understand full business context
  • May miss novel vulnerability types
View Original →

Extended Coherence Work Sessions

Reliability & Eval
Flow
Coherence Window Evolution:

Early:  [██░░░░░░░░░░░░░░░░░░]  ~5 min
        └─ loses track quickly

Current: [██████████████░░░░░░]  ~hours
         └─ sustained focus

Future:  [████████████████████]  all-day
         └─ human-equivalent
gantt title Agent Coherence Over Time dateFormat X axisFormat %s section Early Models Short coherence :done, 0, 300 section Current Extended coherence :active, 300, 10800 section Future All-day coherence :future, 10800, 86400
Example
# Coherence doubles every 7 months
agent = ExtendedCoherenceAgent(
    context_window="200K tokens",
    state_management="persistent",
    session_length="hours"
)
# Can now handle multi-hour projects
Problem

Early AI agents lose coherence after a few minutes, limiting their utility for complex multi-stage tasks requiring sustained effort.

Solution

Use models and architectures designed to maintain coherence over hours through larger context windows and better state management.

When to use
  • Complex multi-step projects (hours of work)
  • Tasks requiring sustained context across stages
  • Prolonged problem-solving sessions
Trade-offs

Pros

  • Enables human-equivalent work sessions
  • Handles complex, multi-stage tasks

Cons

  • Requires advanced model architecture
  • Higher computational cost
View Original →

Lethal Trifecta Threat Model

Reliability & Eval
Flow
    [Private Data]
         /    \
        /      \
[Untrusted] ─── [External Comm]

All 3 = DANGER! Block at least one circle.
flowchart TB A[Private Data] --- B[Untrusted Input] B --- C[External Comm] C --- A style A fill:#ffcccc style B fill:#ffcccc style C fill:#ffcccc
Example
# Pre-execution policy check
if (tool.can_externally_communicate and
    tool.accesses_private_data and
    input_source == "untrusted"):
    raise SecurityError("Lethal trifecta!")
Problem

Combining private data + untrusted input + external communication enables prompt injection data exfiltration.

Solution

Audit every tool for these 3 capabilities and guarantee at least one is blocked in any execution path.

When to use
  • Agents with tool access
  • Processing untrusted data
  • Security-critical systems
Trade-offs

Pros

  • Simple mental model
  • Eliminates attack class

Cons

  • Limits all-in-one agents
  • Capability tagging effort
View Original →

Merged Code + Language Skill Model

Reliability & Eval
Flow
Base LLM ──┬──▶ Lang Specialist ──┐
           │                       │
           └──▶ Code Specialist ──┼──▶ Weight Merge ──▶ Unified Model
                                   │
              (Fisher Avg / α=0.5)
flowchart LR A[Base LLM] --> B[Lang Specialist] A --> C[Code Specialist] B --> D[Weight Merge] C --> D D --> E[Unified Model]
Example
# Merge two specialist checkpoints
python merge_models.py \
  --model_a lang-specialist.pt \
  --model_b code-specialist.pt \
  --output merged-agent.pt \
  --alpha 0.5
Problem

Training a single model for both code and natural language requires massive compute and risks skill interference.

Solution

Train separate specialist models independently, then merge weights to combine skills without centralized training.
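A sketch of simple linear merging, assuming both checkpoint files are plain PyTorch state dicts with identical keys (see the architecture caveat under Cons):
import torch

def merge_checkpoints(path_a, path_b, alpha=0.5):
    a, b = torch.load(path_a), torch.load(path_b)
    return {k: alpha * a[k] + (1 - alpha) * b[k] for k in a}  # per-tensor blend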

When to use
  • Building multi-skill models (code + NL)
  • Limited centralized compute resources
  • Parallel R&D teams for different capabilities
Trade-offs

Pros

  • Parallel development of skills
  • Reduced centralized compute needs

Cons

  • Potential skill dilution from averaging
  • Requires identical model architectures
View Original →

RLAIF (Reinforcement Learning from AI Feedback)

Reliability & Eval
Flow
                    ┌──────────────────┐
Prompt ──▶ Model ──▶│ Response A       │
                    │ Response B       │
                    └────────┬─────────┘
                             ▼
                    ┌──────────────────┐
                    │  AI Critic       │
                    │  + Constitution  │──▶ Preference (A > B)
                    └──────────────────┘
                             │
                             ▼
                    Train Reward Model
flowchart LR A[Prompt] --> B[Generate A, B] B --> C[AI Critic] C --> D[Compare with Principles] D --> E[Preference Data] E --> F[Train Reward Model]
Example
class RLAIF:
    def generate_preference(self, prompt, a, b):
        critique = f"""Given principles: {self.constitution}
Which response is better for "{prompt}"?
A: {a}  B: {b}
Choose and explain why."""
        return self.critic.generate(critique)
Problem

Traditional RLHF requires extensive human annotation for preference data, which is expensive ($1+ per annotation) and time-consuming, creating a bottleneck in training aligned AI systems.

Solution

Use AI models to generate preference feedback and evaluation data based on constitutional principles, reducing costs to less than $0.01 per annotation while maintaining quality.

When to use
  • Need large-scale preference data
  • Human annotation is too expensive
  • Training aligned AI systems
Trade-offs

Pros

  • 100x cheaper than human feedback
  • Unlimited scalability
  • More consistent than varying annotators

Cons

  • May amplify existing model biases
  • Cannot provide truly novel insights
  • Requires careful principle design
View Original →

Structured Output Specification

Reliability
Flow
Input ──▶ LLM + Schema ──▶ Validated Output ──┬──▶ DB
                                              ├──▶ API
                                              └──▶ Next Agent
flowchart LR A[Agent Input] --> B[LLM + Schema] B --> C[Validated Output] C --> D[Downstream System] C --> E[Database] C --> F[Next Agent]
Example
const schema = z.object({
  category: z.enum(['spam', 'legit']),
  confidence: z.number().min(0).max(1)
});
const result = await generateObject({
  model, schema, prompt
});
Problem

Free-form agent outputs are hard to validate, parse, and integrate with downstream systems

Solution

Constrain outputs using deterministic schemas (JSON Schema, Zod, Pydantic) enforced at generation time
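A Pydantic equivalent of the Zod schema in the example above, as one sketch; raw_json stands for the model's output string:
from typing import Literal
from pydantic import BaseModel, Field

class Classification(BaseModel):
    category: Literal["spam", "legit"]
    confidence: float = Field(ge=0, le=1)

result = Classification.model_validate_json(raw_json)  # raises if output drifts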

When to use
  • Multi-phase agent workflows
  • Classification/categorization tasks
  • Integration with databases or APIs
Trade-offs

Pros

  • Guaranteed parseable outputs
  • Type safety and validation

Cons

  • Rigidity limits free-form responses
  • Schema evolution friction
View Original →

Versioned Constitution Governance

Reliability
Flow
Agent ──▶ Propose Change ──▶ Git PR
                                  │
                           ┌──────▼──────┐
                           │   CI Check  │
                           │ - Signed?   │
                           │ - Policy OK?│
                           └──────┬──────┘
                                  │
                     ┌────────────┼────────────┐
                     ▼            ▼            ▼
                  [PASS]      [REVIEW]      [REJECT]
                     │            │
                     ▼            ▼
               Gatekeeper ──▶ Merge
flowchart LR A[Agent] -->|propose| B[Git PR] B --> C{CI Check} C -->|pass| D[Gatekeeper] D -->|approve| E[Merge] C -->|fail| F[Reject] E --> G[Constitution HEAD]
Example
# constitution.yaml (in signed git repo)
rules:
  - name: "no_secret_exfil"
    level: critical
    immutable: true
  - name: "confirm_destructive"
    level: high

# CI: flag deletion of critical rules
Problem

Self-modifying agents can accidentally violate safety rules or regress on alignment when rewriting their constitution.

Solution

Store constitution in version-controlled, signed repo with CI policy checks; agent proposes, gatekeeper merges.

When to use
  • Agent can modify its own rules
  • Safety constraints are critical
  • Audit trail is required
Trade-offs

Pros

  • Full audit history
  • Prevents unauthorized changes

Cons

  • Slower iteration cycle
  • Requires governance overhead
View Original →

Asynchronous Coding Agent Pipeline

Reliability & Eval
Flow
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Inference  │───▶│ Tool Queue  │───▶│   Tool      │
│  Workers    │    │ (Redis/MQ)  │    │  Executors  │
│   (GPU)     │◀───│             │◀───│   (CPU)     │
└──────┬──────┘    └─────────────┘    └─────────────┘
       │
       ▼ trajectories
┌─────────────┐    ┌─────────────┐
│   Replay    │───▶│   Learner   │
│   Buffer    │    │(Policy Upd) │
└─────────────┘    └──────┬──────┘
                          │ new checkpoint
                          ▼
flowchart LR A[Inference Worker] -->|tool call| B[Tool Queue] B -->|request| C[Tool Executor] C -->|result| A A -->|trajectory| D[Replay Buffer] D -->|batch| E[Learner] E -->|checkpoint| A
Example
# Async pipeline components
inference_worker.submit_action("compile", file)
# No blocking - continue inference

# Tool executor (separate process)
tool_queue.subscribe("compile_requests")
result = run_compile(request)
result_queue.publish(result)

# Learner updates policy periodically
learner.update_from_buffer(replay_buffer)
Problem

Synchronous tool execution (compilation, testing) creates compute bubbles and idle GPUs, blocking agents while waiting for I/O-bound operations.

Solution

Decouple inference, tool execution, and learning into parallel async components communicating via message queues.

When to use
  • Running RL training for coding agents
  • Tool calls have high latency (compilation, tests)
  • Need to maximize GPU utilization
Trade-offs

Pros

  • High utilization - GPUs stay busy during I/O
  • Scalable - independently scale inference, tools, learning

Cons

  • Complex system maintenance across services
  • Staleness management for policy updates
View Original →

No-Token-Limit Magic

Reliability & Eval
Flow
PROTOTYPE PHASE                    PRODUCTION PHASE
    │                                    │
    ▼                                    ▼
[No Limits] ──▶ [Rich Output] ──▶ [Pattern Discovery] ──▶ [Optimized]
    │               │                    │                    │
 $$$$$           Quality             Insights              $
flowchart LR A[Prototype: No Limits] --> B[Rich Output] B --> C[Pattern Discovery] C --> D[Production: Optimized]
Example
# Prototype: no limits
config = {
    "max_tokens": None,  # Unlimited
    "reasoning_passes": 5,
    "self_correction": True
}
# Find patterns, then optimize later
Problem

Aggressive prompt compression to save tokens stifles reasoning depth and self-correction capabilities.

Solution

During prototyping, remove hard token limits. Allow lavish context and multiple reasoning passes to discover valuable patterns.

When to use
  • Prototyping and experimentation phase
  • Discovering optimal reasoning patterns
  • Tasks requiring deep self-correction
Trade-offs

Pros

  • Dramatically better output quality
  • Surfaces valuable patterns for optimization

Cons

  • Higher cost during prototyping
  • Not suitable for production without optimization
View Original →

Deterministic Security Scanning Build Loop

Security & Safety
Flow
Agent ──▶ Generate ──▶ make all ──▶ Scan Pass?
                                        │
                          ┌─────────────┴─────────────┐
                          │                           │
                         No                          Yes
                          │                           │
                    See errors              ──▶ Complete ✓
                          │
                    ◀── Regenerate
flowchart TD A[Agent generates code] --> B[Run make all] B --> C{Security scan?} C -->|Fail| D[Error in context] D --> E[Agent regenerates] E --> A C -->|Pass| F[Done]
Example
all: build test security-scan

security-scan:
    semgrep --config=auto src/
    bandit -r src/
# make stops at the first failing scanner; no explicit exit needed
Problem

Non-deterministic security approaches (Cursor rules, MCP tools) are suggestions that LLMs may ignore.

Solution

Integrate deterministic security scanners into the build loop that agents must run after every change.

When to use
  • AI-assisted code generation
  • Security-critical applications
  • Need consistent policy enforcement
Trade-offs

Pros

  • Deterministic, battle-tested tools
  • Reuses existing security infra
  • Works with any agent/harness

Cons

  • Increases build time
  • May produce false positives
  • Requires fast tools for good DX
View Original →

Egress Lockdown

Security
Flow
┌─────────────────────────────────────────┐
│            SANDBOXED AGENT              │
│  [Private Data] ──▶ [Agent] ──▶ ???     │
└─────────────────────────────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
    api.internal   attacker.com  ANY
         ✓             ✗          ✗

   OUTPUT DROP (default) | ACCEPT (whitelist)
flowchart LR A[Agent] -->|Request| FW[Egress Firewall] FW -->|api.internal| OK[Allowed] FW -->|attacker.com| X[Blocked] FW -->|dynamic URL| X
Example
# Docker egress rules
iptables -P OUTPUT DROP  # default-deny
iptables -A OUTPUT \
  -d api.mycompany.internal \
  -j ACCEPT

# Log blocked attempts
iptables -A OUTPUT -j LOG
Problem

Even with private-data access and untrusted inputs, attacks fail if the agent has no way to transmit stolen data.

Solution

Implement egress firewall: allow only specific domains, strip content from outbound calls, forbid dynamic link generation.

When to use
  • Agents handle sensitive data
  • Processing untrusted inputs
  • High-security environments
Trade-offs

Pros

  • Drastically reduces leak risk
  • Easy to reason about
  • Simple network rules

Cons

  • Breaks legitimate integrations
  • Requires proxy stubs
  • Limits agent capabilities
View Original →

Isolated VM per RL Rollout

Security & Safety
Flow
               ┌─── VM1 [Rollout 1] ───┐
               │   shell("grep TODO")  │
Training ──▶   ├─── VM2 [Rollout 2] ───┼──▶ Rewards
               │   shell("rm temp")    │
               └─── VM3 [Rollout 3] ───┘

Each VM: [Fresh FS] → [Execute] → [Destroy]
sequenceDiagram participant T as Training participant VM1 as VM 1 participant VM2 as VM 2 T->>VM1: Rollout 1: shell() T->>VM2: Rollout 2: shell() VM1-->>T: Result VM2-->>T: Result Note over VM1,VM2: VMs destroyed
Example
@app.cls(image=base_image, timeout=600)
class IsolatedToolExecutor:
    def execute_shell(self, rollout_id, cmd):
        # Safe: isolated VM per rollout
        return subprocess.run(cmd, shell=True)
    # VM auto-destroyed after rollout
Problem

RL training rollouts share infrastructure, causing cross-contamination and corrupted rewards when agents execute destructive commands.

Solution

Spin up an isolated VM/container for each rollout, ensuring complete environment isolation with fresh state.

When to use
  • RL training with tool-using agents
  • Agents with shell/filesystem access
  • Parallel rollouts (100+)
Trade-offs

Pros

  • Complete isolation
  • Deterministic rewards

Cons

  • Cost of 100s VMs
  • Provisioning latency
View Original →

PII Tokenization

Security & Safety
Flow
Tool ──▶ [john@x.com] ──▶ MCP Client ──▶ [EMAIL_1] ──▶ Model
                              │
                         Tokenize
                              │
Model ──▶ "Send to [EMAIL_1]" ──▶ MCP ──▶ [john@x.com] ──▶ Tool
                                    │
                               Untokenize
flowchart LR A[Tool Response] --> B[MCP Client] B --> C[Tokenize PII] C --> D[Model sees tokens] D --> E[Tool Call with tokens] E --> F[Untokenize] F --> G[Real Tool Call]
Example
# MCP client intercepts
data = {"email": "john@example.com"}
# Tokenized for model
ctx = {"email": "[EMAIL_1]"}
# Agent reasons with tokens
# Real value restored for tools
Problem

Sending raw PII through the model's context creates privacy risks and compliance concerns.

Solution

Intercept tool responses to tokenize PII before reaching model, untokenize when making actual tool calls.
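A minimal sketch for the email case; the regex and token format are illustrative, not a complete PII detector:
import re

def tokenize_pii(text, store):
    def repl(m):
        token = f"[EMAIL_{len(store) + 1}]"
        store[token] = m.group(0)            # remember the real value
        return token
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", repl, text)

def untokenize_pii(text, store):
    for token, real in store.items():
        text = text.replace(token, real)     # restore before the real tool call
    return text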

When to use
  • Workflows processing customer data
  • Compliance-sensitive environments (GDPR, HIPAA)
  • Multi-step automation involving PII
Trade-offs

Pros

  • Prevents raw PII in model context
  • Transparent to agent reasoning

Cons

  • PII detection must be accurate
  • Doesn't prevent inference of patterns
View Original →

Agent-First Tooling and Logging

Tool Use & Environment
Flow
Agent ──▶ CLI --for-agent --json ──▶ System
                                        │
              ┌─────────────────────────┘
              ▼
         Unified Logger (JSON Lines)
              │
              ▼
         Agent Parses Structured Data
              │
              ▼
         Next Action Based on Analysis
sequenceDiagram participant A as Agent participant C as CLI --for-agent participant L as Unified Logger A->>C: command --json C->>L: Write JSON log entry L->>A: Return structured data A->>A: Parse & decide next action
Example
# Human-friendly (hard to parse)
$ npm test
PASS src/test.js

# Agent-friendly (easy to parse)
$ npm test --json --for-agent
{"status":"pass","file":"src/test.js"}
Problem

Human-oriented tools with colors and multi-line output are hard for agents to parse reliably

Solution

Design agent-first tools with machine-readable output (JSON), unified logs, and a --for-agent flag

When to use
  • Building agent-driven workflows
  • Multiple log sources need consolidation
  • Reliable automation is required
Trade-offs

Pros

  • Better parsing accuracy
  • Less token waste

Cons

  • Sacrifices human readability
  • Requires investment in tool changes
View Original →

CLI-Native Agent Orchestration

Tool Use & Environment
Flow
┌─────────────────────────────────────┐
│          Integration Points         │
└─────────────────────────────────────┘
        │         │         │
        ▼         ▼         ▼
   Makefile   Git Hook   Cron Job
        │         │         │
        └────┬────┴────┬────┘
             │         │
             ▼         ▼
      claude spec   claude repl
             │
             ▼
      Local Context ──▶ Agent ──▶ Output
flowchart TD A[Makefile] --> D[claude CLI] B[Git Hook] --> D C[Cron Job] --> D D --> E[Load Local Context] E --> F[Agent Processing] F --> G[Output/Changes]
Example
# In your Makefile
generate-from-spec:
    claude spec run --input api.yaml --output src/

test-spec-compliance:
    claude spec test --spec api.yaml --codebase src/

# Git pre-commit hook
claude spec test || exit 1
Problem

Web chat UIs are awkward for repeat runs, local file edits, or scripting inside CI pipelines.

Solution

Expose agent capabilities through a first-class CLI for Makefiles, Git hooks, cron jobs, and headless automation.

When to use
  • Integrating agents into CI/CD pipelines
  • Automating repetitive development tasks
  • Need headless or scripted agent execution
Trade-offs

Pros

  • Scriptable and composable with other tools
  • Works offline with local context
  • Easy to embed in existing workflows

Cons

  • Initial install and auth setup
  • Learning curve for CLI flags
View Original →

Code-Then-Execute Pattern

Tool Use & Environment
Flow
LLM ──▶ [Write DSL] ──▶ Static Check ──▶ Sandbox Run
             │              │                │
        ┌────┴────┐    ┌────┴────┐      ┌────┴────┐
        │ x=read  │    │ taint   │      │ locked  │
        │ y=proc  │───▶│ verify  │─────▶│ execute │
        │ z=write │    │ flows   │      │         │
        └─────────┘    └─────────┘      └─────────┘
flowchart LR A[LLM] --> B[Write DSL Code] B --> C[Static Checker] C --> D[Taint Verify] D --> E[Sandbox Execute] E --> F[Results]
DSL Example
x = calendar.read(today)
y = QuarantineLLM.format(x)
email.write(to="john@acme.com", body=y)
# Static check: tainted var can't reach recipient
Problem

Plan lists are opaque; security requires full data-flow analysis and taint tracking

Solution

LLM outputs sandboxed program/DSL. Static checker verifies flows, interpreter runs in locked sandbox
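A toy version of the static check over the DSL example above; a real system needs a proper parser and taint propagation:
import re

def check_flows(program_lines, tainted={"x", "y"}):
    for line in program_lines:
        m = re.search(r"email\.write\(to=([^,)]+)", line)
        if m and m.group(1).strip() in tainted:   # tainted var reaches recipient
            raise ValueError(f"blocked flow: {line}")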

When to use
  • Complex multi-step agents
  • SQL copilots
  • Software engineering bots
Trade-offs

Pros

  • Formal verifiability
  • Replay logs for audit

Cons

  • Requires DSL design
  • Static analysis infrastructure
View Original →

Dual-Use Tool Design

Tool Use
Flow
Human ──▶ /commit ──┐
                    │     ┌────────────────┐
                    ├───▶ │  Shared Tool   │ ──▶ Result
                    │     │  (same logic)  │
Agent ──▶ /commit ──┘     └────────────────┘

"Everything you can do, Claude can do"
flowchart LR H[Human] -->|/commit| T[Shared Tool] A[Agent] -->|/commit| T T --> R[Same Result] T --> L[Same Logs]
Example
define_slash_command("/commit", {
    "steps": ["lint", "gen message", "commit"],
    "callable_by": ["human", "agent"],
    "pre_allowed": ["git add", "git commit"]
})
# Human: $ /commit
# Agent: agent.call("/commit")
Problem

Building separate tools for humans and AI agents creates maintenance overhead, inconsistent behavior, and feature drift.

Solution

Design all tools to be dual-use: same interface, shared logic, equally accessible to both humans and AI agents.

When to use
  • Building developer tools
  • Creating slash commands
  • Designing agent-assisted workflows
Trade-offs

Pros

  • Reduced maintenance (one implementation)
  • Consistent behavior for both
  • Single test suite

Cons

  • Must satisfy both ergonomics
  • May compromise optimization
  • Documentation challenge
View Original →

LLM-Friendly API Design

Tool Use & Environment
Flow
Human API          LLM-Friendly API
─────────          ────────────────
Complex nesting    Flat structure
Implicit version   Explicit v2.0
Cryptic errors     Actionable errors
Many indirects     2 levels max
flowchart LR A[API v2.0] --> B[Self-descriptive] B --> C[Clear Errors] C --> D[LLM Success]
Example
# LLM-friendly function
def create_user_v2(
    name: str,      # Descriptive param names
    email: str
) -> UserResult:   # Typed return
    """Creates a new user account."""
    if "@" not in email:
        # Actionable error, not a cryptic code
        raise ValueError(f"invalid email {email!r}: expected name@domain")
Problem

APIs designed for humans are often ambiguous or complex for LLMs, causing unreliable tool use.

Solution

Design APIs with explicit versioning, self-descriptive names, clear errors, and minimal indirection.

When to use
  • Building agent-callable APIs
  • Exposing tools to LLMs
  • Internal library design
Trade-offs

Pros

  • Reliable LLM tool use
  • Self-correcting errors

Cons

  • API redesign effort
  • May be verbose
View Original →

Multi-Platform Communication Aggregation

Tool Use
Flow
                    ┌─── iMessage ───┐
                    │                │
Query ──▶ Agent ───┼─── Slack ──────┼──▶ Aggregate ──▶ Results
                    │                │
                    └─── Email ──────┘

Output: [{platform, sender, time, content, url}, ...]
flowchart LR Q[User Query] --> A[Aggregator] A --> M[iMessage] A --> S[Slack] A --> E[Email] M --> R[Results] S --> R E --> R R --> T[Unified Table]
Example
search_all() {
  query="$1"
  messages search "$query" > /tmp/msg.json &
  slack search "$query" > /tmp/slack.json &
  email search "$query" > /tmp/email.json &
  wait
  aggregate_results /tmp/*.json
}
Problem

Searching for info across multiple platforms (email, Slack, iMessage) requires slow manual checks on each

Solution

Create unified search interface that queries all platforms in parallel and aggregates results into single format

When to use
  • "Where did X mention Y?" searches
  • Finding conversations without knowing platform
  • Cross-platform audit/compliance
  • Building unified inbox features
Trade-offs

Pros

  • Single query searches all platforms
  • Parallel execution minimizes latency
  • Extensible to new platforms

Cons

  • Must maintain per-platform adapters
  • Cross-platform ranking is subjective
  • Aggregation increases privacy exposure
View Original →

Patch Steering via Prompted Tool Selection

Tool Use & Environment
Flow
User: "Refactor X" ──▶ [+Prompt: "Use ASTRefactor"] ──▶ Agent
                                                          │
                                                    Selects: ASTRefactor
                                                          │
                                                          ▼
                                                    Safe AST Patch
flowchart LR A[User Request] --> B[Augmented Prompt] B --> C[Agent selects ASTRefactor] C --> D[Safe Code Patch]
Example
# Prompt template with tool steering
prompt = f"""
Task: {task_description}
Preferred tool: ASTRefactor
Usage: {{"file": str, "pattern": str}}
Fallback: apply_patch if AST fails
"""
Problem

Agents with multiple patching tools may choose suboptimal ones, causing inconsistent and lower quality results.

Solution

Steer tool selection through explicit natural language instructions, specifying preferred tools and usage patterns in prompts.

When to use
  • Multiple patching/refactoring tools available
  • Need for consistent tool usage
  • AST-safe operations preferred over text replace
Trade-offs

Pros

  • Predictable tool selection behavior
  • Higher code quality with semantic tools

Cons

  • Prompt length increases
  • Requires maintenance as tools evolve
View Original →

Proactive Trigger Vocabulary

UX & Collaboration
Flow
User Input ──▶ Match Triggers?
                    │
     ┌──────────────┼──────────────┐
     │              │              │
     ▼              ▼              ▼
"sup" ──▶      "search hn"    No match
priority       ──▶ hn-search   ──▶ General
-report        skill           response
graph TD A[User Input] --> B{Match Triggers?} B -->|"sup"| C[priority-report skill] B -->|"search hn"| D[hn-search skill] B -->|No match| E[General response]
Example
skill: priority-report
triggers:
  exact: ["sup", "standup prep"]
  contains: ["what should I work on"]
  patterns: ["what.*on my plate"]
proactive: true
Problem

When an agent has many skills, it is opaque which skill a given natural-language input routes to, so users cannot tell which phrases activate which feature

Solution

Define and document an explicit trigger vocabulary (keywords, patterns) per skill for transparent, predictable skill routing
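A sketch of the router implied by the YAML example above, reusing its exact/contains/patterns fields; the "general" fallback name is an assumption:
import re

def route(user_input, skills):
    text = user_input.lower().strip()
    for name, trig in skills.items():
        if (text in trig.get("exact", [])
                or any(s in text for s in trig.get("contains", []))
                or any(re.search(p, text) for p in trig.get("patterns", []))):
            return name
    return "general"                 # no trigger matched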

When to use
  • Agent has multiple skills/features
  • Users need a clear way to activate them
  • Proactive activation on certain topics is desired
  • Fast, predictable routing matters
Trade-offs

Pros

  • Transparent, predictable routing
  • Fast via string matching
  • Easy to debug

Cons

  • Misses phrasings outside the trigger list
  • Maintenance as vocabulary shifts
  • Possible language/culture bias
View Original →

Shell Command Contextualization

Tool Use
Flow
User ──▶ "!ls -la" ──▶ Interface ──▶ Shell ──▶ Output
                                        │
          Agent Context ◀───────────────┘
              [cmd + output injected]
sequenceDiagram User->>Interface: !ls -la Interface->>Shell: Execute command Shell-->>Interface: Output Interface->>Agent: Inject cmd + output Agent-->>User: Response with context
Example
# User types in Claude Code:
!git status

# Agent receives in context:
# Command: git status
# Output: On branch main, 2 files changed...
Problem

Manually copying shell command output into agent context is tedious and error-prone

Solution

Provide a special prefix (e.g., !) that executes shell commands and auto-injects both command and output into agent context

When to use
  • Agent needs real-time environment state
  • Checking git status, file listings, test results
  • Interactive debugging with agent assistance
Trade-offs

Pros

  • Seamless environment integration
  • No manual copy-paste required

Cons

  • Security risk from arbitrary commands
  • Large outputs can bloat context
View Original →

Tool Use Steering via Prompting

Tool Use
Flow
User Task ──┬──▶ [Direct: "Use file search tool"]
            │
            ├──▶ [Teach: "Use barley CLI, -h for help"]
            │
            ├──▶ [Implicit: "commit, push, pr"]
            │
            └──▶ [Think: "*think hard*"]
                       │
                       ▼
               Agent Tool Selection ──▶ Execute
flowchart TD A[User Task] --> B[Available Tools] A --> C[Explicit Guidance] C --> D[Direct Invocation] C --> E[Teaching Usage] C --> F[Implicit Shortcut] D & E & F --> G[Agent Selection] G --> H[Execute]
Example
# Direct invocation
"Use the file search tool to find config files"

# Teaching tool usage
"Use our barley CLI to check logs. -h for help"

# Implicit shortcut (learned association)
"commit, push, pr"  # agent knows git workflow
Problem

Having tools available does not guarantee agents will use them appropriately, especially for custom or team-specific tools.

Solution

Guide tool selection via explicit prompts: direct invocation, teaching tool usage, implicit shortcuts, and deeper reasoning triggers.

When to use
  • Custom or team-specific tools exist
  • Agent tool selection is suboptimal
  • Teaching new tool workflows
Trade-offs

Pros

  • Direct control over tool usage
  • Enables custom tool onboarding

Cons

  • Requires prompt engineering
  • May reduce agent autonomy
View Original →

Agent SDK for Programmatic Control

Tool Use & Environment
Flow
Application/Script ──▶ Agent SDK
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
     CLI Interface    Python Lib      TS Library
          │                 │                 │
          └─────────────────┼─────────────────┘
                            ▼
                       Agent Core
                     (Tools, Memory)
flowchart TD A[Application/Script] --> B[Agent SDK] B --> C[CLI Interface] B --> D[Python Library] B --> E[TypeScript Library] C & D & E --> F[Agent Core] F --> G[Tool Access] F --> H[Memory]
Example
# CLI usage for CI/CD
$ claude -p "what changed this week?" \
  --allowedTools "Bash(git log:*)" \
  --output-format json

# Python SDK
agent.run("Review PR", tools=["git"])
Problem

Interactive chat interfaces do not support CI/CD pipelines, scheduled jobs, or custom application integration

Solution

Provide SDKs (CLI, Python, TypeScript) that expose agent actions for programmatic access and automation
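
A CI-style sketch that shells out to the CLI shown in the example above; the flags come from that example, while the report filename is an illustrative artifact path.

import json
import subprocess

proc = subprocess.run(
    ["claude", "-p", "what changed this week?",
     "--allowedTools", "Bash(git log:*)",
     "--output-format", "json"],
    capture_output=True, text=True, check=True)

report = json.loads(proc.stdout)            # structured output for pipelines
with open("weekly-report.json", "w") as f:  # illustrative artifact path
    json.dump(report, f, indent=2)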

When to use
  • CI/CD integration is needed
  • Building custom applications
  • Batch processing workflows
Trade-offs

Pros

  • Enables automation pipelines
  • Flexible integration options

Cons

  • SDK learning curve
  • May require API key management
View Original →

Code Mode MCP Tool Interface

Tool Use & Environment
Flow
Traditional:
  LLM ──▶ Tool1 ──▶ [JSON 1000t] ──▶ LLM
  LLM ──▶ Tool2 ──▶ [JSON 1000t] ──▶ LLM
  LLM ──▶ Tool3 ──▶ [JSON 1000t] ──▶ LLM

Code Mode:
  LLM ──▶ V8 Isolate ┬──▶ Tool1
                     ├──▶ Tool2
                     └──▶ Tool3
                          │
                     [Condensed] ──▶ LLM
flowchart LR A[LLM] -->|TypeScript| B[V8 Isolate] B -->|binding| C[MCP Server] C --> D[APIs] D --> C C --> B B -->|condensed| A
Example
// LLM generates orchestration code
const vpc = await createVPC({name: "demo"})
const igw = await createInternetGateway(vpc.id)
const sg = await createSecurityGroup(vpc.id, rules)
const ec2 = await launchEC2({vpcId: vpc.id})
// All calls in isolate - only result to LLM
return {vpcId: vpc.id, publicIP: ec2.ip}
Problem

Traditional MCP tool calls force all intermediate JSON through the context, causing massive token waste on multi-step and fan-out operations.

Solution

LLMs write TypeScript code that orchestrates MCP tools in V8 isolates; only final results return to context.

When to use
  • Multi-step workflows with clear sequences
  • Fan-out operations (100+ items to process)
  • Token costs or latency are critical
Trade-offs

Pros

  • 10x+ token reduction on multi-step workflows
  • Dramatic fan-out efficiency with loops
  • Self-debugging with error handling and retry

Cons

  • Requires V8 isolate infrastructure
  • Poor fit for dynamic research loops
  • Needing LLM judgment mid-loop defeats the purpose
View Original →

Dynamic Code Injection

Tool Use
Flow
User: "@src/Button.js"
         │
         ▼
┌─────────────────────────────┐
│  Preprocessor               │
│  1. Parse @mention          │
│  2. Read file (lines 1-50)  │
│  3. Inject into context     │
└─────────────────────────────┘
         │
         ▼
Agent: [Context + Button.js] ──▶ Continue
sequenceDiagram participant U as User participant P as Preprocessor participant FS as File System participant A as Agent U->>P: @src/Button.js:10-50 P->>FS: Read lines 10-50 FS-->>P: File content P->>A: Inject into context A-->>U: Continue with file visible
Example
# Syntax examples
"@path/to/file.ext"        # Full file
"/load file.ext:10-50"     # Line range
"/summarize test_spec.py"  # Extract summary

# Injected as:
# /// BEGIN Button.js
# ...content...
# /// END Button.js
Problem

Manually copying files into prompts is tedious, wastes tokens, and interrupts workflow momentum.

Solution

Allow on-demand file injection via @filename or /load syntax that fetches, optionally summarizes, and injects code into context.
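
A preprocessor sketch for the @mention syntax above; the regex, optional :10-50 range, and BEGIN/END markers follow the card's examples, and summarization is omitted for brevity.

import re
from pathlib import Path

MENTION = re.compile(r"@([\w./-]+?)(?::(\d+)-(\d+))?(?=\s|$)")

def inject_mentions(prompt: str) -> str:
    def expand(match):
        path, start, end = match.group(1), match.group(2), match.group(3)
        lines = Path(path).read_text().splitlines()
        if start and end:                      # optional line range
            lines = lines[int(start) - 1:int(end)]
        return f"/// BEGIN {path}\n" + "\n".join(lines) + f"\n/// END {path}"
    return MENTION.sub(expand, prompt)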

When to use
  • Interactive coding sessions
  • Exploring unfamiliar codebases
  • Need specific file context mid-conversation
Trade-offs

Pros

  • Interactive code exploration
  • No manual copy/paste
  • Improves agent accuracy

Cons

  • Requires file system access
  • Security risk if unsandboxed
  • Summarization may lose context
View Original →

Progressive Tool Discovery

Tool Use & Environment
Flow
servers/
├── google-drive/   ──▶ list (names only)
│   ├── getDocument     ──▶ search (+ description)
│   └── listFiles           ──▶ get (full schema)
├── slack/
└── github/

Agent: list_dir("./servers/") → ["google-drive/", ...]
       search_tools("google-drive/*", "name+desc")
       get_tool_definition("getDocument") → {schema}
flowchart LR A[List Servers] --> B[Browse Category] B --> C[Search Tools] C --> D[Get Full Schema] D --> E[Execute Tool]
Example
# Progressive discovery workflow
servers = list_directory("./servers/")  # names only
tools = search_tools("google-drive/*",
                     detail="name+description")
schema = get_tool_definition(
    "servers/google-drive/getDocument"
)  # full JSON schema when needed
Problem

Loading all tool definitions upfront consumes excessive context window space when agents have access to large tool catalogs, limiting space for actual task execution.

Solution

Present tools through a filesystem-like hierarchy where agents discover capabilities on-demand, requesting different detail levels (name only, description, full schema) as needed.
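
A registry sketch of the three detail levels from the workflow above; the in-memory catalog layout is an assumption standing in for real MCP server metadata.

class ToolRegistry:
    def __init__(self, catalog: dict):
        # catalog[server][tool] = {"description": str, "schema": dict}
        self.catalog = catalog

    def list_servers(self) -> list:
        return sorted(self.catalog)                    # level 1: names only

    def search_tools(self, server: str) -> dict:
        return {name: meta["description"]              # level 2: + description
                for name, meta in self.catalog[server].items()}

    def get_tool_definition(self, server: str, tool: str) -> dict:
        return self.catalog[server][tool]["schema"]    # level 3: full schema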

When to use
  • Systems with 20+ available tools
  • MCP server implementations
  • Plugin architectures with many capabilities
Trade-offs

Pros

  • Reduces initial context consumption
  • Scales to hundreds of tools
  • Natural filesystem-like exploration

Cons

  • Adds discovery overhead
  • Multiple round-trips to find tools
  • Requires thoughtful organization
View Original →

Subagent Compilation Checker

Tool Use
Flow
Main ──▶ "Compile svc-A" ──▶ CompileSubagent ──▶ Build
                                                  │
         Main Context ◀───────────────────────────┘
           [{file:"x.go", line:10, err:"..."}]
sequenceDiagram Main->>Subagent: Compile module A Subagent->>Build: go build alt Success Subagent-->>Main: {status: ok, artifact: "a.bin"} else Failure Subagent-->>Main: [{file, line, error}] end
Example
result = spawn_compile_agent("auth-service")
# Returns concise error summary:
# [{file:"auth.go", line:85, error:"undefined"}]
main_context.inject(result.errors)
Problem

Including full build logs in main agent context blows up context length and slows inference

Solution

Spawn specialized subagents to compile each module, returning only concise error summaries to main agent
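
A subagent sketch as a plain function: run the build, parse file:line diagnostics, and return only the compact summary. The error format matches standard go build output; the module layout and 20-error cap are illustrative.

import re
import subprocess

def compile_module(module_dir: str) -> list:
    proc = subprocess.run(["go", "build", "./..."], cwd=module_dir,
                          capture_output=True, text=True)
    if proc.returncode == 0:
        return []                                    # clean build: inject nothing
    errors = []
    for line in proc.stderr.splitlines():
        m = re.match(r"(.+\.go):(\d+)(?::\d+)?: (.+)", line)
        if m:
            errors.append({"file": m.group(1), "line": int(m.group(2)),
                           "error": m.group(3)})
    return errors[:20]                               # keep main context small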

When to use
  • Multi-module/microservice projects
  • Build logs too verbose for context
  • Parallel compilation needed
Trade-offs

Pros

  • Main agent context stays clean
  • Parallel builds possible

Cons

  • Infrastructure to manage subagents
  • Build dependency coordination
View Original →

Virtual Machine Operator Agent

Tool Use
Flow
User ──▶ Agent ──▶ ┌──────────────────────┐
                   │    VIRTUAL MACHINE   │
                   │ ┌──────────────────┐ │
                   │ │ Execute Code     │ │
                   │ │ Install Packages │ │
                   │ │ File Operations  │ │
                   │ │ Run Applications │ │
                   │ └──────────────────┘ │
                   └──────────┬───────────┘
                              │
                   Agent ◀────┘ Results
sequenceDiagram User->>Agent: Complex Task Agent->>VM: Execute Code Agent->>VM: Install Packages Agent->>VM: File Ops VM-->>Agent: Results Agent-->>User: Task Report
Example
# Agent operating in VM environment
vm.execute("pip install pandas matplotlib")
vm.execute("python analyze_data.py")
vm.read_file("/output/report.pdf")
vm.execute("git add . && git commit -m 'Add report'")
# Agent has full computer operator capability
Problem

Agents limited to code generation cannot perform complex tasks requiring full computer environment interaction.

Solution

Give agent access to a dedicated VM environment to execute code, install packages, manage files, and run applications.

When to use
  • Tasks require full OS interaction
  • Need to install and run software
  • Complex multi-step system operations
Trade-offs

Pros

  • General-purpose digital operator
  • Full system capability

Cons

  • Higher security risk
  • Resource intensive
View Original →

Agentic Search Over Vector Embeddings

Tool Use & Environment
Flow
Traditional RAG:
  Code ──▶ Index ──▶ Vector DB ──▶ Query
           (stale)     (infra)

Agentic Search:
  Query ──▶ grep/find ──▶ Refine ──▶ Results
              │              │
              └──────────────┘
           (current state, no infra)
flowchart LR A[Search Query] --> B[grep/ripgrep] B --> C[Analyze Results] C --> D{Found?} D -->|No| E[Refine Search] E --> B D -->|Yes| F[Return Results]
Example
# Instead of vector search:
# vector_db.index(codebase)  # stale!
# results = vector_db.query(embed(q))

# Use agentic search:
agent.call_tool("grep", "function.*auth")
agent.call_tool("find", "**/auth/*.ts")
agent.refine_search_based_on_results()
Problem

Vector embeddings require continuous re-indexing, handling of local changes, and infrastructure overhead

Solution

Replace vector search with agentic search using bash, grep, and file navigation - no pre-indexing required
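
A minimal iteration sketch: try the precise pattern first, then a broadened fallback. The ripgrep (rg) call is real; the refinement policy and 50-hit cap are illustrative.

import subprocess

def agentic_search(pattern: str, root: str = ".") -> list:
    candidates = [pattern, pattern.split(".*")[0]]   # exact, then broadened
    for query in candidates:
        proc = subprocess.run(["rg", "-n", query, root],
                              capture_output=True, text=True)
        hits = proc.stdout.splitlines()
        if hits:
            return hits[:50]          # live working tree, nothing pre-indexed
    return []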

When to use
  • Frequently changing codebases
  • No vector infrastructure available
  • Security-sensitive deployments
Trade-offs

Pros

  • No index maintenance required
  • Always searches current state

Cons

  • May need multiple iterations
  • Slow on very large codebases
View Original →

Code-Over-API Pattern

Tool Use & Environment
Flow
Direct API (High Cost):
  Agent ──▶ API ──▶ [10K rows: 150K tokens] ──▶ Context

Code-Over-API (Low Cost):
  Agent ──▶ Write Code ──▶ Sandbox
                              │
                         [Process 10K]
                              │
                         [Filter/Agg]
                              │
                         [Summary: 2K] ──▶ Context
flowchart LR A[Agent] -->|write code| B[Sandbox] B -->|fetch| C[API/Data] C -->|10K rows| B B -->|process| B B -->|summary only| A
Example
def process_spreadsheet():
    # Fetch in execution env (not context)
    rows = spreadsheet.getRows(sheet_id="abc")
    # Filter in code
    active = [r for r in rows if r.status == "active"]
    # Only summary to context
    print(f"Found {len(active)} of {len(rows)}")
    return active[:5]  # Sample only
Problem

Direct API calls force all intermediate data through the model's context window, consuming 150K+ tokens in data-heavy workflows.

Solution

Agent writes code that executes in sandbox; data processing happens in execution environment with only results returning to context.

When to use
  • Data-heavy workflows (spreadsheets, databases, logs)
  • Multi-step transformations or aggregations
  • Cost-sensitive applications where token usage matters
Trade-offs

Pros

  • Dramatic token reduction (150K to 2K reported)
  • Lower latency (fewer context API calls)
  • Natural fit for data processing tasks

Cons

  • Requires secure code execution infrastructure
  • Agents must write correct code
  • Debugging errors happens in execution, not context
View Original →

Visual AI Multimodal Integration

Tool Use
Flow
         ┌─────────┐
Text ────┤         │
         │ Multi   ├──▶ Cross-Modal ──▶ Solution
Image ───┤ Modal   │    Reasoning
         │ Model   │
Video ───┤         │
         └─────────┘

Image ──▶ [OCR] + [Objects] + [Spatial] ──▶ Understanding
flowchart LR A[Text Query] --> M[Multimodal LLM] B[Image] --> M C[Video] --> M M --> D[Object Detection] M --> E[OCR] M --> F[Spatial Understanding] D & E & F --> G[Combined Solution]
Example
class VisualAIAgent:
    async def process(self, task, image):
        analysis = await self.mm_llm.analyze(
            prompt=f"Task: {task}",
            image=image
        )
        # Extract: objects, OCR text, spatial info
        return await self.solve_with_visual(analysis)
Problem

Text-only agents miss critical visual information in images, videos, diagrams, and UI screenshots.

Solution

Integrate multimodal models (LMMs) to accept visual inputs, extract information, and combine with text for cross-modal reasoning.

When to use
  • Tasks involve images or screenshots
  • UI debugging or visual analysis
  • Document/chart understanding
Trade-offs

Pros

  • Enables new visual task categories
  • More natural show-not-tell interaction

Cons

  • Higher computational cost
  • Privacy concerns with visual data
View Original →

Abstracted Code Representation for Review

UX & Collaboration
Flow
AI Code ──▶ Abstractor ──▶ Pseudocode/Summary
   │                              │
   │         ┌────────────────────┘
   │         ▼
   │    Human Review (Intent)
   │         │
   └─────────┼──▶ Verified Code
             ▼
    "Sort changed: O(n²) → O(n log n)"
flowchart LR A[AI Generated Code] --> B[Abstraction Layer] B --> C[Pseudocode/Summary] C --> D[Human Reviews Intent] D --> E{Approved?} E -->|Yes| F[Apply Code Changes] E -->|No| G[Request Revisions]
Example
# Instead of reviewing 50 lines of code:
review = abstractor.summarize(diff)
# Output: "Changed user_list sorting from
#          bubble sort to quicksort.
#          Tests maintained."
human.verify_intent(review)  # Much faster!
Problem

Reviewing AI-generated code line by line is tedious and error-prone, while humans mostly care about intent and logic

Solution

Present an abstracted representation (pseudocode, intent summary) with a guarantee that it maps exactly to the actual code changes
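
An abstractor sketch: one LLM call produces the intent summary while the diff's hunk headers are kept alongside it, preserving the mapping back to concrete lines. llm is a placeholder for any completion function.

def abstract_diff(diff: str, llm) -> dict:
    summary = llm(
        "Summarize the INTENT of this diff in at most 3 bullet points, "
        "noting algorithmic or behavioral changes only:\n\n" + diff)
    hunks = [l for l in diff.splitlines() if l.startswith("@@")]
    changed = sum(1 for l in diff.splitlines()
                  if l.startswith(("+", "-"))
                  and not l.startswith(("+++", "---")))
    return {"summary": summary,      # what the human reviews
            "hunks": hunks,          # traceability to real changes
            "loc_changed": changed}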

When to use
  • Large volumes of AI-generated code
  • Intent verification matters more than syntax checking
  • Trusted generation pipelines
Trade-offs

Pros

  • Accelerates the verification process
  • Lets reviewers focus on conceptual correctness

Cons

  • Abstraction accuracy must be guaranteed
  • May miss low-level bugs
View Original →

AI-Accelerated Learning and Skill Development

UX & Collaboration
Flow
     ┌─────────────────────────────────────────────┐
     │           Fast Learning Loop                │
     └─────────────────────────────────────────────┘
              ┌──────────┐
              │  Try it  │
              └────┬─────┘
                   ▼
     ┌─────────────────────────┐
     │  AI Feedback / Explain  │◀──┐
     └───────────┬─────────────┘   │
                 ▼                 │
          ┌────────────┐           │
          │Learn & Fix │───────────┘
          └────────────┘
                 │
                 ▼
           [Skill++]
flowchart TD A[Try Code] --> B[AI Feedback] B --> C[Learn from Mistakes] C --> D[Improve & Iterate] D --> A D --> E[Skill Development]
Example
# Ask AI to explain unfamiliar code
ai.explain("What does this regex do?")
# AI explains step-by-step

# Learn from AI suggestions
ai.review(my_code)
# "Consider using list comprehension
#  instead of this loop for clarity"
Problem

Developing a 'feel' for code quality and acquiring skills requires years of experience and mentoring - an especially slow process for junior developers

Solution

Use AI agents as interactive learning tools, accelerating skill acquisition through fast iteration, learning from mistakes, and observing best practices

When to use
  • Learning a new language/framework
  • Building code-quality intuition quickly
  • Self-teaching without a mentor
  • Experimenting with different approaches
Trade-offs

Pros

  • Fast feedback loops accelerate learning
  • Always-available tutor
  • Less fear of experimentation

Cons

  • Learning quality depends on AI code quality
  • Risk of surface-level learning without deep understanding
View Original →

Chain-of-Thought Monitoring & Interruption

UX & Collaboration
Flow
Agent: "I'll modify auth.ts..."
   │
   ▼
[Start file read] ──▶ Dev watching
                         │
                    INTERRUPT!
                         │
                         ▼
Dev: "Use oauth.ts instead"
   │
   ▼
Agent: [Read oauth.ts] ──▶ Continue
   │
   ▼
Correction within first tool call
sequenceDiagram participant Dev participant Agent participant Tools Agent->>Dev: Display: "I'll modify auth.ts..." Agent->>Tools: Start file read Dev->>Agent: INTERRUPT! Wrong file Dev->>Agent: "Use oauth.ts instead" Agent->>Tools: Read oauth.ts Agent->>Dev: Display updated reasoning
Example
# Real-time reasoning visibility
stream_reasoning_to_ui(agent.thoughts)

# Low-friction interruption
if user_pressed_escape():
    agent.pause()
    context = user.get_correction()
    agent.inject_context(context)
    agent.resume()
# Partial work preserved on interrupt
Problem

Agents can pursue misguided reasoning paths for extended periods. By the time developers realize it's wrong, significant time and tokens are wasted.

Solution

Real-time surveillance of agent reasoning with low-friction interrupt capability to redirect before completing flawed sequences.

When to use
  • Complex refactoring where wrong file choices are costly
  • High-stakes operations (database migrations, API changes)
  • Agent might misinterpret ambiguous requirements
Trade-offs

Pros

  • Prevents wasted time on fundamentally wrong approaches
  • Maximizes value from expensive model calls
  • Enables collaborative human-AI problem solving

Cons

  • Requires active human attention (not autonomous)
  • Can interrupt productive exploration if triggered prematurely
  • Adds cognitive load to monitor reasoning
View Original →

Democratization of Tooling via Agents

UX & Collaboration
Flow
Non-Dev ──▶ "Make me a dashboard" ──▶ AI Agent ──▶ Code
   │                                       │
   ▼                                       ▼
Iterate ◀── "Add filter for date" ◀─── Working Tool
flowchart LR U[Non-Dev User] -->|Natural Language| A[AI Agent] A -->|Generate| C[Code/Tool] C -->|Review| U U -->|Refine| A
Example
# Sales team member prompt:
"Create a dashboard that shows my
 weekly pipeline from Salesforce,
 with filters by deal stage"

# Agent generates: dashboard.py
# User iterates: "Add export to CSV"
Problem

Non-engineering roles (sales, marketing, ops) need custom tools but lack programming skills to build them

Solution

AI agent translates natural language requests into code, enabling domain experts to create their own tools

When to use
  • Non-developers need custom tools
  • Simple dashboard/script creation
  • Quick bug fixes to existing code
  • Domain experts want self-automation
Trade-offs

Pros

  • Democratizes software access
  • Domain expertise directly applied

Cons

  • Limited for complex systems
  • Requires quality/security review
View Original →

Human-in-the-Loop Approval

Collaboration
Flow
                Low Risk
              ┌────────────────────────────────▶ Execute
              │
Agent ──▶ [Risk?]
              │
              │ High Risk        ┌─────┐
              └───────▶ Human ──▶│ Y/N │──┬──▶ Execute
                                 └─────┘  │
                                          └──▶ Abort / Adapt

Gate:  [auto]────────────●────────────[manual]
                     threshold
sequenceDiagram participant A as Agent participant F as Framework participant S as Slack participant H as Human A->>F: DROP table request F->>F: Classify HIGH RISK F->>S: Request approval S->>H: Notification alt Approved H->>S: Approve S->>F: Granted F->>A: Execute else Rejected H->>S: Reject + reason S->>F: Denied F->>A: Find alternative end
Example
from humanlayer import HumanLayer
hl = HumanLayer()

@hl.require_approval(channel="slack")
def delete_user_data(user_id: str):
    return db.users.delete(user_id)

# Execution pauses until human approves
delete_user_data("user_123")
Problem

Auto-executing risky operations (DB deletion, deployments) creates safety/compliance problems; blocking every operation defeats the point of automation

Solution

Insert human approval gates only for high-risk operations: safe operations run automatically, risky ones request approval via Slack or similar channels

When to use
  • DB: DELETE, DROP, ALTER
  • API: payments, email, webhooks
  • System: firewall, permission changes
  • Compliance: GDPR, HIPAA, SOC2
Trade-offs

Pros

  • Safe automation
  • Audit trail

Cons

  • Waiting on human response
  • Approval fatigue
View Original →

Human-in-the-Loop Approval Framework

UX & Collaboration
Flow
Agent ──▶ "DROP TABLE" ──▶ Framework
                              │
                    [HIGH RISK DETECTED]
                              │
                              ▼
                         Slack #ops
                    ┌─────────────────┐
                    │ [Approve] [Deny]│
                    └─────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
         Execute + Log                   Agent Adapts
sequenceDiagram Agent->>Framework: DROP old_users Framework->>Slack: Request approval Slack->>Human: [Approve] [Reject] Human->>Slack: Approve Slack->>Framework: Granted Framework->>Agent: Execute Framework->>Log: Record
Example
from humanlayer import HumanLayer
hl = HumanLayer()

@hl.require_approval(channel="slack")
def delete_user_data(user_id: str):
    """Requires human approval before execution"""
    return db.users.delete(user_id)
# Pauses for Slack approval button click
Problem

Autonomous agents need to execute high-risk operations (DB changes, deployments), but unsupervised execution creates unacceptable safety risks.

Solution

Insert human approval gates for high-risk functions via Slack/email/SMS, maintaining agent autonomy for safe operations.

When to use
  • Production database operations (DELETE, DROP)
  • External API calls with side effects
  • Compliance-sensitive operations
Trade-offs

Pros

  • Safe autonomous risky operations
  • Lightweight Slack integration

Cons

  • Requires human availability
  • Risk of approval fatigue
View Original →

Latent Demand Product Discovery

UX & Collaboration
Flow
Hackable     Power Users      Analytics
Product  ──▶  "Abuse"    ──▶   Pattern   ──▶ Productize
             Features         Detection

Example: Groups → 40% buy/sell → Marketplace
flowchart LR A[Extension APIs] --> B[Power User Hacks] B --> C[Pattern Detection] C --> D[New Feature] D --> E[All Users]
Example
# Monitor creative usage patterns
if (slash_commands.custom_count > 100 and
    usage.pattern == "notification"):
    # Many users built this - productize it
    roadmap.add("Built-in notifications")
Problem

Difficult to predict which features have real product-market fit before significant engineering investment.

Solution

Build hackable products, observe how power users repurpose features, then productize validated demand.

When to use
  • Building extensible platforms
  • Feature prioritization
  • New product exploration
Trade-offs

Pros

  • Behavior-validated demand
  • Reduced risk

Cons

  • Extension infra needed
  • Power users != mainstream
View Original →

Spectrum of Control / Blended Initiative

UX & Collaboration
Flow
Low Autonomy          Medium              High              Async
    │                    │                   │                  │
    ▼                    ▼                   ▼                  ▼
[Tab Complete] ──▶ [Cmd-K Edit] ──▶ [Agent Mode] ──▶ [Background Agent]
  (inline)        (region/file)    (multi-file)       (full PR)
flowchart LR A[Tab Completion] --> B[Command K] B --> C[Agent Feature] C --> D[Background Agent] A -.->|Low| A D -.->|High Autonomy| D
Example
# User chooses autonomy level:
cursor.tab()       # inline assist
cursor.cmd_k()     # edit region
cursor.agent()     # multi-file task
cursor.background()# async full PR
Problem

One-size-fits-all agent autonomy doesn't fit varying task complexity or user familiarity

Solution

Provide a spectrum of control modes from low (tab-complete) to high (background agent) that users can switch between

When to use
  • IDE/editor integrations
  • Varying task complexity within one session
  • Users with different comfort levels
Trade-offs

Pros

  • Flexible for all task sizes
  • User maintains desired control level

Cons

  • UX complexity with multiple modes
  • Learning curve for mode switching
View Original →

Team-Shared Agent Configuration

UX & Collaboration
Flow
                    ┌─────────────────────────┐
                    │   .claude/settings.json │
                    │   (version controlled)  │
                    └───────────┬─────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
         ▼                      ▼                      ▼
    [Dev A] ──▶ git pull ──▶ [Dev B] ──▶ git pull ──▶ [Dev C]
         │                      │                      │
         └──────────────────────┴──────────────────────┘
                    Same Config Everywhere
flowchart TD R[settings.json in Git] --> A[Dev A] R --> B[Dev B] R --> C[Dev C] A --> |git pull| R B --> |git pull| R C --> |git pull| R
Example
// .claude/settings.json
{
  "permissions": {
    "pre_allowed": ["git add", "npm test"],
    "blocked_paths": [".env", "secrets/"]
  },
  "hooks": { "pre_commit": "./run_tests.sh" }
}
Problem

Independent agent configs per developer cause inconsistent behavior, permission friction, and duplicated effort across the team.

Solution

Store agent configuration in version control as code, enabling team-wide sharing via git pull.
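
A loader sketch matching the JSON schema above; the enforcement helper is an assumption about how pre_allowed and blocked_paths might be applied at command time.

import json
from pathlib import Path

def load_settings(repo_root: str) -> dict:
    return json.loads((Path(repo_root) / ".claude/settings.json").read_text())

def is_allowed(command: str, settings: dict) -> bool:
    perms = settings["permissions"]
    if any(p in command for p in perms["blocked_paths"]):
        return False                                   # e.g. touches .env
    return any(command.startswith(a) for a in perms["pre_allowed"])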

When to use
  • Multiple team members use AI agents
  • Consistent agent behavior is critical
  • New members need quick onboarding
Trade-offs

Pros

  • Consistent team experience
  • Faster onboarding

Cons

  • Less individual flexibility
  • Config sprawl over time
View Original →

Agent-Assisted Scaffolding

UX & Collaboration
Flow
Developer ──▶ "Create user API" ──▶ Agent
                                        │
              ┌─────────────────────────┘
              ▼
         Generated Files:
         ├── routes/user.ts
         ├── controllers/user.ts
         ├── models/user.ts
         └── tests/user.test.ts
              │
              ▼
         Developer: Implement Core Logic
flowchart TD A[High-level Description] --> B[Agent Scaffolding] B --> C[Routes] B --> D[Controllers] B --> E[Models] B --> F[Tests] C & D & E & F --> G[Developer Implements Logic]
Example
agent.scaffold(
    description="User profile API endpoint",
    framework="fastapi",
    include=["routes", "models", "tests"]
)
# Agent generates structure
# Developer fills in business logic
Problem

Starting a new feature requires writing repetitive boilerplate and foundation code, which consumes significant time

Solution

Use AI to generate the initial structure, files, and boilerplate from a high-level description, then focus on core logic

When to use
  • Developing a new feature or module
  • Repetitive project setup tasks
  • Consistent structure is needed
Trade-offs

Pros

  • Fast project kick-off
  • Consistent initial structure

Cons

  • May not match existing conventions
  • Generated code needs review
View Original →

Seamless Background-to-Foreground Handoff

UX & Collaboration
Flow
User: "Refactor X" ──▶ [Background Agent]
                              │
                              ▼
                       Proposes PR (90%)
                              │
           ┌──────────────────┴──────────────────┐
           ▼                                     ▼
     100% Correct ──▶ Done           90% Correct ──▶ Take Over
                                                       │
                                                       ▼
                                          User + Foreground Tools
                                                       │
                                                       ▼
                                                  Finalized PR
flowchart TD A[User Request] --> B[Background Agent] B --> C[Proposed PR] C --> D{Review} D -->|100%| E[Done] D -->|90%| F[Take Over] F --> G[Foreground Edit] G --> E
Example
# Background agent completes task
pr = background_agent.work("Refactor module X")

# User reviews and takes control if needed
if user.review(pr) != "approved":
    # Seamless handoff with context
    foreground = pr.take_control()
    foreground.edit_with_ai_assist()
    foreground.finalize()
Problem

Background agents can handle complex tasks autonomously but may achieve only 90% correctness. A clunky handoff process to human control negates the automation benefits.

Solution

Design systems allowing seamless transition from background agent work to foreground human control, preserving context so users can refine the remaining 10% efficiently.

When to use
  • Background agents producing near-complete work
  • Tasks requiring human finesse for completion
  • Developer workflows with autonomous PR generation
Trade-offs

Pros

  • Leverages autonomous processing power
  • Retains human control for final touches
  • Preserves context across transition

Cons

  • Requires careful UX design
  • Context handoff complexity
  • May create workflow interruptions
View Original →

Verbose Reasoning Transparency

UX & Collaboration
Flow
User ──▶ Complex Task ──▶ Agent ──▶ [Standard Output]
                                         │
                               ┌─────────┘
                               │  Ctrl+R
                               ▼
                    ┌──────────────────────┐
                    │  VERBOSE VIEW        │
                    │  - Reasoning steps   │
                    │  - Tool selection    │
                    │  - Confidence scores │
                    │  - Raw tool outputs  │
                    └──────────────────────┘
sequenceDiagram User->>Agent: Task Agent-->>User: Standard output User->>UI: Ctrl+R UI-->>User: Reasoning steps UI-->>User: Tool rationale UI-->>User: Raw outputs
Example
# Verbose mode output example
{
    "interpretation": "User wants to refactor...",
    "tools_considered": ["grep", "ast-parse", "sed"],
    "tool_selected": "ast-parse",
    "reason": "Need semantic understanding",
    "confidence": 0.87
}
Problem

Complex agents behave like black boxes; users cannot understand why decisions were made or debug unexpected behavior.

Solution

Provide on-demand verbose mode (e.g., Ctrl+R) showing reasoning steps, tool selection rationale, and raw outputs.
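
A sketch of the underlying trace buffer: record reasoning events cheaply as they occur and render them only when the user toggles the verbose view. The field names mirror the JSON example above; the class itself is illustrative.

class ReasoningTrace:
    def __init__(self):
        self.events = []

    def record(self, **event) -> None:
        self.events.append(event)        # no rendering cost until requested

    def render(self) -> str:             # called on the verbose toggle
        return "\n".join(f"[{i}] {e}" for i, e in enumerate(self.events))

trace = ReasoningTrace()
trace.record(tool_selected="ast-parse",
             reason="Need semantic understanding",
             confidence=0.87)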

When to use
  • Debugging unexpected agent behavior
  • Building trust in agent decisions
  • Learning effective prompting
Trade-offs

Pros

  • Better debugging and trust
  • Helps users improve prompts

Cons

  • Information overload risk
  • May expose internal complexity
View Original →

Agent-Friendly Workflow Design

UX & Collaboration
Flow
Traditional ──▶ Agent-Friendly?
     │               │
     │     ┌─────────┴─────────┐
     │     │                   │
     ▼     ▼                   ▼
  Redesign:              Already OK
  ├── Clear Goals           │
  ├── Autonomy              │
  ├── Structured I/O        │
  └── Feedback Loops        │
          │                 │
          └────────┬────────┘
                   ▼
            Enhanced Performance
flowchart TD A[Traditional Workflow] --> B{Agent-Friendly?} B -->|No| C[Redesign] C --> D[Clear Goals] C --> E[Appropriate Autonomy] C --> F[Structured I/O] C --> G[Feedback Loops] D & E & F & G --> H[Optimized Workflow] B -->|Yes| H
Example
# Bad: Micromanaging
agent.do("Use React, then add useState...")

# Good: Goal-oriented with autonomy
agent.do(
    goal="Build login form",
    constraints=["use existing auth"],
    freedom="choose implementation"
)
Problem

Agents struggle when workflows are too rigid or when humans micromanage technical decisions

Solution

Design workflows with clear goals, appropriate autonomy, structured I/O, and iterative feedback loops

When to use
  • Integrating agents into existing processes
  • Agent performance is suboptimal
  • Building new human-AI workflows
Trade-offs

Pros

  • Maximizes agent capability
  • Better human-AI collaboration

Cons

  • Requires workflow redesign
  • May require building trust
View Original →