Agentic Patterns Snippets

A reference for AI agent design patterns · Original: agentic-patterns.com

Context-Minimization Pattern

Context & Memory
Flow
User Input ──▶ [Transform] ──▶ Safe Output
    │              │
    │         ┌────┴────┐
    └────────▶│ REMOVE  │
              │ tainted │
              └─────────┘

Context:  [████ tainted ████] → [██ clean ██]
Example
sql = LLM("to SQL", user_prompt)
remove(user_prompt)  # tainted tokens gone
rows = db.query(sql)
answer = LLM("summarize", rows)  # clean context
Problem

User-supplied text lingers in context, enabling it to influence later generations and potentially inject malicious instructions

Solution

Purge untrusted segments after transforming into safe intermediate. Later reasoning sees only trusted data

When to use
  • Customer service chat
  • Medical Q&A systems
  • Multi-turn flows where input shouldn't steer later steps
Trade-offs

Pros

  • Simple, no extra models
  • Prevents prompt injection

Cons

  • Loses conversational nuance
  • May hurt UX if too aggressive
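
Sketch

A minimal runnable version of the purge step; llm() and run_query() are hypothetical stand-ins for a real model call and database client.

def llm(instruction: str, data: str) -> str:
    return f"[{instruction}] {data[:40]}"      # stub model call

def run_query(sql: str) -> list:
    return [{"total": 42}]                     # stub database

def answer(user_prompt: str) -> str:
    sql = llm("translate to SQL", user_prompt) # transform tainted input
    user_prompt = None                         # purge: raw text goes no further
    rows = run_query(sql)
    return llm("summarize", str(rows))         # later call sees only trusted rows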

Context Window Anxiety Management

Context & Memory
Flow
Context Window: [████████████░░░░░░░░] 60%

Without Management:          With Management:
├─ "Running low..."         ├─ "Plenty of space"
├─ Summarize early          ├─ Continue working
└─ Rush completion          └─ Thorough output

Buffer: Enable 1M ──▶ Cap at 200k ──▶ Psychological runway
Example
prompt = """
CONTEXT GUIDANCE: You have 200k+ tokens.
Do NOT rush or summarize prematurely.
""" + user_input + """
Remember: Context is NOT a constraint.
"""
Problem

Models exhibit 'context anxiety' near window limits, prematurely summarizing or rushing to complete tasks.

Solution

Provide buffer headroom and explicit reassurance that context is abundant to override anxiety behaviors.

When to use
  • Long coding or research sessions
  • Tasks requiring sustained attention
  • Model mentions 'running out of space'
Trade-offs

Pros

  • Prevents premature task abandonment
  • Enables more thorough work
  • Overcomes model behavioral quirks

Cons

  • Requires model-specific tuning
  • May increase token usage
  • Aggressive prompting overhead

Dynamic Context Injection

Context & Memory
Flow
User ──▶ "@Button.tsx" ──▶ [Read File] ──▶ Inject
                                              │
User ──▶ "/user:deploy" ──▶ [Load Cmd] ──────┤
                                              ▼
                         Agent [Enriched Context] ──▶ Continue
Example
# File injection via @mention
@src/components/Button.tsx

# Slash command injection
/user:deployment
# Loads ~/.claude/commands/deployment.md

# Both inject into agent context
Problem

Agents often need specific context on-demand, but constantly editing static files or pasting text is inefficient.

Solution

Use @mentions for files and /slash commands for reusable prompts to dynamically inject context during sessions.

When to use
  • Need specific file contents mid-task
  • Frequently reuse complex instructions
  • Want fluid context management
Trade-offs

Pros

  • Targeted context injection
  • Reusable slash commands
  • Efficient lazy loading

Cons

  • Must learn special syntax
  • May inject too much context
  • Command setup overhead
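
Sketch

A rough sketch of the expansion step, assuming the ~/.claude/commands layout from the example; the regexes are illustrative, not the actual syntax parser.

import re
from pathlib import Path

COMMANDS = Path.home() / ".claude" / "commands"

def expand(message: str) -> str:
    # Replace @path mentions with the file's contents
    message = re.sub(r"@(\S+)",
                     lambda m: Path(m.group(1)).read_text(), message)
    # Replace /user:<name> with the stored command prompt
    return re.sub(r"/user:(\w+)",
                  lambda m: (COMMANDS / (m.group(1) + ".md")).read_text(),
                  message)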

Filesystem-Based Agent State

Context & Memory
Flow
workspace/
├── state/
│   ├── step1_results.json  ◄── Checkpoint
│   ├── step2_results.json  ◄── Checkpoint
│   └── progress.txt
└── logs/
    └── execution.log

Step1 ──▶ [Save] ──▶ Step2 ──▶ [Save] ──▶ Step3
              │                    │
              ▼                    ▼
         [Interrupt]          [Resume]
Example
import json, os

if os.path.exists("state/step1.json"):
    with open("state/step1.json") as f:
        data = json.load(f)
else:
    data = perform_step1()
    with open("state/step1.json", "w") as f:
        json.dump(data, f)
# Resume from checkpoint if interrupted
Problem

Long-running agent workflows lose all progress when interrupted, as state in context window does not persist across sessions.

Solution

Persist intermediate results to files, creating durable checkpoints that enable workflow resumption and failure recovery.

When to use
  • Multi-step workflows with expensive operations
  • Long-running tasks exceeding session limits
  • Workflows needing recovery from failures
Trade-offs

Pros

  • Enables workflow resumption
  • Protects against data loss

Cons

  • Requires checkpoint/recovery logic
  • File I/O adds overhead
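
Sketch

The checkpoint logic generalizes to a small decorator; a sketch assuming steps return JSON-serializable results.

import functools, json, os

def checkpointed(path):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if os.path.exists(path):
                with open(path) as f:
                    return json.load(f)         # resume from checkpoint
            result = fn(*args, **kwargs)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            with open(path, "w") as f:
                json.dump(result, f)            # durable checkpoint
            return result
        return inner
    return wrap

@checkpointed("state/step1_results.json")
def perform_step1():
    return {"rows": 128}                        # expensive work happens here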

Layered Configuration Context

Context & Memory
Flow
Enterprise ──┐
  ~/.claude ─┼──▶ Merge ──▶ Agent Context
   project/ ─┤
    .local ──┘

Priority: local > project > user > enterprise
Example
# Auto-discovered hierarchy
/enterprise/CLAUDE.md    # Org policies
~/.claude/CLAUDE.md      # User prefs
./CLAUDE.md              # Project rules
./CLAUDE.local.md        # Personal overrides
Problem

Manually providing context in every prompt is cumbersome; a single global context file is either too broad or too narrow.

Solution

Auto-discover and merge layered config files (enterprise, user, project, local) by filesystem hierarchy.

When to use
  • Multi-project environments
  • Team-wide policies
  • Personal customization
Trade-offs

Pros

  • Zero manual context
  • Scoped customization

Cons

  • Merge conflicts
  • Discovery complexity
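
Sketch

A minimal merge in priority order, assuming the four locations from the example; higher-priority layers are appended last so their rules take precedence.

from pathlib import Path

LAYERS = [                                  # lowest to highest priority
    Path("/enterprise/CLAUDE.md"),
    Path.home() / ".claude" / "CLAUDE.md",
    Path("CLAUDE.md"),
    Path("CLAUDE.local.md"),
]

def merged_context() -> str:
    parts = [p.read_text() for p in LAYERS if p.is_file()]
    return "\n\n".join(parts)               # local overrides appear last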

Curated Code Context Window

Context & Memory
Flow
MainAgent: "Find UserModel definitions"
     │
     ▼
SearchSubagent ──▶ [Index] ──▶ Top 3 snippets
     │
     ▼
Context: [user_service.py] [models/user.py] [auth.py]
     │
     ▼
MainAgent: edit_file(UserService) ✓
Example
search = SearchSubagent(index="code_index")
snippets = search.find("UserModel", top_k=3)

context.inject(snippets)  # Only 3 files
agent.edit("UserService")  # Focused work
Problem

Dumping entire repositories into context overwhelms the model with noise and slows inference.

Solution

Use a search subagent to inject only top-K relevant code snippets into the main agent's context.

When to use
  • Working in large codebases
  • Need focused reasoning on specific modules
  • Token efficiency is critical
Trade-offs

Pros

  • Noise reduction improves clarity
  • Dramatically reduces token usage
  • Mitigates context anxiety

Cons

  • Index must stay fresh
  • Adds search subagent complexity
  • May miss edge-case dependencies
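
Sketch

A toy search subagent, sketched as keyword-hit ranking over a source tree; a real implementation would query an embedding or symbol index instead.

from pathlib import Path

def top_snippets(root: str, query: str, top_k: int = 3):
    scored = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        hits = text.lower().count(query.lower())
        if hits:
            scored.append((hits, str(path), text[:400]))  # head as snippet
    scored.sort(key=lambda t: -t[0])
    return scored[:top_k]                   # only these enter the context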

Episodic Memory Retrieval

Context & Memory
Flow
Episode Done ──▶ Write Memory Blob ──▶ [Vector DB]
                                           │
New Task ──▶ Embed Prompt ──▶ Query ───────┘
                               │
                               ▼
                         top-k Memories
                               │
                               ▼
               Agent [Context + Hints] ──▶ Execute
Example
# After episode completion
memory_db.write({
    "event": "refactored auth module",
    "outcome": "broke session handling",
    "rationale": "missed dependency"
})

# On new task
hints = memory_db.retrieve(task, top_k=3)
agent.execute(task, context_hints=hints)
Problem

Stateless calls make agents forget prior decisions, causing repetition and shallow reasoning.

Solution

Add vector-backed episodic memory: store event/outcome/rationale blobs, retrieve top-k similar memories, inject as hints.

When to use
  • Long-running agent sessions
  • Need continuity across tasks
  • Avoiding repeated mistakes
Trade-offs

Pros

  • Richer continuity
  • Fewer repeated mistakes
  • Learns from past

Cons

  • Retrieval noise if uncurated
  • Storage cost
  • Stale memory issues
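
Sketch

A self-contained toy version; embed() stands in for a real embedding model and a plain list stands in for the vector DB.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicMemory:
    def __init__(self, embed):
        self.embed, self.rows = embed, []

    def write(self, blob: dict):
        self.rows.append((self.embed(str(blob)), blob))

    def retrieve(self, task: str, top_k: int = 3):
        q = self.embed(task)
        ranked = sorted(self.rows, key=lambda r: -cosine(r[0], q))
        return [blob for _, blob in ranked[:top_k]]  # inject as hints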

Memory Synthesis from Execution Logs

Context & Memory
Flow
Task Diaries              Synthesis
────────────              ─────────
[Diary 1] ──┐
[Diary 2] ──┼──▶ Synthesis Agent ──▶ Patterns
[Diary 3] ──┘              ↓
                    CLAUDE.md / Commands
Example
# Synthesis prompt
synthesis_agent("""
Review 50 task diaries.
Find patterns appearing 3+ times.
Output: rules, commands, tests
""")
Problem

Task logs contain valuable learnings but are too specific; hard to know which patterns generalize.

Solution

Write structured task diaries, then periodically synthesize across logs to extract reusable patterns.

When to use
  • Recurring task types
  • Learning from failures
  • Building organizational memory
Trade-offs

Pros

  • Pattern discovery
  • Evidence-backed rules

Cons

  • Storage overhead
  • False pattern risk

Proactive Agent State Externalization

Context & Memory
Flow
Agent ──▶ [Work] ──▶ Write Notes ──▶ [state.md]
                          │
                          ▼
               ┌─────────────────────┐
               │  Template Schema    │
               │  • Objective        │
               │  • Progress         │
               │  • Knowledge Gaps   │
               └─────────────────────┘
                          │
                          ▼
                Validate ──▶ External Memory
Example
class StateManager:
    def capture_state(self, agent_notes):
        structured = self.parse_notes(agent_notes)
        missing = self.validate(structured)
        if missing:
            return self.prompt_clarification(missing)
        return self.merge_with_memory(structured)
Problem

Models proactively write notes to preserve state, but self-generated summaries are often incomplete and may consume tokens better spent on task execution.

Solution

Provide structured templates and validation for agent self-documentation, combining agent notes with external memory systems as fallback.

When to use
  • Long-running development sessions
  • Multi-session research tasks
  • Subagent coordination requiring state communication
Trade-offs

Pros

  • Leverages natural model behavior
  • Enables session continuity
  • Creates audit trails

Cons

  • May consume tokens on documentation over progress
  • Risk of incomplete self-assessment
  • Requires validation overhead

Curated File Context Window

Context & Memory
Flow
Task: "Add validation to signup()"
           │
           ▼
  ┌─────────────────────────────────────┐
  │ PRIMARY: UserController.java (full) │
  ├─────────────────────────────────────┤
  │ CONTEXT SNIPPETS:                   │
  │  - UserService.validateUser()       │
  │  - SignupDTO: fields + annotations  │
  └─────────────────────────────────────┘
           │
           ▼
      Agent: Focused edit ✓
Example
primary = load_full("UserController.java")
secondary = search.find("signup", top_n=5)
snippets = [extract_methods(f) for f in secondary]

context = f"""
### PRIMARY: {primary}
### SNIPPETS: {snippets}
"""
Problem

Loading all files into prompt exceeds token limits and introduces noise from unrelated code.

Solution

Load only primary files plus summarized secondary files from a search subagent's ranked results.

When to use
  • Multi-file refactoring tasks
  • Feature implementation in large repos
  • Need to minimize hallucinations
Trade-offs

Pros

  • Minimal prompt size, on-target
  • Faster responses, fewer hallucinations
  • Scales to large repositories

Cons

  • Requires file-search service
  • May miss critical files if ranking is off
  • Index needs to stay current

Agent-Powered Codebase Q&A / Onboarding

Context & Memory
Flow
Developer                  Agent                  Codebase
    │                        │                        │
    │  "Where is DB config?" │                        │
    ├───────────────────────▶│                        │
    │                        │   Search/Index/Analyze │
    │                        ├───────────────────────▶│
    │                        │◀───────────────────────┤
    │  "config/database.js"  │                        │
    │◀───────────────────────┤                        │
Example
# Natural language codebase Q&A
agent.ask("How does user auth work?")
# Agent searches, analyzes, responds:
# "Auth is in auth/service.py, uses JWT,
#  called by UserController.login()"
Problem

Understanding and onboarding onto a large codebase is hard, and manually tracing code paths is time-consuming

Solution

An AI agent with search, indexing, and Q&A capabilities answers natural-language questions and explains how the code is structured

When to use
  • Onboarding onto a new project
  • Debugging complex systems
  • Understanding how pieces of code interact
  • Quickly grasping the purpose of a specific module or file
Trade-offs

Pros

  • Drastically shortens onboarding time
  • Explore the codebase in natural language
  • Accurate, context-grounded answers

Cons

  • Answer quality depends on indexing quality
  • Indexing cost for large codebases

Background Agent with CI Feedback

Feedback Loops
Flow
Dev ──▶ Agent: "Upgrade to React 19"
              │
              ▼
        [Push Branch] ──▶ CI Tests
              │               │
              │◀──── 12 fails ┘
              ▼
        [Patch imports]
              │
              ▼
        [Re-run CI] ──▶ All Green
              │
              ▼
        Agent ──▶ Dev: "PR ready!"
Example
# Agent runs in background
git checkout -b react19-upgrade
# Make changes...
git push origin react19-upgrade

# CI runs automatically
# Agent polls for results
# On failure: patch and retry
# On success: notify developer
Problem

Long-running tasks tie up the editor and require developers to babysit the agent throughout execution.

Solution

Run agent asynchronously: push branch, wait for CI, ingest pass/fail output, iterate automatically, and notify when green.

When to use
  • Long-running upgrade or refactoring tasks
  • Developer wants to work on other things
  • Mobile kick-offs (e.g., fix tests while away)
Trade-offs

Pros

  • Developer freed from babysitting
  • Uses existing CI as feedback loop
  • Fully autonomous iteration

Cons

  • Requires CI integration and permissions
  • May iterate on wrong direction without oversight
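
Sketch

A sketch of the background loop; ci_status() and patch_failures() are hypothetical hooks onto your CI provider and agent.

import time

def ci_status(branch: str) -> str:
    # Hypothetical: query the CI API; returns "pending" | "failed" | "green"
    ...

def background_loop(branch: str, patch_failures, max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        while (status := ci_status(branch)) == "pending":
            time.sleep(60)                  # poll; developer works elsewhere
        if status == "green":
            return True                     # notify: PR ready
        patch_failures(branch)              # ingest failures, push fixes
    return False                            # escalate to a human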

Dogfooding with Rapid Iteration

Feedback Loops
Flow
┌─────────────────────────────────────────────┐
│              DOGFOODING LOOP                │
└─────────────────────────────────────────────┘

Build ──▶ Use Daily ──▶ Find Issues ──▶ Fix Fast
  ▲                                        │
  └────────────────────────────────────────┘

Team(70-80%) ──▶ Feedback(5min) ──▶ Iterate
Example
# Anthropic "ant-fooding" approach
team_adoption = 0.80  # 80% daily usage
feedback_interval = "5min"

# Feature validation flow
feature = push_to_internal()
feedback = collect_team_feedback(feature)
if not feedback.positive:
    unship(feature)  # Fast pivot
Problem

External feedback loops are slow and simulated environments miss real-world nuances for agent improvement.

Solution

Development team uses their own AI agent daily, creating tight feedback loops for rapid iteration and honest assessment.

When to use
  • Building AI-assisted dev tools
  • Need rapid feature validation
  • Team wants unfiltered feedback
Trade-offs

Pros

  • Direct, immediate feedback
  • Real-world problem testing
  • Fast iteration cycles

Cons

  • May bias toward developer needs
  • Internal users != all users
  • Requires team adoption commitment

Graph of Thoughts (GoT)

Feedback Loops
Flow
        ┌── T1 ──┐
        │        ▼
Problem ┼── T2 ──┼──▶ Aggregate ──▶ Refine ──▶ Solution
        │        ▲         ▲
        └── T3 ──┘         │
             │             │
             └── Loop ─────┘

Branch → Aggregate → Refine → Loop back if needed
Example
got = GraphOfThoughts(llm, max_thoughts=50)
got.add_thought(root)
for thought in thoughts_to_expand:
    got.branch_thought(thought)   # Generate alternatives
    got.aggregate_related()       # Combine insights
    got.refine_thought(thought)   # Improve based on context
return got.extract_best_solution()
Problem

Linear Chain-of-Thought reasoning cannot handle problems with complex interdependencies requiring paths that merge, split, and recombine.

Solution

Represent reasoning as a directed graph where thoughts can branch, aggregate, refine, and loop back for iterative improvement.

When to use
  • Complex problems with interdependent reasoning
  • Tasks requiring insight aggregation
  • Problems needing iterative refinement
Trade-offs

Pros

  • Handles non-linear reasoning
  • Combines insights from multiple paths

Cons

  • Higher computational cost
  • Complex to implement

Reflection Loop

Feedback Loops
Flow
          ┌─────────── Feedback ──────────┐
          │                               │ No
          ▼                               │
Input ──▶ Generate ──▶ Evaluate ──▶ [score ≥ θ?] ──▶ Done
                                                 Yes

Quality:  ░░░░ → ▒▒▒▒ → ▓▓▓▓ → ████
Pseudo
for attempt in range(max_iters):
    draft = generate(prompt)
    score, critique = evaluate(draft, metric)

    if score >= threshold:
        return draft

    prompt = incorporate(critique, prompt)

return draft  # best effort
Problem

If an LLM never reviews its own output, quality may be low or requirements may go unmet

Solution

After generating a draft, loop through self-evaluation → feedback incorporation → regeneration until the criteria are met

When to use
  • Quality or compliance with explicit criteria matters
  • Code, writing, and reasoning tasks
  • A clear evaluation metric exists
Trade-offs

Pros

  • Automatic quality improvement
  • Enforces explicit criteria

Cons

  • Extra compute cost
  • Can loop forever if the metric is ambiguous

Spec-As-Test Feedback Loop

Feedback Loops
Flow
Spec Change ──▶ Generate Tests ──▶ Run Tests ──┐
                                                │
    ┌───────────────────────────────────────────┘
    │  Fail?
    ▼
Agent PR: Fix Code or Flag Spec ──▶ Review
Example
def on_commit(spec):  # fires on every spec or code commit
    tests = generate_tests(spec.latest)
    result = run_tests(tests)
    if result.failed:
        create_pr(fix_or_flag(result))
Problem

Implementations can drift from specs as code evolves, causing silent divergence

Solution

Auto-generate executable tests from specs and run on every commit, with agent-authored PRs for fixes

When to use
  • Spec-first development workflows
  • Critical systems requiring spec-impl sync
  • Continuous integration environments
Trade-offs

Pros

  • Catches drift early
  • Keeps spec and impl in lock-step

Cons

  • Heavy CI usage
  • False positives with loose spec wording

Tool Use Incentivization via Reward Shaping

Feedback Loops
Flow
Agent ──▶ compile() ──▶ [+1.0] ──▶ lint() ──▶ [+0.5]
                │                       │
                └───────────────────────┴──▶ test() ──▶ [+2.0]
                                                         │
                                              ∑ rewards ──▶ Policy Update
Example
# RL step: shaped rewards for tool calls
if action == "compile":
    local_reward = 1 if compile_success else -0.5
elif action == "run_tests":
    local_reward = 2 if new_tests_passed else 0
trajectory.append((state, action, local_reward))
Problem

Agents underutilize tools (compilers, linters, tests) and default to internal thinking tokens instead of invoking external tools.

Solution

Provide dense shaped rewards for each intermediate tool invocation (+1 compile, +2 test pass) to guide policy toward tool usage.

When to use
  • RL-based agent training
  • Tool use is underutilized
  • Sparse final rewards are insufficient
Trade-offs

Pros

  • Denser feedback guides step-by-step
  • Encourages tool adoption

Cons

  • Reward engineering overhead
  • May game intermediate rewards
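
Sketch

The shaping table and episode return, sketched with the reward values from the example above; treat the numbers as tunable assumptions.

TOOL_REWARDS = {
    ("compile", True): 1.0,   ("compile", False): -0.5,
    ("lint", True): 0.5,      ("lint", False): 0.0,
    ("run_tests", True): 2.0, ("run_tests", False): 0.0,
}

def shaped_return(trajectory):
    # trajectory: [(action, success), ...] from one episode
    return sum(TOOL_REWARDS.get(step, 0.0) for step in trajectory)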

Coding Agent CI Feedback Loop

Feedback Loops
Flow
Agent ──▶ [Create Branch] ──▶ CI
              │                  │
              │◀── Partial Fails ┘
              ▼
        [Patch Specific Files]
              │
              ▼
        [Re-run Failed Tests] ──▶ CI
              │                    │
              │◀──── Still Fails? ─┘
              │        │
              │        └──▶ Patch Again
              ▼
        All Green ──▶ User: "PR Ready"
Example
# Error parsing from CI logs
def parse_ci_failures(logs):
    return [
        {"file": "auth.py", "line": 42,
         "error": "Expected status 200"}
    ]

# Prioritized re-run
agent.run_tests(only_files=patched_files)
# Notify on completion
if all_green: notify_user("PR ready")
Problem

Synchronous test runs block the agent from parallel work, creating idle compute and inflated training times as the agent babysits builds.

Solution

Run agent async against CI: push branch, poll for partial failures, patch iteratively, notify on final green.

When to use
  • Multi-file refactors or feature additions
  • Long test suites that block iteration
  • Need to maximize compute utilization
Trade-offs

Pros

  • Compute efficiency - overlaps generation and testing
  • Faster iteration with less waiting
  • Autonomous until final green

Cons

  • CI flakiness can mislead patches
  • Security - agent needs CI push/read permissions
View Original →

Inference-Healed Code Review Reward

Feedback Loops
Flow
             ┌── Correctness ──▶ 1.0 ──┐
             │                         │
Patch ──▶ Critic ─── Style ──────▶ 0.8 ──┼──▶ Weighted ──▶ 0.7
             │                         │      Sum
             ├── Performance ──▶ 0.4 ──┤
             │                         │
             └── Security ────▶ 0.6 ──┘

+ CoT: "O(n^2) loop caused perf regression"
Example
# Example weights (assumed); tune per project
w = {"correctness": 0.5, "style": 0.1,
     "performance": 0.2, "security": 0.2}
subscores = {
    "correctness": test_critic.score(patch),
    "style": linter_critic.score(patch),
    "performance": perf_critic.score(patch),
    "security": security_critic.score(patch),
}
final = sum(w[k]*subscores[k] for k in subscores)
return final, subscores, "O(n^2) loop detected"
Problem

Binary 'tests passed' rewards miss nuanced code quality issues like performance regressions, style violations, and security problems.

Solution

Use a multi-criteria code review critic that decomposes quality into subcriteria (correctness, style, perf, security) with explainable subscores.

When to use
  • RL-based code generation agents
  • When code quality beyond correctness matters
  • Continuous improvement of code agents
Trade-offs

Pros

  • Explainable feedback for targeted fixes
  • Higher code quality via non-functional criteria

Cons

  • Compute overhead
  • Critic model maintenance

Rich Feedback Loops

Feedback Loops
Flow
Agent ──▶ [Action] ──▶ [Tool] ──▶ Feedback ──┐
  ▲                                          │
  │       ┌──────────────────────────────────┘
  │       ▼
  └─── [Parse] ◀── errors, test fails, lint, screenshots

Loop: action → feedback → fix → action → ...
Concept
# Rich feedback > perfect prompts
result = agent.run(task)
feedback = get_diagnostics(result)  # errors, lint
if feedback.has_issues:
    agent.fix(feedback)  # self-debugging loop
Problem

Polishing a single prompt can't cover every edge case. Agents need ground truth to self-correct

Solution

Expose iterative, machine-readable feedback (compiler errors, test failures, lint) after every tool call. Agent uses diagnostics to self-debug

When to use
  • Code generation tasks
  • Any task with verifiable output
  • When tests/linters are available
Trade-offs

Pros

  • Emergent self-debugging
  • Better than bigger prompts

Cons

  • Requires feedback infrastructure
  • More iterations = more tokens

Self-Critique Evaluator Loop

Feedback Loops
Flow
Instruction ──▶ Generate Candidates ──▶ [A, B, C]
                                            │
                                            ▼
                                   Judge: "B > A > C"
                                   (with reasoning)
                                            │
                                            ▼
                                  Fine-tune on Judgments
                                            │
                     ┌──────────────────────┴──────────────────────┐
                     ▼                                             ▼
              Improved Evaluator                          Use as Reward Model
Example
def self_taught_evaluator_loop(model, instructions):
    candidates = [model.generate(i) for i in instructions]
    judgments = model.judge_and_explain(candidates)
    model.finetune(judgments)  # Train on own traces
    return model  # Now better at evaluation
Problem

Human preference labels are costly and quickly become outdated as base models improve, creating a bottleneck for reward model training.

Solution

Train a self-taught evaluator that bootstraps from synthetic data: generate candidates, judge with reasoning traces, fine-tune on its own judgments, and iterate.

When to use
  • Human labels too expensive to scale
  • Base models evolving rapidly
  • Need automated quality gates
Trade-offs

Pros

  • Near-human eval accuracy without labels
  • Scales with compute
  • Self-improving over time

Cons

  • Risk of evaluator-model collusion
  • Needs adversarial testing
  • May amplify systematic errors

Self-Discover Reasoning Structures

Feedback Loops
Flow
Task ──▶ Analyze ──▶ Select Modules ──▶ Adapt ──▶ Compose
                         │
         ┌───────────────┴───────────────┐
         │  Module Library               │
         │  • Break into steps           │
         │  • Work backwards             │
         │  • Find patterns              │
         └───────────────────────────────┘
                                              │
                                              ▼
                              Task-Specific Reasoning Structure
                                              │
                                              ▼
                                         Execute ──▶ Solution
Example
def self_discover_solve(task, modules):
    # Select relevant modules for this task
    selected = llm.select(task, modules)
    # Adapt to specific problem
    adapted = llm.adapt(task, selected)
    # Compose reasoning structure
    structure = llm.compose(adapted)
    return llm.solve_with(task, structure)
Problem

Different reasoning tasks require different thinking strategies. Fixed reasoning patterns like Chain-of-Thought may be suboptimal for diverse problems.

Solution

Enable LLMs to automatically discover and compose task-specific reasoning structures by selecting and adapting atomic reasoning modules to match the problem's characteristics.

When to use
  • Diverse reasoning tasks
  • Standard CoT underperforming
  • Novel problem types
Trade-offs

Pros

  • Up to 32% improvement over CoT
  • Creates reusable reasoning templates
  • Adapts to novel problems

Cons

  • Overhead for structure discovery
  • May over-engineer simple problems
  • Depends on task analysis accuracy

Agent Reinforcement Fine-Tuning

Learning & Adaptation
Flow
Sample ──▶ Model ──▶ Tool Call? ──▶ Your Endpoint
              ▲          │              │
              │          ▼              ▼
              │     Final Answer   Add to Context
              │          │              │
              │          ▼              │
              │      Grader ◀───────────┘
              │          │
              └── Reward ┘

[End-to-end training with real tools]
Example
client.fine_tuning.jobs.create(
    model="gpt-4o",
    method="rft",
    rft={
        "tools": [{"url": "https://api/search"}],
        "grader": {"type": "model", "model": "gpt-4o"},
        "hyperparameters": {"compute_multiplier": 1}
    }
)
Problem

Base models underperform on domain-specific tasks due to distribution shift and inefficient tool use

Solution

Train the model end-to-end on agentic tasks with real tool calls, custom graders, and multi-step reinforcement learning

When to use
  • Distribution shift from the base model
  • Inefficient tool-use patterns
  • 100-1000 quality samples are available
Trade-offs

Pros

  • End-to-end optimization
  • Sample-efficient (100-1000 samples)

Cons

  • Infrastructure complexity
  • Requires careful reward engineering

AI-Assisted Code Review

Feedback Loops
Flow
Code ──▶ AI Analyzer ──▶ Issues/Summary ──▶ Human Review
  │                            │                   │
  │   "Explain this change?"   │                   │
  └────────────────────────────┴───── Q&A ◀───────┘

Intent Alignment: [Mind's Eye] ◀─── Verify ─── [Generated Code]
Example
# PR Review workflow
ai_review = agent.analyze_pr(diff)
print(ai_review.summary)
print(ai_review.issues)

# Interactive Q&A
answer = agent.explain("Why was this refactored?")
if aligned_with_intent(answer):
    pr.approve()
Problem

As the volume of AI-generated code grows, verification becomes the bottleneck; checking alignment with intent matters as much as syntactic correctness

Solution

Use AI to analyze code changes, surface issues and summaries, and verify intent alignment through interactive Q&A

When to use
  • PR reviews of AI-generated code
  • Reviewing large-scale codebase changes
  • Verifying the output of vaguely specified tasks
  • Improving review efficiency across the team
Trade-offs

Pros

  • Faster, more efficient reviews
  • AI explanations improve understanding
  • Enables intent-alignment verification

Cons

  • Depends on the accuracy of AI analysis
  • Over-trust risks overlooking errors

Compounding Engineering Pattern

Learning & Adaptation
Flow
┌─────────────────────────────────────────────────────┐
│               COMPOUNDING LOOP                      │
│                                                     │
│  Build ──▶ Learn ──▶ Codify ──▶ Easier Build       │
│    ▲                               │                │
│    └───────────────────────────────┘                │
│                                                     │
│  Outputs: CLAUDE.md, /commands, hooks, subagents   │
└─────────────────────────────────────────────────────┘
Example
# After completing feature:
1. Update CLAUDE.md with patterns
2. Create /test-with-validation command
3. Add pre-commit hook for edge cases
4. Build security-review subagent
Problem

Traditional engineering has diminishing returns; AI agents repeat mistakes because learnings aren't codified.

Solution

Codify all learnings from each feature into prompts, commands, and hooks to make subsequent features easier.

When to use
  • Building features with AI agents
  • Onboarding new team members
  • Agents repeating the same mistakes
Trade-offs

Pros

  • Accelerating productivity over time
  • Knowledge preserved beyond individuals
  • Better agent and human onboarding

Cons

  • Upfront documentation time
  • Risk of prompt bloat
  • Requires extensible agent system

Skill Library Evolution

Learning & Adaptation
Flow
Ad-hoc ──▶ Save ──▶ Reusable ──▶ Documented ──▶ Capability
  │         │          │            │              │
  └─────────┴──────────┴────────────┴──────────────┘
                    skills/
              ├── sentiment.py
              ├── pdf_convert.py
              └── api_summary.py
Concept
# Session 1: Save working solution
with open("skills/sentiment.py", "w") as f:
    f.write(working_code)

# Session N: Reuse existing skill
from skills.sentiment import analyze
result = analyze(text)  # no rediscovery
Problem

Agents solve similar problems across sessions but must rediscover solutions each time, wasting tokens

Solution

Persist working code as reusable functions in skills/ directory. Over time, evolve into documented, tested capabilities

When to use
  • Repetitive problem-solving across sessions
  • Organization wants agents to build capability over time
  • Code reuse is valuable
Trade-offs

Pros

  • Builds agent capability over time
  • Reduces token consumption

Cons

  • Requires discipline to organize
  • Skills can become stale

Variance-Based RL Sample Selection

Learning & Adaptation
Flow
Score
1.0 ●━━━━●━━━━●━━━━●     ← Always correct (no learning)
    ┃
0.5 ┃  ●━━●━━●           ← HIGH VARIANCE (train here!)
    ┃  ┃  ▼
0.0 ●━━●━━━━●━━━━●━━━●   ← Always wrong (no learning)
    └──┴──┴──┴──┴──┴──▶
       Sample Index

Legend: ● best │ ━ mean │ ▼ variance range
Example
import numpy as np

# Run multiple evals per sample
high_variance_samples = []
for sample in dataset:
    scores = [agent.eval(sample) for _ in range(3)]
    variance = np.var(scores)
    if variance > 0.01 and 0 < np.mean(scores) < 1:
        high_variance_samples.append(sample)
Problem

Zero-variance samples (always correct or always wrong) provide no learning signal, wasting compute in RL training.

Solution

Run multiple baseline evals per sample, plot variance, prioritize high-variance samples where model sometimes succeeds.

When to use
  • RL training with limited budget
  • Dataset may contain many solved/unsolvable samples
  • Need to estimate improvement potential
Trade-offs

Pros

  • Data efficiency - focus on learnable samples
  • Predictive - estimate potential before training

Cons

  • Upfront eval cost (3-5x baselines)
  • Variance changes during training

Sub-Agent Spawning

Orchestration
Flow
                       ┌─── Sub1 ──▶ [████] ───┐
                       │                       │
Main ──▶ Split(n) ────┼─── Sub2 ──▶ [████] ───┼──▶ Merge ──▶ Done
                       │                       │
                       └─── Sub3 ──▶ [████] ───┘

Context:  Main[████████████]  →  Sub[██] Sub[██] Sub[██]
Example
main_agent.spawn_subagents(
    task="Refactor YAML front-matter",
    files=glob("*.md"),  # 36 files
    agents=3,
    per_agent=12
)
# Each subagent gets fresh context
# Main agent merges results
Problem

On large multi-file tasks, the main agent's context window balloons past its reasoning budget

Solution

Spawn independent subagents, distribute the work in parallel, then merge the results

When to use
  • Multi-file edits (10+ files)
  • Work that splits into independent chunks
  • Sequential processing is too slow
Trade-offs

Pros

  • Speedup from parallel processing
  • Context isolation

Cons

  • Inter-agent coordination complexity
  • Cost of merging results

Action-Selector Pattern

Orchestration & Control
Flow
User Prompt ──▶ LLM ──▶ Action ID ──▶ Execute
                 │                        │
                 │    ┌───────────────────┘
                 │    │
                 ▼    ▼
            Allowlist Check    Tool Output
                              (NOT fed back!)

[LLM as decoder only - no feedback loop]
Example
allowlist = ["check_balance", "transfer", "history"]
action = llm.translate(user_prompt, allowlist)
if action not in allowlist:
    raise ValueError("action not permitted")  # reject anything off-list
result = execute(action)
# CRITICAL: result NOT returned to LLM
return result  # Direct to user
Problem

When tool feedback re-enters the context window, untrusted input can hijack the agent's reasoning

Solution

Use the LLM only to map requests onto pre-approved actions; tool output is never fed back to the model

When to use
  • High-security environments
  • Customer service bots
  • Constrained action sets (kiosks, routers)
Trade-offs

Pros

  • Nearly immune to prompt injection
  • Very easy to audit

Cons

  • Limited flexibility
  • New capabilities require code changes

Autonomous Workflow Agent Architecture

Orchestration
Flow
Workflow ──▶ Container ──▶ tmux Sessions
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
       [Session 1]      [Session 2]      [Session 3]
            │                 │                 │
            └─────────┬───────┘                 │
                      ▼                         │
               Monitor & Wait ◀─────────────────┘
                      │
              Error? ─┼─▶ Retry/Recover
                      │
                      ▼
               Checkpoint ──▶ Next Step
Example
class WorkflowAgent:
    def execute_workflow(self, steps):
        for step in steps:
            session = self.create_session(step.name)
            try:
                result = self.execute_step(step, session)
                self.create_checkpoint(step, result)
            except Exception as e:
                if self.can_retry(e):
                    self.retry_with_backoff(step)
                else:
                    self.escalate_to_human(step, e)
Problem

Complex engineering workflows require extensive human oversight, with manual coordination, monitoring, and intervention at each step.

Solution

Containerized agents with tmux session management, intelligent monitoring, and context-aware error recovery for autonomous multi-step execution.

When to use
  • Model training and evaluation pipelines
  • Infrastructure provisioning and configuration
  • Multi-stage deployment workflows
Trade-offs

Pros

  • 1.2-1.4x speedup in workflow execution
  • Reduced human intervention for routine steps
  • Comprehensive automatic logging

Cons

  • Limited handling of novel failure scenarios
  • Context window constraints for long workflows
  • Setup complexity for containers and monitoring

Continuous Autonomous Task Loop

Orchestration & Control
Flow
┌──────────────────────────────────────────────┐
│              AUTONOMOUS LOOP                 │
│                                              │
│  Pick Task ──▶ Execute ──▶ Commit ──▶ Next  │
│      ▲                           │           │
│      └───────────────────────────┘           │
│                                              │
│  Rate Limit? ──▶ Exponential Backoff ──▶ ↺  │
└──────────────────────────────────────────────┘
Example
MAX_ITERATIONS=50
BACKOFF=300  # seconds to wait when rate limited
i=0

while [ $i -lt $MAX_ITERATIONS ]; do
  task=$(claude "Pick next from TODO.md")
  claude --auto-accept "$task" || sleep $BACKOFF  # back off on failure
  git add -A && git commit -m "$task"
  i=$((i + 1))
done
Problem

Manual task orchestration—selection, commits, rate-limit handling—interrupts developer flow.

Solution

Implement continuous loop with fresh context per task, auto-commits, and intelligent backoff.

When to use
  • Batch processing TODO lists
  • Overnight autonomous development
  • Discrete, well-defined tasks
Trade-offs

Pros

  • Complete autonomy
  • Fresh context per task
  • Handles rate limits gracefully

Cons

  • Reduced human oversight
  • Elevated permission requirements
  • Risk of runaway execution

Distributed Execution with Cloud Workers

Orchestration
Flow
                    ┌──▶ Worker1 [Worktree-A] ──┐
                    │                           │
Coordinator ──▶ ────┼──▶ Worker2 [Worktree-B] ──┼──▶ Merge ──▶ main
                    │                           │
                    └──▶ Worker3 [Worktree-C] ──┘

Tasks:  [████████] → [██] [██] [██] → [████████]
Example
coordinator.deploy_workers(
    tasks=["refactor auth", "update API", "fix tests"],
    workers=3,
    git_worktrees=True
)
# Each worker: isolated worktree + Claude session
# Coordinator: monitors, merges PRs
Problem

Single-session AI agent execution cannot scale to meet enterprise team demands with multiple simultaneous code changes.

Solution

Run multiple Claude sessions in parallel using git worktrees and cloud workers with centralized coordination.

When to use
  • Team-wide code migrations
  • Parallel feature development
  • Large-scale infrastructure changes
Trade-offs

Pros

  • 10x-100x parallelization speedup
  • Scales to enterprise teams
  • Centralized monitoring

Cons

  • Significant infrastructure complexity
  • Merge conflict overhead
  • Higher parallel model costs

Feature List as Immutable Contract

Orchestration & Control
Flow
feature-list.json (IMMUTABLE)
┌─────────────────────────────────┐
│ auth-001: [░] New chat button   │
│ auth-002: [░] Logout function   │◄── Agent can ONLY
│ ui-001:   [█] Dark mode         │    set passes=true
│ ui-002:   [░] Responsive nav    │
│ api-001:  [░] Rate limiting     │◄── Cannot DELETE
└─────────────────────────────────┘    or MODIFY

Agent ──▶ Implement ──▶ Test ──▶ [█] Mark Done
Example
{
  "features": [{
    "id": "auth-001",
    "description": "New chat creates conversation",
    "steps": ["Click button", "Verify URL"],
    "passes": false  // Agent can ONLY flip to true
  }]
}
Problem

Long-running agents declare premature victory, delete tests to pass, or lose track of requirements across sessions.

Solution

Define all features upfront in an immutable JSON that agents can mark complete but cannot modify or delete.

When to use
  • Building complete applications with known requirements
  • Projects spanning many agent sessions
  • When agent accountability is critical
Trade-offs

Pros

  • Prevents premature completion claims
  • Eliminates 'pass by deletion' attacks

Cons

  • Requires upfront feature specification
  • Rigid for changing requirements
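
Sketch

The contract can be enforced outside the agent; a sketch that rejects any update other than flipping passes to true.

def valid_update(original: dict, proposed: dict) -> bool:
    orig = {f["id"]: f for f in original["features"]}
    prop = {f["id"]: f for f in proposed["features"]}
    if orig.keys() != prop.keys():
        return False                        # no additions or deletions
    for fid, new in prop.items():
        old = orig[fid]
        strip = lambda f: {k: v for k, v in f.items() if k != "passes"}
        if strip(new) != strip(old):
            return False                    # all other fields are immutable
        if old["passes"] and not new["passes"]:
            return False                    # cannot un-complete a feature
    return True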

Iterative Multi-Agent Brainstorming

Orchestration & Control
Flow
              ┌─── Agent A ──▶ Ideas 1 ───┐
              │                           │
Problem ──▶   ├─── Agent B ──▶ Ideas 2 ───┼──▶ Synthesize
              │                           │
              └─── Agent C ──▶ Ideas 3 ───┘

Perspectives: [Performance] [Security] [UX]
Example
# Parallel brainstorming agents
ideas = await asyncio.gather(
    agent("Refactor for performance"),
    agent("Refactor for readability"),
    agent("Refactor for testability")
)
best = synthesize(ideas)
Problem

Single agent gets stuck in local optimum or fails to explore diverse solutions for complex problems.

Solution

Spawn multiple agents with different perspectives to brainstorm in parallel, then synthesize best ideas.

When to use
  • Creative ideation tasks
  • Complex refactoring decisions
  • Design exploration
Trade-offs

Pros

  • Diverse perspectives
  • Avoids local optima

Cons

  • Higher token cost
  • Synthesis complexity

Multi-Model Orchestration for Complex Edits

Orchestration & Control
Flow
User Request ──▶ Retrieval Model ──▶ Generation Model ──▶ Apply Model ──▶ Done
                     │                    │                   │
                 (context)            (edits)            (patches)
Example
# Multi-model pipeline
context = retrieval_model.gather(codebase)
edits = generation_model.generate(
    context, user_request
)  # Claude Sonnet
apply_model.patch(edits)  # Custom
Problem

A single model may not be optimal for all sub-tasks in complex operations like multi-file code editing.

Solution

Pipeline multiple specialized models: retrieval model for context, generation model for edits, and application model for changes.

When to use
  • Multi-file code editing tasks
  • Complex operations requiring different skills
  • Tasks with distinct retrieval, generation, and application phases
Trade-offs

Pros

  • Leverages each model's strengths
  • More robust outcomes than single model

Cons

  • Orchestration complexity
  • Latency from multiple model calls

Plan-Then-Execute Pattern

Orchestration & Control
Flow
         PLAN PHASE              EXECUTE PHASE
             │                        │
Prompt ──▶ [LLM] ──▶ Plan ──▶ [Controller] ──▶ Results
             │        │              │
             │    ┌───┴───┐     ┌────┴────┐
             │    │ call1 │     │ run(1)  │
             │    │ call2 │────▶│ run(2)  │
             │    │ call3 │     │ run(3)  │
             │    └───────┘     └─────────┘
             │                       │
          frozen               no plan changes
Example
plan = LLM.make_plan(prompt)  # frozen
for call in plan:
    result = tools.run(call)
    stash(result)  # outputs isolated
# Tool outputs can't change which tools run
Problem

If tool outputs alter the choice of later actions, injected instructions can redirect the agent toward malicious steps

Solution

Split into Plan phase (fixed sequence before seeing untrusted data) and Execute phase (controller runs exact sequence)

When to use
  • Email & calendar bots
  • SQL assistants
  • Tasks where action set is known but params vary
Trade-offs

Pros

  • Strong control-flow integrity
  • 2-3x success rates for complex tasks

Cons

  • Output content can still be poisoned
  • Less flexible for dynamic tasks
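
Sketch

The frozen-plan controller in miniature; ToolCall and the tools mapping are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    params: dict

def run(prompt, make_plan, tools):
    plan = tuple(make_plan(prompt))     # plan fixed before any tool output
    results = []
    for call in plan:                   # controller follows the plan exactly
        results.append(tools[call.name](**call.params))
    return results                      # outputs never alter which tools run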

Self-Rewriting Meta-Prompt Loop

Orchestration
Flow
Episode ──▶ Reflect ──▶ Draft Delta ──▶ Validate ──┐
                                                    │
               ┌────────────────────────────────────┘
              ▼
        [System Prompt v1] ──▶ [System Prompt v2] ──▶ Next Episode
Example
dialogue = run_episode()
delta = LLM("Propose prompt edits", dialogue)
if passes_guardrails(delta):
    system_prompt += delta
    save(system_prompt)
Problem

Static system prompts become stale as agents encounter new tasks and edge cases

Solution

Let the agent rewrite its own system prompt after each interaction through reflection and validation

When to use
  • Agent encounters recurring failures
  • Prompt needs frequent minor tweaks
  • Continuous learning without human intervention
Trade-offs

Pros

  • Rapid adaptation to new scenarios
  • No human in the loop for minor tweaks

Cons

  • Risk of drift or jailbreak
  • Requires strong guardrails

Three-Stage Perception Architecture

Orchestration
Flow
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  PERCEPTION   │   │  PROCESSING   │   │    ACTION     │
│───────────────│   │───────────────│   │───────────────│
│ Text/Image/   │   │  Reasoning    │   │ API Calls     │
│ Audio Input   │──▶│  Decision     │──▶│ File Ops      │
│ Normalization │   │  Validation   │   │ Notifications │
└───────────────┘   └───────────────┘   └───────────────┘
                            │
                            ▼
                    [ Feedback Loop ]
Example
class ThreeStageAgent:
    async def run(self, raw_input):
        # Stage 1: Perception
        data = await self.perception.process(raw_input)
        # Stage 2: Processing
        decisions = await self.processor.analyze(data)
        # Stage 3: Action
        return await self.action.execute(decisions)
Problem

Monolithic agents mixing perception, reasoning, and action are hard to debug, extend, or scale independently.

Solution

Separate workflow into three stages: Perception (input normalization), Processing (reasoning), and Action (execution).

When to use
  • Complex multi-modal inputs
  • Need independent component scaling
  • Team collaboration per stage
Trade-offs

Pros

  • Clean separation of concerns
  • Better error isolation

Cons

  • Additional complexity for simple tasks
  • Latency from stage transitions

Agent-Driven Research

Orchestration & Control
Flow
Question ──▶ Formulate Query ──▶ Search ──▶ Analyze
                    ▲                          │
                    │    Sufficient? ◀─────────┘
                    │         │
                    │    No   │ Yes
                    └─────────┘   │
                                  ▼
                            Synthesize & Present
Example
while not agent.satisfied():
    query = agent.formulate_query(question)
    results = search_tool.execute(query)
    agent.analyze(results)
    agent.refine_strategy()
return agent.synthesize_findings()
Problem

Traditional research lacks the ability to dynamically adjust the search strategy based on new findings

Solution

Let the agent independently formulate queries, execute searches, analyze results, and iterate until satisfied

When to use
  • Complex research questions
  • Gathering information from multiple sources
  • Adaptive searching is needed
Trade-offs

Pros

  • Adaptive search strategy
  • Autonomous iteration

Cons

  • May miss obvious sources
  • Token cost of repeated iterations

Discrete Phase Separation

Orchestration & Control
Flow
┌──────────┐    Findings    ┌──────────┐    Plan    ┌──────────┐
│ Research │ ────────────▶  │ Planning │ ────────▶  │ Execute  │
│  (Opus)  │                │  (Opus)  │            │ (Sonnet) │
└──────────┘                └──────────┘            └──────────┘
     ↓                           ↓                       ↓
[Fresh ctx]               [Fresh ctx]              [Fresh ctx]

Key: Pass distilled conclusions, NOT full history
Example
# Phase 1: Research (new conversation)
findings = opus.research("OAuth flows in codebase")

# Phase 2: Plan (new conversation)
plan = opus.plan(findings, "Add Google OAuth")

# Phase 3: Execute (new conversation)
sonnet.implement(plan, step=1)
Problem

Simultaneous research, planning, and implementation causes context contamination and degraded output.

Solution

Break workflow into isolated phases (Research, Plan, Execute) with clean handoffs of distilled conclusions.

When to use
  • Complex features needing background research
  • Refactoring projects with architectural decisions
  • Mixing research and implementation hurts quality
Trade-offs

Pros

  • Higher quality per phase
  • Prevents context contamination
  • Leverages model-specific strengths

Cons

  • Requires explicit phase management
  • Feels slower for simple tasks
  • Higher total token usage

Dual LLM Pattern

Orchestration
Flow
┌─────────────────┐      ┌─────────────────┐
│  Quarantined    │      │   Privileged    │
│     LLM         │      │      LLM        │
│                 │      │                 │
│ [Reads Data]    │ $VAR │ [Plans + Tools] │
│ [No Tools]      │ ───▶ │ [No Raw Data]   │
└─────────────────┘      └────────┬────────┘
        ▲                         │
        │                         ▼
   Untrusted              execute(plan, $VAR)
     Input
Example
var1 = QuarantineLLM("extract email", text)
# Returns symbolic: $VAR1

plan = PrivLLM.plan("send $VAR1 to boss")
# No raw text exposure

execute(plan, subst={"$VAR1": var1})
Problem

A privileged agent that sees untrusted text AND wields tools can be coerced into dangerous calls.

Solution

Split into Privileged LLM (plans/tools, no raw data) and Quarantined LLM (reads data, no tools), passing data as symbolic variables.

When to use
  • Email/calendar assistants
  • Booking agents handling user data
  • API-powered chatbots
Trade-offs

Pros

  • Clear trust boundary
  • Compatible with static analysis
  • Prevents injection attacks

Cons

  • Increased complexity
  • Debugging across two models
  • Variable mapping overhead
View Original →

Inference-Time Scaling

Orchestration & Control
Flow
Problem ──▶ Assess Difficulty
                    │
      ┌─────────────┼─────────────┐
      ▼             ▼             ▼
    [Low]       [Medium]       [High]
      │             │             │
  Standard    N Attempts    Deep Search
      │             │             │
      └─────────────┴─────────────┘
                    │
               Select Best ──▶ Answer
flowchart TD A[Problem] --> B{Difficulty?} B -->|Low| C[Standard] B -->|Medium| D[N Attempts] B -->|High| E[Deep Reasoning] C & D & E --> F[Select Best] F --> G[Answer]
Example
def solve_with_scaling(problem, budget=100):
    difficulty = estimate_difficulty(problem)
    if difficulty < 0.3:
        return standard_inference(problem)
    elif difficulty < 0.7:
        return best_of_n(problem, n=5)
    else:
        return deep_reasoning_with_search(problem)
Problem

Once a model is trained, its performance is fixed; it cannot 'think harder' by allocating more compute to challenging problems.

Solution

Allocate additional inference compute dynamically: generate multiple candidates, perform extended reasoning, iterate and refine outputs.
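The best_of_n() call in the example above could be as simple as this sketch, assuming generate() and score() helpers:
def best_of_n(problem, generate, score, n=5):
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=score)  # keep the highest-scoring attempt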

When to use
  • Complex reasoning tasks
  • Problems with verifiable solutions
  • When latency is acceptable
Trade-offs

Pros

  • Dramatically improves hard tasks
  • More cost-effective than larger models

Cons

  • Increased latency
  • Higher inference costs
View Original →

Language Agent Tree Search (LATS)

Orchestration & Control
Flow
        [Root]
       /   |   \
     [A]  [B]  [C]    ← Expand candidates
      |    |
    [A1] [B1]         ← UCB select best
      |
    [A1a]             ← Evaluate & backprop

Select ──▶ Expand ──▶ Evaluate ──▶ Backprop
flowchart TB R[Root] --> A[Path A] R --> B[Path B] A --> A1[A.1] A1 --> A2[A.2 Best] B --> B1[B.1] style A2 fill:#90EE90
Example
def search(root, iterations=50):
    for _ in range(iterations):
        node = select(root)  # UCB
        children = expand(node)  # LLM
        value = evaluate(node)  # LLM
        backpropagate(node, value)
    return best_path(root)
Problem

Linear reasoning (ReAct) gets stuck in local optima on complex tasks requiring strategic exploration.

Solution

Apply Monte Carlo Tree Search (MCTS) with LLM for action generation, evaluation, and backpropagation.
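The select() step from the example above might look like this sketch; the node fields (visits, value, children) are assumptions about your tree structure:
import math

def ucb_select(node, c=1.4):
    while node.children:  # descend to the most promising leaf
        node = max(node.children, key=lambda ch:
                   ch.value / (ch.visits + 1e-9) +
                   c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
    return node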

When to use
  • Multi-step reasoning
  • Strategic planning
  • Multiple valid approaches
Trade-offs

Pros

  • Systematic exploration
  • Outperforms ReAct

Cons

  • High LLM call cost
  • Parameter tuning
View Original →

Opponent Processor / Multi-Agent Debate

Orchestration & Control
Flow
Task ──┬──▶ [Advocate] ──▶ Propose ──┐
       │                             │
       └──▶ [Critic]   ──▶ Challenge ┼──▶ Debate ──▶ Synthesis
                                     │
                              (Iterate)
flowchart LR A[Task] --> B[Advocate] A --> C[Critic] B --> D[Debate] C --> D D --> E[Synthesized Decision]
Example
# Debate pattern for expenses
advocate = Agent(role="user_advocate")
auditor = Agent(role="company_auditor")

proposal = advocate.classify(expense)
challenge = auditor.review(proposal)
final = synthesize(proposal, challenge)
Problem

Single-agent decisions suffer from confirmation bias, limited perspectives, and unexamined assumptions.

Solution

Spawn opposing agents with different goals to debate each other, surfacing blind spots and unconsidered alternatives.

When to use
  • Decisions requiring balanced perspectives
  • High-stakes choices needing scrutiny
  • Tasks prone to confirmation bias
Trade-offs

Pros

  • Reduces bias through adversarial pressure
  • Surfaces blind spots and trade-offs

Cons

  • 2x+ token cost
  • May deadlock without resolution mechanism
View Original →

Progressive Autonomy with Model Evolution

Orchestration & Control
Flow
Model v1 ──▶ Heavy Scaffolding ──▶ [2000 tokens prompt]
    │
    ▼ (model upgrade)
Model v2 ──▶ Audit & Remove ──▶ [500 tokens prompt]
    │
    ▼ (model upgrade)
Model v3 ──▶ Minimal Prompt ──▶ [100 tokens prompt]
flowchart LR A[Model v1] --> B[Heavy Scaffolding] B --> C[Model v2 Released] C --> D[Remove Unnecessary] D --> E[Model v3 Released] E --> F[Minimal Prompt]
Example
# Before: 2000 tokens of instructions
system_prompt_v1 = """Check file exists, read contents,
plan changes, make minimal edits, verify syntax..."""

# After: Model internalized the steps
system_prompt_v2 = "Write clean, tested code."
Problem

Agent scaffolding built for older models becomes unnecessary overhead as models improve, creating prompt bloat, wasted tokens, and maintenance burden.

Solution

Actively remove scaffolding as models become more capable. Regularly audit system prompts and orchestration logic to eliminate what newer models have internalized.

When to use
  • New model releases available
  • System prompts exceeding necessary length
  • Complex orchestration for simple tasks
Trade-offs

Pros

  • Reduced token costs
  • Faster execution
  • Simpler maintenance

Cons

  • Requires testing for quality validation
  • May need different configs per model version
  • Loss of explicit control
View Original →

Specification-Driven Agent Development

Orchestration
Flow
Spec File ──▶ Parse ──▶ Task Graph ──▶ Scaffold
   (MD/JSON)                            │
                                        ▼
                        Generated Code ◀── [links to spec clause]
flowchart LR A[Spec File] --> B[Parse Spec] B --> C[Build Task Graph] C --> D[Scaffold Code] D --> E[Link to Spec Clauses] E --> F[Iterate via Spec Edits]
Example
if new_feature_requested:
    write_spec(update)
    agent.sync_with(spec)
# All artifacts link back to spec clauses
Problem

Hand-crafted prompts leave room for ambiguity; agents can over-interpret or conflict with stakeholder intent

Solution

Use a formal spec file as the agent's primary input and source of truth, iterating only by editing the spec
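One way the spec-to-tasks step could look, assuming a JSON spec with id/requirement fields per clause (field names are illustrative):
import json

def tasks_from_spec(path):
    with open(path) as f:
        spec = json.load(f)
    # Every generated task carries its clause id for traceability
    return [{"clause": c["id"], "prompt": f"Implement: {c['requirement']}"}
            for c in spec["clauses"]]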

When to use
  • Complex multi-step feature development
  • Audit-friendly, repeatable workflows
  • Team collaboration requiring clear contracts
Trade-offs

Pros

  • Repeatable, audit-friendly, easy diffing
  • Clear artifact traceability to spec

Cons

  • Up-front spec writing effort
  • Ramp-up for teams new to spec formats
View Original →

Tool Capability Compartmentalization

Orchestration
Flow
        Monolithic Tool (RISKY)          Compartmentalized (SAFE)
        ┌─────────────────────┐           ┌─────────┐
        │  read + fetch +     │           │ READER  │ fs:read-only
        │  write ALL-IN-ONE   │    ──▶    └────┬────┘
        │  🔓 max surface     │                │  consent
        └─────────────────────┘           ┌────▼────┐
                                          │PROCESSOR│ net:allowlist
                                          └────┬────┘
                                               │  consent
                                          ┌────▼────┐
                                          │ WRITER  │ scoped perms
                                          └─────────┘
flowchart LR subgraph Old[Monolithic] A[Read+Fetch+Write] end subgraph New[Compartmentalized] R[Reader] -->|consent| P[Processor] P -->|consent| W[Writer] end Old --> New
Example
# tool-manifest.yml
email_reader:
  capabilities: [private_data, untrusted_input]
  permissions:
    fs: read-only:/mail
    net: none
issue_creator:
  capabilities: [external_comm]
  permissions:
    net: allowlist:github.com
Problem

Mix-and-match tools combining data readers, web fetchers, and writers amplify prompt-injection attack chains.

Solution

Split tools into reader/processor/writer micro-tools with isolated permissions and explicit per-call consent.

When to use
  • Tools handle private data
  • Multiple capability types combined
  • Security is high priority
Trade-offs

Pros

  • Fine-grained security control
  • Plays well with modular architectures

Cons

  • More tooling overhead
  • Permission creep over time
View Original →

Disposable Scaffolding Over Durable Features

Orchestration & Control
Flow
┌─────────────────────────────────────────────────┐
│           THE BITTER LESSON CYCLE               │
│                                                 │
│  New Model ──▶ Eval Scaffolding ──▶ Obsolete?  │
│       ▲               │                │        │
│       │          Still needed?     Discard      │
│       │               │                │        │
│       │          ──▶ Keep minimal ◀────┘        │
│       └─────────────────────────────────────────│
│                                                 │
│  Mindset: Build simple, throw away, repeat     │
└─────────────────────────────────────────────────┘
flowchart TD A[New Model Release] --> B{Eval Scaffolding} B -->|Obsolete| C[Discard] B -->|Still needed| D[Keep minimal] C --> E[Rebuild lightweight] D --> E E --> F[Focus on core value] F --> A
Example
# Instead of 3-month robust solution:
def quick_context_compressor(text):
    """Expect this to be obsolete by Q2"""
    return simple_summarize(text)

# Focus engineering on unique value
# that won't "fall into the model"
Problem

Complex features built around models become obsolete when next-gen models perform those tasks natively.

Solution

Treat tooling as temporary scaffolding; build simplest solutions expecting they will be discarded.

When to use
  • Building model-compensating features
  • Rapid model improvement cycles
  • Need product agility over long-term investment
Trade-offs

Pros

  • Fast adaptation to new models
  • Lower engineering investment risk
  • Focus on unique value, not workarounds

Cons

  • May feel wasteful
  • Requires discipline to avoid over-engineering
  • Tech debt accumulates if not cleaned
View Original →

Explicit Posterior-Sampling Planner

Orchestration
Flow
┌─────────────────────────────────────┐
│         PSRL Loop                   │
└─────────────────────────────────────┘

Posterior ──▶ Sample Model ──▶ Compute Plan
    ▲                              │
    │                              ▼
    └──── Update ◀── Observe ◀── Execute

P(model|data) → sample → plan → act → reward
flowchart LR P[Posterior P(M|D)] --> S[Sample Model] S --> C[Compute Plan] C --> E[Execute] E --> O[Observe Reward] O --> U[Update Posterior] U --> P
Example
# PSRL-based agent loop
posterior = init_prior(task_models)

while not done:
    model = posterior.sample()
    plan = compute_optimal_plan(model)
    reward = execute(plan)
    posterior.update(observation, reward)
    # Natural language: LLM fills each step
Problem

Agents relying on ad-hoc heuristics explore poorly, wasting tokens and API calls on dead ends.

Solution

Embed PSRL algorithm: maintain Bayesian posterior over task models, sample model, compute optimal plan, execute, update posterior.
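A concrete PSRL instance for a Bernoulli bandit, purely illustrative of the sample-plan-act-update cycle in the example above:
import random

def psrl_bandit(pull_arm, n_arms=3, steps=100):
    a, b = [1] * n_arms, [1] * n_arms           # Beta(1,1) posteriors per arm
    for _ in range(steps):
        sampled = [random.betavariate(a[i], b[i]) for i in range(n_arms)]
        arm = sampled.index(max(sampled))        # plan = greedy on sampled model
        reward = pull_arm(arm)                   # execute, observe 0/1 reward
        a[arm] += reward
        b[arm] += 1 - reward                     # update posterior
    return a, b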

When to use
  • Exploration-heavy tasks
  • Multi-step decision making
  • Need principled exploration
Trade-offs

Pros

  • Principled exploration
  • Efficient token usage
  • Theoretical guarantees

Cons

  • Implementation complexity
  • Posterior computation cost
  • Requires RL expertise
View Original →

Initializer-Maintainer Dual Agent

Orchestration & Control
Flow
┌─── ONCE ─────────────────────────────────┐
│ Initializer ──▶ features.json            │
│              ──▶ init.sh                 │
│              ──▶ progress.txt            │
│              ──▶ First Commit            │
└──────────────────────────────────────────┘
                    ▼
┌─── EACH SESSION ─────────────────────────┐
│ Maintainer ──▶ Read git/progress         │
│            ──▶ Select next feature       │
│            ──▶ Implement + Test          │
│            ──▶ Commit + Update progress  │
└──────────────────────────────────────────┘
sequenceDiagram participant Init as Initializer participant FS as Filesystem participant Code as Coding Agent Note over Init: Runs ONCE Init->>FS: feature-list.json Init->>FS: init.sh, progress.txt Note over Code: Runs EACH session loop Session N Code->>FS: Read progress Code->>Code: Implement feature Code->>FS: Update & commit end
Example
# Initializer creates foundation
project/
  feature-list.json  # All features, passes=false
  progress.txt       # Running log
  init.sh            # One-command bootstrap

# Maintainer session ritual
$ ./init.sh && read_progress && implement_next
Problem

Single-agent approaches either over-engineer each session (wasting setup time) or under-invest in foundations (causing drift and confusion).

Solution

Use two specialized agents: Initializer creates foundations once (features, env, tracking); Maintainer handles incremental development across sessions.

When to use
  • Projects requiring many sessions
  • Complex applications with 50+ features
  • When context loss is costly
Trade-offs

Pros

  • Clear separation of setup vs execution
  • Prevents context loss across sessions

Cons

  • Requires upfront specification
  • Two configs to maintain
View Original →

LLM Map-Reduce Pattern

Orchestration & Control
Flow
        MAP                    REDUCE
[Doc1] ──▶ Sandbox LLM ──┐
[Doc2] ──▶ Sandbox LLM ──┼──▶ Aggregate ──▶ Result
[Doc3] ──▶ Sandbox LLM ──┘
           (isolated)    (safe summaries only)
flowchart LR D1[Doc 1] --> S1[Sandbox] D2[Doc 2] --> S2[Sandbox] D3[Doc 3] --> S3[Sandbox] S1 --> R[Reduce] S2 --> R S3 --> R R --> O[Output]
Example
results = []
for doc in untrusted_docs:
    # Sandboxed: constrained output
    ok = sandbox_llm("Invoice? yes/no", doc)
    results.append(ok)
# Reduce: no raw docs here
final = reduce(results)
Problem

A single poisoned document can manipulate global reasoning if all data is processed in one context.

Solution

Map: sandboxed LLMs process each doc independently. Reduce: aggregate only sanitized outputs.
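A parallel sketch of the map step; sandbox_llm is the same hypothetical constrained-output call as in the example above:
from concurrent.futures import ThreadPoolExecutor

def map_reduce(docs, sandbox_llm, reduce_fn):
    with ThreadPoolExecutor() as pool:  # each doc in its own sandboxed call
        verdicts = list(pool.map(lambda d: sandbox_llm("Invoice? yes/no", d), docs))
    return reduce_fn(verdicts)          # reducer never sees raw documents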

When to use
  • Processing untrusted docs
  • File triage/classification
  • N-to-1 decisions
Trade-offs

Pros

  • Poisoned item isolated
  • Scalable parallelism

Cons

  • Output validation needed
  • Orchestration overhead
View Original →

Oracle and Worker Multi-Model

Orchestration & Control
Flow
Request ──▶ Worker(Sonnet) ──┬──▶ Execute ──▶ Done
                             │
                        [Stuck?]
                             │
                             └──▶ Oracle(o3) ──▶ Strategy ──▶ Worker
flowchart LR A[Request] --> B[Worker] B --> C{Stuck?} C -->|No| D[Execute] C -->|Yes| E[Oracle] E --> F[Strategy] F --> B D --> G[Done]
Example
worker = Agent(model="sonnet-4")
oracle = Agent(model="o3")

result = worker.execute(task)
if worker.is_stuck():
    strategy = oracle.consult(context)
    result = worker.execute(strategy)
Problem

A single model creates a trade-off between capability and cost. Powerful models are expensive for routine tasks.

Solution

Two-tier system: fast Worker (Sonnet) handles bulk tasks, expensive Oracle (o3/Gemini) reserved for high-level reasoning and debugging.

When to use
  • Mix of routine and complex tasks
  • Cost optimization is important
  • Tasks where worker may get stuck
Trade-offs

Pros

  • Cost-efficient use of frontier models
  • Specialized AI team approach

Cons

  • Orchestration complexity
  • Latency from model switching
View Original →

Progressive Complexity Escalation

Orchestration & Control
Flow
Tier 1: Research ──▶ Present to Human
           │ (proven reliable)
           ▼
Tier 2: Research ──▶ Draft ──▶ Human Approves
           │ (proven reliable)
           ▼
Tier 3: Research ──▶ Draft ──▶ Auto-Send (if conf > 0.8)
flowchart TD A[Tier 1: Info Gathering] --> B{Reliable?} B -->|Yes| C[Tier 2: Draft + Approval] C --> D{Reliable?} D -->|Yes| E[Tier 3: Autonomous]
Example
class AgentCapabilities:
    def process(self, data):
        research = self.research(data)  # Tier 1
        if self.tier >= 2:
            draft = self.generate(research)
            if self.tier >= 3 and self.confidence > 0.8:
                return self.auto_execute(draft)
            return self.request_approval(draft)
        return self.present_findings(research)
Problem

Deploying agents with overly ambitious capabilities from day one leads to unreliable outputs, failed implementations, and safety risks from autonomous high-stakes operations.

Solution

Start with low-complexity, high-reliability tasks and progressively unlock more complex capabilities as models improve and trust is established through capability tiers.

When to use
  • Deploying agents into production
  • High-stakes or regulated domains
  • Building internal automation tools
Trade-offs

Pros

  • Risk mitigation via limited blast radius
  • Builds stakeholder confidence
  • Graceful degradation

Cons

  • Delayed full automation benefits
  • Tier management complexity
  • Promotion friction
View Original →

Stop Hook Auto-Continue Pattern

Orchestration
Flow
Agent Turn ──▶ Stop Hook ──▶ Run Tests ──┐
                                         │
          ┌──────────────────────────────┘
          │ Fail?
          ▼
   Auto-Continue ──▶ Agent Fixes ──▶ [loop until pass]
flowchart LR A[Agent Turn End] --> B[Stop Hook] B --> C{Tests Pass?} C -->|Yes| D[Return to User] C -->|No| E[Continue Agent] E --> A
Example
// hooks config
{
  "on_stop": {
    "command": "./check_tests.sh",
    "auto_continue_on_failure": true
  }
}
Problem

Agents complete turns even when tasks are not truly done (tests fail, checks incomplete)

Solution

Use stop hooks to check success criteria after each turn; auto-continue agent until criteria pass
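A host-side sketch of the loop with an iteration cap against the infinite-loop risk noted under Cons; agent.turn() is a hypothetical single-turn call:
import subprocess

def run_until_green(agent, check_cmd="./check_tests.sh", max_turns=10):
    prompt = "Complete the task."
    for _ in range(max_turns):
        agent.turn(prompt)
        if subprocess.run([check_cmd]).returncode == 0:
            return True                       # criteria met, hand back to user
        prompt = "Tests still failing; continue fixing."
    return False                              # cap reached, stop burning tokens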

When to use
  • Test-driven development workflows
  • Autonomous task completion required
  • Sandboxed/containerized environments
Trade-offs

Pros

  • True task completion guaranteed
  • No manual re-prompting needed

Cons

  • Risk of infinite loops
  • Runaway costs without timeout
View Original →

Tree-of-Thought Reasoning

Orchestration
Flow
                    [Problem]
                        │
            ┌───────────┼───────────┐
            ▼           ▼           ▼
        [Step A]    [Step B]    [Step C]
           │           │           │
       ┌───┴───┐   ┌───┴───┐       ✗
       ▼       ▼   ▼       ▼
    [A1]    [A2] [B1]    [B2]  ← evaluate
       │       ✗   ✓       ✗
       ▼           │
    [A1.1]    [best path]
flowchart TD P[Problem] --> A[Step A] P --> B[Step B] P --> C[Step C] A --> A1[A1] --> A11[A1.1] A --> A2[A2] B --> B1[B1 Best] B --> B2[B2] C --> X[Pruned]
Example
import heapq, itertools

def tree_of_thought(root, expand, evaluate, limit=50):
    tie = itertools.count()          # tiebreaker: thoughts may not be comparable
    heap, best = [(0.0, next(tie), root)], (float("-inf"), root)
    for _ in range(limit):
        if not heap:
            break
        _, _, thought = heapq.heappop(heap)
        for step in expand(thought):
            score = evaluate(step)
            best = max(best, (score, step), key=lambda t: t[0])
            heapq.heappush(heap, (-score, next(tie), step))
    return best[1]
Problem

Linear chain-of-thought reasoning gets stuck on complex problems, missing alternatives or failing to backtrack.

Solution

Explore a search tree of intermediate thoughts, expand multiple steps, evaluate partial solutions before committing.

When to use
  • Complex puzzles or planning tasks
  • Multiple valid approaches exist
  • Backtracking may be needed
Trade-offs

Pros

  • Covers more possibilities
  • Improves reliability on hard tasks

Cons

  • Higher compute cost
  • Needs good scoring method
View Original →

Inversion of Control

Orchestration & Control
Flow
Human: "Refactor UploadService to async"
       │
       ▼
┌─────────────────────────────────────────┐
│          Agent Decides HOW              │
│  ┌─────┐   ┌─────┐   ┌─────┐           │
│  │grep │──▶│edit │──▶│test │──▶ ...    │
│  └─────┘   └─────┘   └─────┘           │
│      (Agent orchestrates 87%)           │
└─────────────────────────────────────────┘
       │
       ▼
Human: Review PR (3%)
sequenceDiagram Dev->>Agent: "Refactor UploadService to async" Agent->>Repo: git grep "UploadService" Agent->>Tools: edit_file Agent->>Tools: run_tests Agent-->>Dev: PR with green CI
Example
# Human provides goal, not steps
agent.run(
    goal="Refactor UploadService to async",
    tools=[grep, edit_file, run_tests, git],
    guardrails=["no prod DB access", "tests must pass"]
)
# Agent decides: grep -> analyze -> edit -> test -> PR
Problem

Traditional 'prompt-as-puppeteer' workflows force humans to spell out every step, limiting scale and creativity.

Solution

Give the agent tools plus a high-level goal and let it decide orchestration. Humans supply guard-rails (10%) while the agent handles execution (87%).

When to use
  • Complex tasks with unclear sequence
  • Agent has necessary tools
  • Human oversight available
Trade-offs

Pros

  • Scales without step-by-step prompting
  • Unleashes agent creativity

Cons

  • Requires trust in agent judgment
  • May take unexpected paths
View Original →

Parallel Tool Call Learning

Orchestration & Control
Flow
BEFORE (Sequential):  T1 ──▶ T2 ──▶ T3 ──▶ T4  (7s)

AFTER (RL-Learned Parallel):
    ┌── T1 ──┐
    ├── T2 ──┼──▶ Results ──▶ T4 ──▶ Done  (3.5s)
    └── T3 ──┘
flowchart LR A[Agent] --> B[Batch 1: Parallel] B --> C[T1] B --> D[T2] B --> E[T3] C & D & E --> F[Aggregate] F --> G[Batch 2] G --> H[Done]
Example
# RL learns parallel patterns naturally
# No explicit parallelization code needed
job = client.fine_tuning.jobs.create(
    model="gpt-4o",
    method="rft",
    rft={"tools": tools, "grader": grader}
)
# Model discovers: batch independent calls
Problem

Agents execute tool calls sequentially even when they could run in parallel, causing unnecessary latency.

Solution

Use Agent RFT to teach models to parallelize independent tool calls, reducing latency by 40-50% when tool execution is fast.

When to use
  • Tool execution faster than inference
  • Independent information gathering
  • Broad exploration phases
Trade-offs

Pros

  • 40-50% latency reduction
  • Emerges naturally from RL training

Cons

  • Requires concurrent tool infrastructure
  • Higher peak resource usage
View Original →

Swarm Migration Pattern

Orchestration
Flow
Main ──▶ Scan (100 files) ──▶ Todo List ──▶ Spawn 10 Agents
                                              │
     ┌──────────────────────────────────────────┘
     │  [10 files each, parallel]
     ▼
 Verify All ──▶ Consolidated PR
flowchart TD A[Main Agent] --> B[Scan Codebase] B --> C[Create Todo: 100 files] C --> D[Spawn 10 Subagents] D --> E1[Agent 1: 1-10] D --> E2[Agent 2: 11-20] D --> E3[...] E1 --> F[Verify] E2 --> F F --> G[Merged PR]
Example
files = find("*.test.js", old_framework)
for batch in chunk(files, 10):
    spawn_agent(
        task=f"Migrate {batch} to new framework",
        auto_commit=True
    )
Problem

Large-scale code migrations (framework upgrades, lint fixes) are slow when done sequentially

Solution

Main agent orchestrates 10+ parallel subagents, each migrating a batch of files in map-reduce fashion

When to use
  • Framework migrations (Jest to Vitest, etc.)
  • Lint rule rollouts across many files
  • API updates or code modernization
Trade-offs

Pros

  • 10x+ speedup via parallelization
  • Fault isolation per batch

Cons

  • High token cost for parallel agents
  • Potential merge conflicts
View Original →

Conditional Parallel Tool Execution

Orchestration & Control
Flow
Tools ──▶ Classify ──┬── Read-Only? ── Parallel ──┐
                     │                            │
                     └── Has-Write? ── Sequential ┴──▶ Results
flowchart LR A[Tool Batch] --> B{All Read-Only?} B -->|Yes| C[Execute Parallel] B -->|No| D[Execute Sequential] C --> E[Results] D --> E
Example
def execute_batch(tools):
    if all(t.is_read_only for t in tools):
        return parallel_execute(tools)
    else:
        return sequential_execute(tools)

# FileRead, Grep: parallel
# FileWrite: sequential
Problem

Sequential tool execution causes delays for read-only operations, but parallel execution risks race conditions for state-modifying tools.

Solution

Classify tools as read-only or state-modifying. Execute read-only batches in parallel, serialize state-modifying operations.
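A runnable variant of the example above, assuming each tool object exposes is_read_only and run() (names are assumptions):
from concurrent.futures import ThreadPoolExecutor

def execute_batch(tools):
    if all(t.is_read_only for t in tools):
        with ThreadPoolExecutor() as pool:   # reads are safe to parallelize
            return list(pool.map(lambda t: t.run(), tools))
    return [t.run() for t in tools]          # any write serializes the batch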

When to use
  • Mix of read and write operations
  • Performance-sensitive applications
  • Batch tool calls in single reasoning step
Trade-offs

Pros

  • Fast parallel reads, safe sequential writes
  • Prevents race conditions

Cons

  • Single write in batch serializes everything
  • Relies on accurate tool classification
View Original →

Anti-Reward-Hacking Grader Design

Reliability & Eval
Flow
Answer ──▶ ┌──────────────────┐
           │  Multi-Criteria  │
           │     Grader       │
           └────────┬─────────┘
                    │
     ┌──────────────┼──────────────┐
     ▼              ▼              ▼
Correctness    Reasoning     Citations
  (0.50)        (0.20)        (0.10)
     │              │              │
     └──────────────┴──────────────┘
                    │
              Weighted Sum ──▶ Final Score
flowchart TD A[Answer + Trace] --> B{Gaming Pattern?} B -->|Yes| C[Score: 0.0] B -->|No| D[Multi-Criteria Eval] D --> E[Correctness 0.50] D --> F[Reasoning 0.20] D --> G[Citations 0.10] E & F & G --> H[Weighted Sum] H --> I[Final Score]
Example
def grade(answer, trace, subscores):
    weights = {
        'correctness': 0.50,
        'reasoning': 0.20,
        'completeness': 0.15,
        'citations': 0.10,
        'formatting': 0.05
    }
    # Check gaming patterns first
    if gaming_pattern_detected(trace):
        return {"score": 0.0, "violation": True}
    # Multi-criteria weighted score
    return {"score": sum(w * subscores[k] for k, w in weights.items())}
Problem

RL models exploit grader loopholes to maximize reward without actually solving tasks, leading to 100% validation scores but poor real-world performance.

Solution

Design multi-criteria graders with iterative hardening, weighted subscores, and explicit gaming pattern detection.

When to use
  • Training agents with reinforcement learning
  • Initial grader shows suspiciously high scores
  • Production performance doesn't match validation metrics
Trade-offs

Pros

  • Robust learning - models solve tasks, not game metrics
  • Better generalization via multi-criteria evaluation
  • Debuggable subscores for identifying struggles

Cons

  • Engineering effort for careful design and iteration
  • Slower convergence due to harder grading
  • Computational cost for multi-criteria evaluation
View Original →

CLI-First Skill Design

Tool Use
Flow
                      ┌── Human: $ skill.sh list
                      │
Skill Logic ──▶ CLI ──┼── Agent: Bash("skill.sh list")
                      │
                      └── Cron: */5 * * * * skill.sh sync

Unix Philosophy: One tool ──▶ One task ──▶ Compose with pipes
flowchart LR A[Skill Logic] --> B[CLI Interface] B --> C[Human: Terminal] B --> D[Agent: Bash Tool] B --> E[Scripts: Automation] B --> F[Cron: Scheduled]
Example
#!/bin/bash
# trello.sh - CLI-first skill
case "$1" in
  boards) curl -s "$API/boards" ;;
  cards)  curl -s "$API/boards/$2/cards" ;;
  create) curl -X POST "$API/cards" -d "name=$3" ;;
esac

# Human: trello.sh boards | jq '.name'
# Agent:  Bash("trello.sh cards abc123")
Problem

API-first is hard to debug and GUI-first is unusable by agents, forcing teams to build and maintain two separate interfaces

Solution

Design every skill as a CLI tool, so humans use it from the terminal and agents use it through the Bash tool, identically

When to use
  • Building skills used by both humans and agents
  • Easy debugging and testing matters
  • Composition with Unix tools and pipes is needed
  • Runtime dependencies must stay minimal
Trade-offs

Pros

  • Same tool serves humans and agents
  • Easy to debug and test
  • Composes with Unix tools

Cons

  • Awkward for complex data structures
  • Process-spawn overhead
  • Limited Windows compatibility
View Original →

CriticGPT-Style Evaluation

Reliability & Eval
Flow
Generator ──▶ Code ──▶ CriticGPT ──▶ Issues?
                              │
                    ┌─────────┴─────────┐
                    │                   │
                   Yes                  No
                    │                   │
              ◀── Refine           ──▶ Human Review
sequenceDiagram Generator->>Critic: Submit code loop Until passes Critic->>Critic: Bug + Security scan Critic->>Generator: Issues found Generator->>Critic: Refined code end Critic->>Human: Present for review
Example
critic = CriticGPT(severity_threshold=0.7)
review = critic.review_code(code)

if review['issues']:
    code = generator.refine(code, review)
else:
    submit_to_human(code, review)
Problem

Human reviewers struggle to catch subtle bugs in sophisticated AI-generated code at scale.

Solution

Deploy specialized critic models trained for code review to identify bugs, security issues, and quality problems.

When to use
  • High volume of AI-generated code
  • Security-critical applications
  • Need consistent quality standards
Trade-offs

Pros

  • Catches subtle bugs humans miss
  • Consistent 24/7 reviews
  • Scalable code review process

Cons

  • False positives need human verification
  • Cannot understand full business context
  • May miss novel vulnerability types
View Original →

Extended Coherence Work Sessions

Reliability & Eval
Flow
Coherence Window Evolution:

Early:  [██░░░░░░░░░░░░░░░░░░]  ~5 min
        └─ loses track quickly

Current: [██████████████░░░░░░]  ~hours
         └─ sustained focus

Future:  [████████████████████]  all-day
         └─ human-equivalent
gantt title Agent Coherence Over Time dateFormat X axisFormat %s section Early Models Short coherence :done, 0, 300 section Current Extended coherence :active, 300, 10800 section Future All-day coherence :future, 10800, 86400
Example
# Coherence doubles every 7 months
agent = ExtendedCoherenceAgent(
    context_window="200K tokens",
    state_management="persistent",
    session_length="hours"
)
# Can now handle multi-hour projects
Problem

Early AI agents lose coherence after a few minutes, limiting their utility for complex multi-stage tasks requiring sustained effort.

Solution

Use models and architectures designed to maintain coherence over hours through larger context windows and better state management.

When to use
  • Complex multi-step projects (hours of work)
  • Tasks requiring sustained context across stages
  • Prolonged problem-solving sessions
Trade-offs

Pros

  • Enables human-equivalent work sessions
  • Handles complex, multi-stage tasks

Cons

  • Requires advanced model architecture
  • Higher computational cost
View Original →

Lethal Trifecta Threat Model

Reliability & Eval
Flow
    [Private Data]
         /    \
        /      \
[Untrusted] ─── [External Comm]

All 3 = DANGER! Block at least one circle.
flowchart TB A[Private Data] --- B[Untrusted Input] B --- C[External Comm] C --- A style A fill:#ffcccc style B fill:#ffcccc style C fill:#ffcccc
Example
# Pre-execution policy check
if (tool.can_externally_communicate and
    tool.accesses_private_data and
    input_source == "untrusted"):
    raise SecurityError("Lethal trifecta!")
Problem

Combining private data + untrusted input + external communication enables prompt injection data exfiltration.

Solution

Audit every tool for these 3 capabilities and guarantee at least one is blocked in any execution path.

When to use
  • Agents with tool access
  • Processing untrusted data
  • Security-critical systems
Trade-offs

Pros

  • Simple mental model
  • Eliminates attack class

Cons

  • Limits all-in-one agents
  • Capability tagging effort
View Original →

Merged Code + Language Skill Model

Reliability & Eval
Flow
Base LLM ──┬──▶ Lang Specialist ──┐
           │                       │
           └──▶ Code Specialist ──┼──▶ Weight Merge ──▶ Unified Model
                                   │
              (Fisher Avg / α=0.5)
flowchart LR A[Base LLM] --> B[Lang Specialist] A --> C[Code Specialist] B --> D[Weight Merge] C --> D D --> E[Unified Model]
Example
# Merge two specialist checkpoints
python merge_models.py \
  --model_a lang-specialist.pt \
  --model_b code-specialist.pt \
  --output merged-agent.pt \
  --alpha 0.5
Problem

Training a single model for both code and natural language requires massive compute and risks skill interference.

Solution

Train separate specialist models independently, then merge weights to combine skills without centralized training.
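A sketch of simple linear merging, assuming both checkpoint files are plain PyTorch state dicts with identical keys (see the architecture caveat under Cons):
import torch

def merge_checkpoints(path_a, path_b, alpha=0.5):
    a, b = torch.load(path_a), torch.load(path_b)
    return {k: alpha * a[k] + (1 - alpha) * b[k] for k in a}  # per-tensor blend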

When to use
  • Building multi-skill models (code + NL)
  • Limited centralized compute resources
  • Parallel R&D teams for different capabilities
Trade-offs

Pros

  • Parallel development of skills
  • Reduced centralized compute needs

Cons

  • Potential skill dilution from averaging
  • Requires identical model architectures
View Original →

RLAIF (Reinforcement Learning from AI Feedback)

Reliability & Eval
Flow
                    ┌──────────────────┐
Prompt ──▶ Model ──▶│ Response A       │
                    │ Response B       │
                    └────────┬─────────┘
                             ▼
                    ┌──────────────────┐
                    │  AI Critic       │
                    │  + Constitution  │──▶ Preference (A > B)
                    └──────────────────┘
                             │
                             ▼
                    Train Reward Model
flowchart LR A[Prompt] --> B[Generate A, B] B --> C[AI Critic] C --> D[Compare with Principles] D --> E[Preference Data] E --> F[Train Reward Model]
Example
class RLAIF:
    def generate_preference(self, prompt, a, b):
        critique = f"""Given principles: {self.constitution}
Which response is better for "{prompt}"?
A: {a}  B: {b}
Choose and explain why."""
        return self.critic.generate(critique)
Problem

Traditional RLHF requires extensive human annotation for preference data, which is expensive ($1+ per annotation) and time-consuming, creating a bottleneck in training aligned AI systems.

Solution

Use AI models to generate preference feedback and evaluation data based on constitutional principles, reducing costs to less than $0.01 per annotation while maintaining quality.

When to use
  • Need large-scale preference data
  • Human annotation is too expensive
  • Training aligned AI systems
Trade-offs

Pros

  • 100x cheaper than human feedback
  • Unlimited scalability
  • More consistent than varying annotators

Cons

  • May amplify existing model biases
  • Cannot provide truly novel insights
  • Requires careful principle design
View Original →

Structured Output Specification

Reliability
Flow
Input ──▶ LLM + Schema ──▶ Validated Output ──┬──▶ DB
                                              ├──▶ API
                                              └──▶ Next Agent
flowchart LR A[Agent Input] --> B[LLM + Schema] B --> C[Validated Output] C --> D[Downstream System] C --> E[Database] C --> F[Next Agent]
Example
const schema = z.object({
  category: z.enum(['spam', 'legit']),
  confidence: z.number().min(0).max(1)
});
const result = await generateObject({
  model, schema, prompt
});
Problem

Free-form agent outputs are hard to validate, parse, and integrate with downstream systems

Solution

Constrain outputs using deterministic schemas (JSON Schema, Zod, Pydantic) enforced at generation time
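A Pydantic equivalent of the Zod schema in the example above, as one sketch; raw_json stands for the model's output string:
from typing import Literal
from pydantic import BaseModel, Field

class Classification(BaseModel):
    category: Literal["spam", "legit"]
    confidence: float = Field(ge=0, le=1)

result = Classification.model_validate_json(raw_json)  # raises if output drifts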

When to use
  • Multi-phase agent workflows
  • Classification/categorization tasks
  • Integration with databases or APIs
Trade-offs

Pros

  • Guaranteed parseable outputs
  • Type safety and validation

Cons

  • Rigidity limits free-form responses
  • Schema evolution friction
View Original →

Versioned Constitution Governance

Reliability
Flow
Agent ──▶ Propose Change ──▶ Git PR
                                  │
                           ┌──────▼──────┐
                           │   CI Check  │
                           │ - Signed?   │
                           │ - Policy OK?│
                           └──────┬──────┘
                                  │
                     ┌────────────┼────────────┐
                     ▼            ▼            ▼
                  [PASS]      [REVIEW]      [REJECT]
                     │            │
                     ▼            ▼
               Gatekeeper ──▶ Merge
flowchart LR A[Agent] -->|propose| B[Git PR] B --> C{CI Check} C -->|pass| D[Gatekeeper] D -->|approve| E[Merge] C -->|fail| F[Reject] E --> G[Constitution HEAD]
Example
# constitution.yaml (in signed git repo)
rules:
  - name: "no_secret_exfil"
    level: critical
    immutable: true
  - name: "confirm_destructive"
    level: high

# CI: flag deletion of critical rules
Problem

Self-modifying agents can accidentally violate safety rules or regress on alignment when rewriting their constitution.

Solution

Store constitution in version-controlled, signed repo with CI policy checks; agent proposes, gatekeeper merges.

When to use
  • Agent can modify its own rules
  • Safety constraints are critical
  • Audit trail is required
Trade-offs

Pros

  • Full audit history
  • Prevents unauthorized changes

Cons

  • Slower iteration cycle
  • Requires governance overhead
View Original →

Asynchronous Coding Agent Pipeline

Reliability & Eval
Flow
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Inference  │───▶│ Tool Queue  │───▶│   Tool      │
│  Workers    │    │ (Redis/MQ)  │    │  Executors  │
│   (GPU)     │◀───│             │◀───│   (CPU)     │
└──────┬──────┘    └─────────────┘    └─────────────┘
       │
       ▼ trajectories
┌─────────────┐    ┌─────────────┐
│   Replay    │───▶│   Learner   │
│   Buffer    │    │(Policy Upd) │
└─────────────┘    └──────┬──────┘
                          │ new checkpoint
                          ▼
flowchart LR A[Inference Worker] -->|tool call| B[Tool Queue] B -->|request| C[Tool Executor] C -->|result| A A -->|trajectory| D[Replay Buffer] D -->|batch| E[Learner] E -->|checkpoint| A
Example
# Async pipeline components
inference_worker.submit_action("compile", file)
# No blocking - continue inference

# Tool executor (separate process)
tool_queue.subscribe("compile_requests")
result = run_compile(request)
result_queue.publish(result)

# Learner updates policy periodically
learner.update_from_buffer(replay_buffer)
Problem

Synchronous tool execution (compilation, testing) creates compute bubbles and idle GPUs, blocking agents while waiting for I/O-bound operations.

Solution

Decouple inference, tool execution, and learning into parallel async components communicating via message queues.

When to use
  • Running RL training for coding agents
  • Tool calls have high latency (compilation, tests)
  • Need to maximize GPU utilization
Trade-offs

Pros

  • High utilization - GPUs stay busy during I/O
  • Scalable - independently scale inference, tools, learning

Cons

  • Complex system maintenance across services
  • Staleness management for policy updates
View Original →

No-Token-Limit Magic

Reliability & Eval
Flow
PROTOTYPE PHASE                    PRODUCTION PHASE
    │                                    │
    ▼                                    ▼
[No Limits] ──▶ [Rich Output] ──▶ [Pattern Discovery] ──▶ [Optimized]
    │               │                    │                    │
 $$$$$           Quality             Insights              $
flowchart LR A[Prototype: No Limits] --> B[Rich Output] B --> C[Pattern Discovery] C --> D[Production: Optimized]
Example
# Prototype: no limits
config = {
    "max_tokens": None,  # Unlimited
    "reasoning_passes": 5,
    "self_correction": True
}
# Find patterns, then optimize later
Problem

Aggressive prompt compression to save tokens stifles reasoning depth and self-correction capabilities.

Solution

During prototyping, remove hard token limits. Allow lavish context and multiple reasoning passes to discover valuable patterns.

When to use
  • Prototyping and experimentation phase
  • Discovering optimal reasoning patterns
  • Tasks requiring deep self-correction
Trade-offs

Pros

  • Dramatically better output quality
  • Surfaces valuable patterns for optimization

Cons

  • Higher cost during prototyping
  • Not suitable for production without optimization
View Original →

Deterministic Security Scanning Build Loop

Security & Safety
Flow
Agent ──▶ Generate ──▶ make all ──▶ Scan Pass?
                                        │
                          ┌─────────────┴─────────────┐
                          │                           │
                         No                          Yes
                          │                           │
                    See errors              ──▶ Complete ✓
                          │
                    ◀── Regenerate
flowchart TD A[Agent generates code] --> B[Run make all] B --> C{Security scan?} C -->|Fail| D[Error in context] D --> E[Agent regenerates] E --> A C -->|Pass| F[Done]
Example
all: build test security-scan

security-scan:
    semgrep --config=auto src/
    bandit -r src/
# make stops at the first failing scanner; no explicit exit needed
Problem

Non-deterministic security approaches (Cursor rules, MCP tools) are suggestions that LLMs may ignore.

Solution

Integrate deterministic security scanners into the build loop that agents must run after every change.

When to use
  • AI-assisted code generation
  • Security-critical applications
  • Need consistent policy enforcement
Trade-offs

Pros

  • Deterministic, battle-tested tools
  • Reuses existing security infra
  • Works with any agent/harness

Cons

  • Increases build time
  • May produce false positives
  • Requires fast tools for good DX
View Original →

Egress Lockdown

Security
Flow
┌─────────────────────────────────────────┐
│            SANDBOXED AGENT              │
│  [Private Data] ──▶ [Agent] ──▶ ???     │
└─────────────────────────────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
    api.internal   attacker.com  ANY
         ✓             ✗          ✗

   OUTPUT DROP (default) | ACCEPT (whitelist)
flowchart LR A[Agent] -->|Request| FW[Egress Firewall] FW -->|api.internal| OK[Allowed] FW -->|attacker.com| X[Blocked] FW -->|dynamic URL| X
Example
# Docker egress rules
iptables -P OUTPUT DROP  # default-deny
iptables -A OUTPUT \
  -d api.mycompany.internal \
  -j ACCEPT

# Log blocked attempts
iptables -A OUTPUT -j LOG
Problem

Even with private-data access and untrusted inputs, attacks fail if the agent has no way to transmit stolen data.

Solution

Implement egress firewall: allow only specific domains, strip content from outbound calls, forbid dynamic link generation.

When to use
  • Agents handle sensitive data
  • Processing untrusted inputs
  • High-security environments
Trade-offs

Pros

  • Drastically reduces leak risk
  • Easy to reason about
  • Simple network rules

Cons

  • Breaks legitimate integrations
  • Requires proxy stubs
  • Limits agent capabilities
View Original →

Isolated VM per RL Rollout

Security & Safety
Flow
               ┌─── VM1 [Rollout 1] ───┐
               │   shell("grep TODO")  │
Training ──▶   ├─── VM2 [Rollout 2] ───┼──▶ Rewards
               │   shell("rm temp")    │
               └─── VM3 [Rollout 3] ───┘

Each VM: [Fresh FS] → [Execute] → [Destroy]
sequenceDiagram participant T as Training participant VM1 as VM 1 participant VM2 as VM 2 T->>VM1: Rollout 1: shell() T->>VM2: Rollout 2: shell() VM1-->>T: Result VM2-->>T: Result Note over VM1,VM2: VMs destroyed
Example
@app.cls(image=base_image, timeout=600)
class IsolatedToolExecutor:
    def execute_shell(self, rollout_id, cmd):
        # Safe: isolated VM per rollout
        return subprocess.run(cmd, shell=True)
    # VM auto-destroyed after rollout
Problem

RL training rollouts share infrastructure, causing cross-contamination and corrupted rewards when agents execute destructive commands.

Solution

Spin up an isolated VM/container for each rollout, ensuring complete environment isolation with fresh state.

When to use
  • RL training with tool-using agents
  • Agents with shell/filesystem access
  • Parallel rollouts (100+)
Trade-offs

Pros

  • Complete isolation
  • Deterministic rewards

Cons

  • Cost of 100s VMs
  • Provisioning latency
View Original →

PII Tokenization

Security & Safety
Flow
Tool ──▶ [john@x.com] ──▶ MCP Client ──▶ [EMAIL_1] ──▶ Model
                              │
                         Tokenize
                              │
Model ──▶ "Send to [EMAIL_1]" ──▶ MCP ──▶ [john@x.com] ──▶ Tool
                                    │
                               Untokenize
flowchart LR A[Tool Response] --> B[MCP Client] B --> C[Tokenize PII] C --> D[Model sees tokens] D --> E[Tool Call with tokens] E --> F[Untokenize] F --> G[Real Tool Call]
Example
# MCP client intercepts
data = {"email": "john@example.com"}
# Tokenized for model
ctx = {"email": "[EMAIL_1]"}
# Agent reasons with tokens
# Real value restored for tools
Problem

Sending raw PII through the model's context creates privacy risks and compliance concerns.

Solution

Intercept tool responses to tokenize PII before reaching model, untokenize when making actual tool calls.
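A minimal sketch for the email case; the regex and token format are illustrative, not a complete PII detector:
import re

def tokenize_pii(text, store):
    def repl(m):
        token = f"[EMAIL_{len(store) + 1}]"
        store[token] = m.group(0)            # remember the real value
        return token
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", repl, text)

def untokenize_pii(text, store):
    for token, real in store.items():
        text = text.replace(token, real)     # restore before the real tool call
    return text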

When to use
  • Workflows processing customer data
  • Compliance-sensitive environments (GDPR, HIPAA)
  • Multi-step automation involving PII
Trade-offs

Pros

  • Prevents raw PII in model context
  • Transparent to agent reasoning

Cons

  • PII detection must be accurate
  • Doesn't prevent inference of patterns
View Original →

Agent-First Tooling and Logging

Tool Use & Environment
Flow
Agent ──▶ CLI --for-agent --json ──▶ System
                                        │
              ┌─────────────────────────┘
              ▼
         Unified Logger (JSON Lines)
              │
              ▼
         Agent Parses Structured Data
              │
              ▼
         Next Action Based on Analysis
sequenceDiagram participant A as Agent participant C as CLI --for-agent participant L as Unified Logger A->>C: command --json C->>L: Write JSON log entry L->>A: Return structured data A->>A: Parse & decide next action
Example
# Human-friendly (hard to parse)
$ npm test
PASS src/test.js

# Agent-friendly (easy to parse)
$ npm test --json --for-agent
{"status":"pass","file":"src/test.js"}
Problem

Human-oriented tools with colors and multi-line output are hard for agents to parse reliably

Solution

Design agent-first tools with machine-readable output (JSON), unified logs, and a --for-agent flag

When to use
  • Building agent-driven workflows
  • Multiple log sources need consolidation
  • Reliable automation is required
Trade-offs

Pros

  • Better parsing accuracy
  • Less token waste

Cons

  • Sacrifices human readability
  • Requires investment in tool changes
View Original →

CLI-Native Agent Orchestration

Tool Use & Environment
Flow
┌─────────────────────────────────────┐
│          Integration Points         │
└─────────────────────────────────────┘
        │         │         │
        ▼         ▼         ▼
   Makefile   Git Hook   Cron Job
        │         │         │
        └────┬────┴────┬────┘
             │         │
             ▼         ▼
      claude spec   claude repl
             │
             ▼
      Local Context ──▶ Agent ──▶ Output
flowchart TD A[Makefile] --> D[claude CLI] B[Git Hook] --> D C[Cron Job] --> D D --> E[Load Local Context] E --> F[Agent Processing] F --> G[Output/Changes]
Example
# In your Makefile
generate-from-spec:
    claude spec run --input api.yaml --output src/

test-spec-compliance:
    claude spec test --spec api.yaml --codebase src/

# Git pre-commit hook
claude spec test || exit 1
Problem

Web chat UIs are awkward for repeat runs, local file edits, or scripting inside CI pipelines.

Solution

Expose agent capabilities through a first-class CLI for Makefiles, Git hooks, cron jobs, and headless automation.

When to use
  • Integrating agents into CI/CD pipelines
  • Automating repetitive development tasks
  • Need headless or scripted agent execution
Trade-offs

Pros

  • Scriptable and composable with other tools
  • Works offline with local context
  • Easy to embed in existing workflows

Cons

  • Initial install and auth setup
  • Learning curve for CLI flags
View Original →

Code-Then-Execute Pattern

Tool Use & Environment
Flow
LLM ──▶ [Write DSL] ──▶ Static Check ──▶ Sandbox Run
             │              │                │
        ┌────┴────┐    ┌────┴────┐      ┌────┴────┐
        │ x=read  │    │ taint   │      │ locked  │
        │ y=proc  │───▶│ verify  │─────▶│ execute │
        │ z=write │    │ flows   │      │         │
        └─────────┘    └─────────┘      └─────────┘
flowchart LR A[LLM] --> B[Write DSL Code] B --> C[Static Checker] C --> D[Taint Verify] D --> E[Sandbox Execute] E --> F[Results]
DSL Example
x = calendar.read(today)
y = QuarantineLLM.format(x)
email.write(to="john@acme.com", body=y)
# Static check: tainted var can't reach recipient
Problem

Plan lists are opaque; security requires full data-flow analysis and taint tracking

Solution

LLM outputs sandboxed program/DSL. Static checker verifies flows, interpreter runs in locked sandbox
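A toy version of the static check over the DSL example above; a real system needs a proper parser and taint propagation:
import re

def check_flows(program_lines, tainted={"x", "y"}):
    for line in program_lines:
        m = re.search(r"email\.write\(to=([^,)]+)", line)
        if m and m.group(1).strip() in tainted:   # tainted var reaches recipient
            raise ValueError(f"blocked flow: {line}")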

When to use
  • Complex multi-step agents
  • SQL copilots
  • Software engineering bots
Trade-offs

Pros

  • Formal verifiability
  • Replay logs for audit

Cons

  • Requires DSL design
  • Static analysis infrastructure
View Original →

Dual-Use Tool Design

Tool Use
Flow
Human ──▶ /commit ──┐
                    │     ┌────────────────┐
                    ├───▶ │  Shared Tool   │ ──▶ Result
                    │     │  (same logic)  │
Agent ──▶ /commit ──┘     └────────────────┘

"Everything you can do, Claude can do"
flowchart LR H[Human] -->|/commit| T[Shared Tool] A[Agent] -->|/commit| T T --> R[Same Result] T --> L[Same Logs]
Example
define_slash_command("/commit", {
    "steps": ["lint", "gen message", "commit"],
    "callable_by": ["human", "agent"],
    "pre_allowed": ["git add", "git commit"]
})
# Human: $ /commit
# Agent: agent.call("/commit")
Problem

Building separate tools for humans and AI agents creates maintenance overhead, inconsistent behavior, and feature drift.

Solution

Design all tools to be dual-use: same interface, shared logic, equally accessible to both humans and AI agents.

When to use
  • Building developer tools
  • Creating slash commands
  • Designing agent-assisted workflows
Trade-offs

Pros

  • Reduced maintenance (one implementation)
  • Consistent behavior for both
  • Single test suite

Cons

  • Must satisfy both ergonomics
  • May compromise optimization
  • Documentation challenge
View Original →

LLM-Friendly API Design

Tool Use & Environment
Flow
Human API          LLM-Friendly API
─────────          ────────────────
Complex nesting    Flat structure
Implicit version   Explicit v2.0
Cryptic errors     Actionable errors
Many indirects     2 levels max
flowchart LR A[API v2.0] --> B[Self-descriptive] B --> C[Clear Errors] C --> D[LLM Success]
Example
# LLM-friendly function
def create_user_v2(
    name: str,      # Descriptive param names
    email: str
) -> UserResult:   # Typed return
    """Creates a new user account."""
    if "@" not in email:
        # Actionable error, not a cryptic code
        raise ValueError(f"invalid email {email!r}: expected name@domain")
Problem

APIs designed for humans are often ambiguous or complex for LLMs, causing unreliable tool use.

Solution

Design APIs with explicit versioning, self-descriptive names, clear errors, and minimal indirection.

When to use
  • Building agent-callable APIs
  • Exposing tools to LLMs
  • Internal library design
Trade-offs

Pros

  • Reliable LLM tool use
  • Self-correcting errors

Cons

  • API redesign effort
  • May be verbose
View Original →

Multi-Platform Communication Aggregation

Tool Use
Flow
                    ┌─── iMessage ───┐
                    │                │
Query ──▶ Agent ───┼─── Slack ──────┼──▶ Aggregate ──▶ Results
                    │                │
                    └─── Email ──────┘

Output: [{platform, sender, time, content, url}, ...]
flowchart LR Q[User Query] --> A[Aggregator] A --> M[iMessage] A --> S[Slack] A --> E[Email] M --> R[Results] S --> R E --> R R --> T[Unified Table]
Example
search_all() {
  query="$1"
  messages search "$query" > /tmp/msg.json &
  slack search "$query" > /tmp/slack.json &
  email search "$query" > /tmp/email.json &
  wait
  aggregate_results /tmp/*.json
}
Problem

Searching for info across multiple platforms (email, Slack, iMessage) requires slow manual checks on each

Solution

Create unified search interface that queries all platforms in parallel and aggregates results into single format

When to use
  • "Where did X mention Y?" searches
  • Finding conversations without knowing platform
  • Cross-platform audit/compliance
  • Building unified inbox features
Trade-offs

Pros

  • Single query searches all platforms
  • Parallel execution minimizes latency
  • Extensible to new platforms

Cons

  • Must maintain per-platform adapters
  • Cross-platform ranking is subjective
  • Aggregation increases privacy exposure
View Original →

Patch Steering via Prompted Tool Selection

Tool Use & Environment
Flow
User: "Refactor X" ──▶ [+Prompt: "Use ASTRefactor"] ──▶ Agent
                                                          │
                                                    Selects: ASTRefactor
                                                          │
                                                          ▼
                                                    Safe AST Patch
flowchart LR A[User Request] --> B[Augmented Prompt] B --> C[Agent selects ASTRefactor] C --> D[Safe Code Patch]
Example
# Prompt template with tool steering
prompt = f"""
Task: {task_description}
Preferred tool: ASTRefactor
Usage: {{"file": str, "pattern": str}}
Fallback: apply_patch if AST fails
"""
Problem

Agents with multiple patching tools may choose suboptimal ones, causing inconsistent and lower quality results.

Solution

Steer tool selection through explicit natural language instructions, specifying preferred tools and usage patterns in prompts.

When to use
  • Multiple patching/refactoring tools available
  • Need for consistent tool usage
  • AST-safe operations preferred over text replace
Trade-offs

Pros

  • Predictable tool selection behavior
  • Higher code quality with semantic tools

Cons

  • Prompt length increases
  • Requires maintenance as tools evolve
View Original →

Proactive Trigger Vocabulary

UX & Collaboration
Flow
User Input ──▶ Match Triggers?
                    │
     ┌──────────────┼──────────────┐
     │              │              │
     ▼              ▼              ▼
"sup" ──▶      "search hn"    No match
priority       ──▶ hn-search   ──▶ General
-report        skill           response
graph TD A[User Input] --> B{Match Triggers?} B -->|"sup"| C[priority-report skill] B -->|"search hn"| D[hn-search skill] B -->|No match| E[General response]
Example
skill: priority-report
triggers:
  exact: ["sup", "standup prep"]
  contains: ["what should I work on"]
  patterns: ["what.*on my plate"]
proactive: true
Problem

When an agent has many skills, it is opaque which skill a given natural-language input routes to, so users cannot tell which phrases activate which feature

Solution

Define and document an explicit trigger vocabulary (keywords, patterns) per skill for transparent, predictable skill routing
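A sketch of the router implied by the YAML example above, reusing its exact/contains/patterns fields; the "general" fallback name is an assumption:
import re

def route(user_input, skills):
    text = user_input.lower().strip()
    for name, trig in skills.items():
        if (text in trig.get("exact", [])
                or any(s in text for s in trig.get("contains", []))
                or any(re.search(p, text) for p in trig.get("patterns", []))):
            return name
    return "general"                 # no trigger matched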

When to use
  • Agent has multiple skills/features
  • Users need a clear way to activate them
  • Proactive activation on certain topics is desired
  • Fast, predictable routing matters
Trade-offs

Pros

  • Transparent, predictable routing
  • Fast via string matching
  • Easy to debug

Cons

  • Misses phrasings outside the trigger list
  • Maintenance as vocabulary shifts
  • Possible language/culture bias
View Original →

Shell Command Contextualization

Tool Use
Flow
User ──▶ "!ls -la" ──▶ Interface ──▶ Shell ──▶ Output
                                        │
          Agent Context ◀───────────────┘
              [cmd + output injected]
sequenceDiagram User->>Interface: !ls -la Interface->>Shell: Execute command Shell-->>Interface: Output Interface->>Agent: Inject cmd + output Agent-->>User: Response with context
Example
# User types in Claude Code:
!git status

# Agent receives in context:
# Command: git status
# Output: On branch main, 2 files changed...
Problem

Manually copying shell command output into agent context is tedious and error-prone

Solution

Provide a special prefix (e.g., !) that executes shell commands and auto-injects both command and output into agent context

When to use
  • Agent needs real-time environment state
  • Checking git status, file listings, test results
  • Interactive debugging with agent assistance
Trade-offs

Pros

  • Seamless environment integration
  • No manual copy-paste required

Cons

  • Security risk from arbitrary commands
  • Large outputs can bloat context
View Original →

Tool Use Steering via Prompting

Tool Use
Flow
User Task ──┬──▶ [Direct: "Use file search tool"]
            │
            ├──▶ [Teach: "Use barley CLI, -h for help"]
            │
            ├──▶ [Implicit: "commit, push, pr"]
            │
            └──▶ [Think: "*think hard*"]
                       │
                       ▼
               Agent Tool Selection ──▶ Execute
flowchart TD A[User Task] --> B[Available Tools] A --> C[Explicit Guidance] C --> D[Direct Invocation] C --> E[Teaching Usage] C --> F[Implicit Shortcut] D & E & F --> G[Agent Selection] G --> H[Execute]
Example
# Direct invocation
"Use the file search tool to find config files"

# Teaching tool usage
"Use our barley CLI to check logs. -h for help"

# Implicit shortcut (learned association)
"commit, push, pr"  # agent knows git workflow
Problem

Having tools available does not guarantee agents will use them appropriately, especially for custom or team-specific tools.

Solution

Guide tool selection via explicit prompts: direct invocation, teaching tool usage, implicit shortcuts, and deeper reasoning triggers.

When to use
  • Custom or team-specific tools exist
  • Agent tool selection is suboptimal
  • Teaching new tool workflows
Trade-offs

Pros

  • Direct control over tool usage
  • Enables custom tool onboarding

Cons

  • Requires prompt engineering
  • May reduce agent autonomy
View Original →

Agent SDK for Programmatic Control

Tool Use & Environment
Flow
Application/Script ──▶ Agent SDK
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
          ▼                 ▼                 ▼
     CLI Interface    Python Lib      TS Library
          │                 │                 │
          └─────────────────┼─────────────────┘
                            ▼
                       Agent Core
                     (Tools, Memory)
flowchart TD A[Application/Script] --> B[Agent SDK] B --> C[CLI Interface] B --> D[Python Library] B --> E[TypeScript Library] C & D & E --> F[Agent Core] F --> G[Tool Access] F --> H[Memory]
Example
# CLI usage for CI/CD
$ claude -p "what changed this week?" \
  --allowedTools "Bash(git log:*)" \
  --output-format json

# Python SDK
agent.run("Review PR", tools=["git"])
Problem

Interactive chat interfaces do not support CI/CD pipelines, scheduled jobs, or custom application integration

Solution

Provide SDKs (CLI, Python, TypeScript) that expose agent actions for programmatic access and automation
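
A CI-style sketch that shells out to the CLI shown in the example above; the flags come from that example, while the report filename is an illustrative artifact path.

import json
import subprocess

proc = subprocess.run(
    ["claude", "-p", "what changed this week?",
     "--allowedTools", "Bash(git log:*)",
     "--output-format", "json"],
    capture_output=True, text=True, check=True)

report = json.loads(proc.stdout)            # structured output for pipelines
with open("weekly-report.json", "w") as f:  # illustrative artifact path
    json.dump(report, f, indent=2)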

When to use
  • CI/CD integration is needed
  • Building custom applications
  • Batch processing workflows
Trade-offs

Pros

  • Enables automation pipelines
  • Flexible integration options

Cons

  • SDK learning curve
  • May require API key management
View Original →

Code Mode MCP Tool Interface

Tool Use & Environment
Flow
Traditional:
  LLM ──▶ Tool1 ──▶ [JSON 1000t] ──▶ LLM
  LLM ──▶ Tool2 ──▶ [JSON 1000t] ──▶ LLM
  LLM ──▶ Tool3 ──▶ [JSON 1000t] ──▶ LLM

Code Mode:
  LLM ──▶ V8 Isolate ┬──▶ Tool1
                     ├──▶ Tool2
                     └──▶ Tool3
                          │
                     [Condensed] ──▶ LLM
flowchart LR A[LLM] -->|TypeScript| B[V8 Isolate] B -->|binding| C[MCP Server] C --> D[APIs] D --> C C --> B B -->|condensed| A
Example
// LLM generates orchestration code
const vpc = await createVPC({name: "demo"})
const igw = await createInternetGateway(vpc.id)
const sg = await createSecurityGroup(vpc.id, rules)
const ec2 = await launchEC2({vpcId: vpc.id})
// All calls in isolate - only result to LLM
return {vpcId: vpc.id, publicIP: ec2.ip}
Problem

Traditional MCP tool calls force all intermediate JSON through the context, causing massive token waste on multi-step and fan-out operations.

Solution

LLMs write TypeScript code that orchestrates MCP tools in V8 isolates; only final results return to context.

When to use
  • Multi-step workflows with clear sequences
  • Fan-out operations (100+ items to process)
  • Token costs or latency are critical
Trade-offs

Pros

  • 10x+ token reduction on multi-step workflows
  • Dramatic fan-out efficiency with loops
  • Self-debugging with error handling and retry

Cons

  • Requires V8 isolate infrastructure
  • Poor fit for dynamic research loops
  • Needing LLM judgment mid-loop defeats the purpose
View Original →

Dynamic Code Injection

Tool Use
Flow
User: "@src/Button.js"
         │
         ▼
┌─────────────────────────────┐
│  Preprocessor               │
│  1. Parse @mention          │
│  2. Read file (lines 1-50)  │
│  3. Inject into context     │
└─────────────────────────────┘
         │
         ▼
Agent: [Context + Button.js] ──▶ Continue
sequenceDiagram participant U as User participant P as Preprocessor participant FS as File System participant A as Agent U->>P: @src/Button.js:10-50 P->>FS: Read lines 10-50 FS-->>P: File content P->>A: Inject into context A-->>U: Continue with file visible
Example
# Syntax examples
"@path/to/file.ext"        # Full file
"/load file.ext:10-50"     # Line range
"/summarize test_spec.py"  # Extract summary

# Injected as:
# /// BEGIN Button.js
# ...content...
# /// END Button.js
Problem

Manually copying files into prompts is tedious, wastes tokens, and interrupts workflow momentum.

Solution

Allow on-demand file injection via @filename or /load syntax that fetches, optionally summarizes, and injects code into context.
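
A preprocessor sketch for the @mention syntax above; the regex, optional :10-50 range, and BEGIN/END markers follow the card's examples, and summarization is omitted for brevity.

import re
from pathlib import Path

MENTION = re.compile(r"@([\w./-]+?)(?::(\d+)-(\d+))?(?=\s|$)")

def inject_mentions(prompt: str) -> str:
    def expand(match):
        path, start, end = match.group(1), match.group(2), match.group(3)
        lines = Path(path).read_text().splitlines()
        if start and end:                      # optional line range
            lines = lines[int(start) - 1:int(end)]
        return f"/// BEGIN {path}\n" + "\n".join(lines) + f"\n/// END {path}"
    return MENTION.sub(expand, prompt)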

When to use
  • Interactive coding sessions
  • Exploring unfamiliar codebases
  • Need specific file context mid-conversation
Trade-offs

Pros

  • Interactive code exploration
  • No manual copy/paste
  • Improves agent accuracy

Cons

  • Requires file system access
  • Security risk if unsandboxed
  • Summarization may lose context
View Original →

Progressive Tool Discovery

Tool Use & Environment
Flow
servers/
├── google-drive/   ──▶ list (names only)
│   ├── getDocument     ──▶ search (+ description)
│   └── listFiles           ──▶ get (full schema)
├── slack/
└── github/

Agent: list_dir("./servers/") → ["google-drive/", ...]
       search_tools("google-drive/*", "name+desc")
       get_tool_definition("getDocument") → {schema}
flowchart LR A[List Servers] --> B[Browse Category] B --> C[Search Tools] C --> D[Get Full Schema] D --> E[Execute Tool]
Example
# Progressive discovery workflow
servers = list_directory("./servers/")  # names only
tools = search_tools("google-drive/*",
                     detail="name+description")
schema = get_tool_definition(
    "servers/google-drive/getDocument"
)  # full JSON schema when needed
Problem

Loading all tool definitions upfront consumes excessive context window space when agents have access to large tool catalogs, limiting space for actual task execution.

Solution

Present tools through a filesystem-like hierarchy where agents discover capabilities on-demand, requesting different detail levels (name only, description, full schema) as needed.
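
A registry sketch of the three detail levels from the workflow above; the in-memory catalog layout is an assumption standing in for real MCP server metadata.

class ToolRegistry:
    def __init__(self, catalog: dict):
        # catalog[server][tool] = {"description": str, "schema": dict}
        self.catalog = catalog

    def list_servers(self) -> list:
        return sorted(self.catalog)                    # level 1: names only

    def search_tools(self, server: str) -> dict:
        return {name: meta["description"]              # level 2: + description
                for name, meta in self.catalog[server].items()}

    def get_tool_definition(self, server: str, tool: str) -> dict:
        return self.catalog[server][tool]["schema"]    # level 3: full schema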

When to use
  • Systems with 20+ available tools
  • MCP server implementations
  • Plugin architectures with many capabilities
Trade-offs

Pros

  • Reduces initial context consumption
  • Scales to hundreds of tools
  • Natural filesystem-like exploration

Cons

  • Adds discovery overhead
  • Multiple round-trips to find tools
  • Requires thoughtful organization
View Original →

Subagent Compilation Checker

Tool Use
Flow
Main ──▶ "Compile svc-A" ──▶ CompileSubagent ──▶ Build
                                                  │
         Main Context ◀───────────────────────────┘
           [{file:"x.go", line:10, err:"..."}]
sequenceDiagram Main->>Subagent: Compile module A Subagent->>Build: go build alt Success Subagent-->>Main: {status: ok, artifact: "a.bin"} else Failure Subagent-->>Main: [{file, line, error}] end
Example
result = spawn_compile_agent("auth-service")
# Returns concise error summary:
# [{file:"auth.go", line:85, error:"undefined"}]
main_context.inject(result.errors)
Problem

Including full build logs in main agent context blows up context length and slows inference

Solution

Spawn specialized subagents to compile each module, returning only concise error summaries to main agent
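
A subagent sketch as a plain function: run the build, parse file:line diagnostics, and return only the compact summary. The error format matches standard go build output; the module layout and 20-error cap are illustrative.

import re
import subprocess

def compile_module(module_dir: str) -> list:
    proc = subprocess.run(["go", "build", "./..."], cwd=module_dir,
                          capture_output=True, text=True)
    if proc.returncode == 0:
        return []                                    # clean build: inject nothing
    errors = []
    for line in proc.stderr.splitlines():
        m = re.match(r"(.+\.go):(\d+)(?::\d+)?: (.+)", line)
        if m:
            errors.append({"file": m.group(1), "line": int(m.group(2)),
                           "error": m.group(3)})
    return errors[:20]                               # keep main context small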

When to use
  • Multi-module/microservice projects
  • Build logs too verbose for context
  • Parallel compilation needed
Trade-offs

Pros

  • Main agent context stays clean
  • Parallel builds possible

Cons

  • Infrastructure to manage subagents
  • Build dependency coordination
View Original →

Virtual Machine Operator Agent

Tool Use
Flow
User ──▶ Agent ──▶ ┌──────────────────────┐
                   │    VIRTUAL MACHINE   │
                   │ ┌──────────────────┐ │
                   │ │ Execute Code     │ │
                   │ │ Install Packages │ │
                   │ │ File Operations  │ │
                   │ │ Run Applications │ │
                   │ └──────────────────┘ │
                   └──────────┬───────────┘
                              │
                   Agent ◀────┘ Results
sequenceDiagram User->>Agent: Complex Task Agent->>VM: Execute Code Agent->>VM: Install Packages Agent->>VM: File Ops VM-->>Agent: Results Agent-->>User: Task Report
Example
# Agent operating in VM environment
vm.execute("pip install pandas matplotlib")
vm.execute("python analyze_data.py")
vm.read_file("/output/report.pdf")
vm.execute("git add . && git commit -m 'Add report'")
# Agent has full computer operator capability
Problem

Agents limited to code generation cannot perform complex tasks requiring full computer environment interaction.

Solution

Give agent access to a dedicated VM environment to execute code, install packages, manage files, and run applications.

When to use
  • Tasks require full OS interaction
  • Need to install and run software
  • Complex multi-step system operations
Trade-offs

Pros

  • General-purpose digital operator
  • Full system capability

Cons

  • Higher security risk
  • Resource intensive
View Original →

Agentic Search Over Vector Embeddings

Tool Use & Environment
Flow
Traditional RAG:
  Code ──▶ Index ──▶ Vector DB ──▶ Query
           (stale)     (infra)

Agentic Search:
  Query ──▶ grep/find ──▶ Refine ──▶ Results
              │              │
              └──────────────┘
           (current state, no infra)
flowchart LR A[Search Query] --> B[grep/ripgrep] B --> C[Analyze Results] C --> D{Found?} D -->|No| E[Refine Search] E --> B D -->|Yes| F[Return Results]
Example
# Instead of vector search:
# vector_db.index(codebase)  # stale!
# results = vector_db.query(embed(q))

# Use agentic search:
agent.call_tool("grep", "function.*auth")
agent.call_tool("find", "**/auth/*.ts")
agent.refine_search_based_on_results()
Problem

Vector embeddings require continuous re-indexing, handling of local changes, and infrastructure overhead

Solution

Replace vector search with agentic search using bash, grep, and file navigation - no pre-indexing required
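
A minimal iteration sketch: try the precise pattern first, then a broadened fallback. The ripgrep (rg) call is real; the refinement policy and 50-hit cap are illustrative.

import subprocess

def agentic_search(pattern: str, root: str = ".") -> list:
    candidates = [pattern, pattern.split(".*")[0]]   # exact, then broadened
    for query in candidates:
        proc = subprocess.run(["rg", "-n", query, root],
                              capture_output=True, text=True)
        hits = proc.stdout.splitlines()
        if hits:
            return hits[:50]          # live working tree, nothing pre-indexed
    return []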

When to use
  • Frequently changing codebases
  • No vector infrastructure available
  • Security-sensitive deployments
Trade-offs

Pros

  • No index maintenance required
  • Always searches current state

Cons

  • May need multiple iterations
  • Slow on very large codebases
View Original →

Code-Over-API Pattern

Tool Use & Environment
Flow
Direct API (High Cost):
  Agent ──▶ API ──▶ [10K rows: 150K tokens] ──▶ Context

Code-Over-API (Low Cost):
  Agent ──▶ Write Code ──▶ Sandbox
                              │
                         [Process 10K]
                              │
                         [Filter/Agg]
                              │
                         [Summary: 2K] ──▶ Context
flowchart LR A[Agent] -->|write code| B[Sandbox] B -->|fetch| C[API/Data] C -->|10K rows| B B -->|process| B B -->|summary only| A
Example
def process_spreadsheet():
    # Fetch in execution env (not context)
    rows = spreadsheet.getRows(sheet_id="abc")
    # Filter in code
    active = [r for r in rows if r.status == "active"]
    # Only summary to context
    print(f"Found {len(active)} of {len(rows)}")
    return active[:5]  # Sample only
Problem

Direct API calls force all intermediate data through the model's context window, consuming 150K+ tokens in data-heavy workflows.

Solution

Agent writes code that executes in sandbox; data processing happens in execution environment with only results returning to context.

When to use
  • Data-heavy workflows (spreadsheets, databases, logs)
  • Multi-step transformations or aggregations
  • Cost-sensitive applications where token usage matters
Trade-offs

Pros

  • Dramatic token reduction (150K to 2K reported)
  • Lower latency (fewer context API calls)
  • Natural fit for data processing tasks

Cons

  • Requires secure code execution infrastructure
  • Agents must write correct code
  • Debugging errors happens in execution, not context
View Original →

Visual AI Multimodal Integration

Tool Use
Flow
         ┌─────────┐
Text ────┤         │
         │ Multi   ├──▶ Cross-Modal ──▶ Solution
Image ───┤ Modal   │    Reasoning
         │ Model   │
Video ───┤         │
         └─────────┘

Image ──▶ [OCR] + [Objects] + [Spatial] ──▶ Understanding
flowchart LR A[Text Query] --> M[Multimodal LLM] B[Image] --> M C[Video] --> M M --> D[Object Detection] M --> E[OCR] M --> F[Spatial Understanding] D & E & F --> G[Combined Solution]
Example
class VisualAIAgent:
    async def process(self, task, image):
        analysis = await self.mm_llm.analyze(
            prompt=f"Task: {task}",
            image=image
        )
        # Extract: objects, OCR text, spatial info
        return await self.solve_with_visual(analysis)
Problem

Text-only agents miss critical visual information in images, videos, diagrams, and UI screenshots.

Solution

Integrate multimodal models (LMMs) to accept visual inputs, extract information, and combine with text for cross-modal reasoning.

When to use
  • Tasks involve images or screenshots
  • UI debugging or visual analysis
  • Document/chart understanding
Trade-offs

Pros

  • Enables new visual task categories
  • More natural show-not-tell interaction

Cons

  • Higher computational cost
  • Privacy concerns with visual data
View Original →

Abstracted Code Representation for Review

UX & Collaboration
Flow
AI Code ──▶ Abstractor ──▶ Pseudocode/Summary
   │                              │
   │         ┌────────────────────┘
   │         ▼
   │    Human Review (Intent)
   │         │
   └─────────┼──▶ Verified Code
             ▼
    "Sort changed: O(n²) → O(n log n)"
flowchart LR A[AI Generated Code] --> B[Abstraction Layer] B --> C[Pseudocode/Summary] C --> D[Human Reviews Intent] D --> E{Approved?} E -->|Yes| F[Apply Code Changes] E -->|No| G[Request Revisions]
Example
# Instead of reviewing 50 lines of code:
review = abstractor.summarize(diff)
# Output: "Changed user_list sorting from
#          bubble sort to quicksort.
#          Tests maintained."
human.verify_intent(review)  # Much faster!
Problem

Reviewing AI-generated code line by line is tedious and error-prone, while humans mostly care about intent and logic

Solution

Present an abstracted representation (pseudocode, intent summary) with a guarantee that it maps exactly to the actual code changes
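
An abstractor sketch: one LLM call produces the intent summary while the diff's hunk headers are kept alongside it, preserving the mapping back to concrete lines. llm is a placeholder for any completion function.

def abstract_diff(diff: str, llm) -> dict:
    summary = llm(
        "Summarize the INTENT of this diff in at most 3 bullet points, "
        "noting algorithmic or behavioral changes only:\n\n" + diff)
    hunks = [l for l in diff.splitlines() if l.startswith("@@")]
    changed = sum(1 for l in diff.splitlines()
                  if l.startswith(("+", "-"))
                  and not l.startswith(("+++", "---")))
    return {"summary": summary,      # what the human reviews
            "hunks": hunks,          # traceability to real changes
            "loc_changed": changed}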

When to use
  • Large volumes of AI-generated code
  • Intent verification matters more than syntax checking
  • Trusted generation pipelines
Trade-offs

Pros

  • Accelerates the verification process
  • Lets reviewers focus on conceptual correctness

Cons

  • Abstraction accuracy must be guaranteed
  • May miss low-level bugs
View Original →

AI-Accelerated Learning and Skill Development

UX & Collaboration
Flow
     ┌─────────────────────────────────────────────┐
     │           Fast Learning Loop                │
     └─────────────────────────────────────────────┘
              ┌──────────┐
              │  Try it  │
              └────┬─────┘
                   ▼
     ┌─────────────────────────┐
     │  AI Feedback / Explain  │◀──┐
     └───────────┬─────────────┘   │
                 ▼                 │
          ┌────────────┐           │
          │Learn & Fix │───────────┘
          └────────────┘
                 │
                 ▼
           [Skill++]
flowchart TD A[Try Code] --> B[AI Feedback] B --> C[Learn from Mistakes] C --> D[Improve & Iterate] D --> A D --> E[Skill Development]
Example
# Ask AI to explain unfamiliar code
ai.explain("What does this regex do?")
# AI explains step-by-step

# Learn from AI suggestions
ai.review(my_code)
# "Consider using list comprehension
#  instead of this loop for clarity"
Problem

Developing a 'feel' for code quality and acquiring skills requires years of experience and mentoring - an especially slow process for junior developers

Solution

Use AI agents as interactive learning tools, accelerating skill acquisition through fast iteration, learning from mistakes, and observing best practices

When to use
  • Learning a new language/framework
  • Building code-quality intuition quickly
  • Self-teaching without a mentor
  • Experimenting with different approaches
Trade-offs

Pros

  • Fast feedback loops accelerate learning
  • Always-available tutor
  • Less fear of experimentation

Cons

  • Learning quality depends on AI code quality
  • Risk of surface-level learning without deep understanding
View Original →

Chain-of-Thought Monitoring & Interruption

UX & Collaboration
Flow
Agent: "I'll modify auth.ts..."
   │
   ▼
[Start file read] ──▶ Dev watching
                         │
                    INTERRUPT!
                         │
                         ▼
Dev: "Use oauth.ts instead"
   │
   ▼
Agent: [Read oauth.ts] ──▶ Continue
   │
   ▼
Correction within first tool call
sequenceDiagram participant Dev participant Agent participant Tools Agent->>Dev: Display: "I'll modify auth.ts..." Agent->>Tools: Start file read Dev->>Agent: INTERRUPT! Wrong file Dev->>Agent: "Use oauth.ts instead" Agent->>Tools: Read oauth.ts Agent->>Dev: Display updated reasoning
Example
# Real-time reasoning visibility
stream_reasoning_to_ui(agent.thoughts)

# Low-friction interruption
if user_pressed_escape():
    agent.pause()
    context = user.get_correction()
    agent.inject_context(context)
    agent.resume()
# Partial work preserved on interrupt
Problem

Agents can pursue misguided reasoning paths for extended periods. By the time developers realize it's wrong, significant time and tokens are wasted.

Solution

Real-time surveillance of agent reasoning with low-friction interrupt capability to redirect before completing flawed sequences.

When to use
  • Complex refactoring where wrong file choices are costly
  • High-stakes operations (database migrations, API changes)
  • Agent might misinterpret ambiguous requirements
Trade-offs

Pros

  • Prevents wasted time on fundamentally wrong approaches
  • Maximizes value from expensive model calls
  • Enables collaborative human-AI problem solving

Cons

  • Requires active human attention (not autonomous)
  • Can interrupt productive exploration if triggered prematurely
  • Adds cognitive load to monitor reasoning
View Original →

Democratization of Tooling via Agents

UX & Collaboration
Flow
Non-Dev ──▶ "Make me a dashboard" ──▶ AI Agent ──▶ Code
   │                                       │
   ▼                                       ▼
Iterate ◀── "Add filter for date" ◀─── Working Tool
flowchart LR U[Non-Dev User] -->|Natural Language| A[AI Agent] A -->|Generate| C[Code/Tool] C -->|Review| U U -->|Refine| A
Example
# Sales team member prompt:
"Create a dashboard that shows my
 weekly pipeline from Salesforce,
 with filters by deal stage"

# Agent generates: dashboard.py
# User iterates: "Add export to CSV"
Problem

Non-engineering roles (sales, marketing, ops) need custom tools but lack programming skills to build them

Solution

AI agent translates natural language requests into code, enabling domain experts to create their own tools

When to use
  • Non-developers need custom tools
  • Simple dashboard/script creation
  • Quick bug fixes to existing code
  • Domain experts want self-automation
Trade-offs

Pros

  • Democratizes software access
  • Domain expertise directly applied

Cons

  • Limited for complex systems
  • Requires quality/security review
View Original →

Human-in-the-Loop Approval

Collaboration
Flow
                Low Risk
              ┌────────────────────────────────▶ Execute
              │
Agent ──▶ [Risk?]
              │
              │ High Risk        ┌─────┐
              └───────▶ Human ──▶│ Y/N │──┬──▶ Execute
                                 └─────┘  │
                                          └──▶ Abort / Adapt

Gate:  [auto]────────────●────────────[manual]
                     threshold
sequenceDiagram participant A as Agent participant F as Framework participant S as Slack participant H as Human A->>F: DROP table request F->>F: Classify HIGH RISK F->>S: Request approval S->>H: Notification alt Approved H->>S: Approve S->>F: Granted F->>A: Execute else Rejected H->>S: Reject + reason S->>F: Denied F->>A: Find alternative end
Example
from humanlayer import HumanLayer
hl = HumanLayer()

@hl.require_approval(channel="slack")
def delete_user_data(user_id: str):
    return db.users.delete(user_id)

# Execution pauses until human approves
delete_user_data("user_123")
Problem

Auto-executing risky operations (DB deletion, deployments) creates safety/compliance problems; blocking every operation defeats the point of automation

Solution

Insert human approval gates only for high-risk operations: safe operations run automatically, risky ones request approval via Slack or similar channels

When to use
  • DB: DELETE, DROP, ALTER
  • API: payments, email, webhooks
  • System: firewall, permission changes
  • Compliance: GDPR, HIPAA, SOC2
Trade-offs

Pros

  • Safe automation
  • Audit trail

Cons

  • Waiting on human response
  • Approval fatigue
View Original →

Human-in-the-Loop Approval Framework

UX & Collaboration
Flow
Agent ──▶ "DROP TABLE" ──▶ Framework
                              │
                    [HIGH RISK DETECTED]
                              │
                              ▼
                         Slack #ops
                    ┌─────────────────┐
                    │ [Approve] [Deny]│
                    └─────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
         Execute + Log                   Agent Adapts
sequenceDiagram Agent->>Framework: DROP old_users Framework->>Slack: Request approval Slack->>Human: [Approve] [Reject] Human->>Slack: Approve Slack->>Framework: Granted Framework->>Agent: Execute Framework->>Log: Record
Example
from humanlayer import HumanLayer
hl = HumanLayer()

@hl.require_approval(channel="slack")
def delete_user_data(user_id: str):
    """Requires human approval before execution"""
    return db.users.delete(user_id)
# Pauses for Slack approval button click
Problem

Autonomous agents need to execute high-risk operations (DB changes, deployments), but unsupervised execution creates unacceptable safety risks.

Solution

Insert human approval gates for high-risk functions via Slack/email/SMS, maintaining agent autonomy for safe operations.

When to use
  • Production database operations (DELETE, DROP)
  • External API calls with side effects
  • Compliance-sensitive operations
Trade-offs

Pros

  • Safe autonomous risky operations
  • Lightweight Slack integration

Cons

  • Requires human availability
  • Risk of approval fatigue
View Original →

Latent Demand Product Discovery

UX & Collaboration
Flow
Hackable     Power Users      Analytics
Product  ──▶  "Abuse"    ──▶   Pattern   ──▶ Productize
             Features         Detection

Example: Groups → 40% buy/sell → Marketplace
flowchart LR A[Extension APIs] --> B[Power User Hacks] B --> C[Pattern Detection] C --> D[New Feature] D --> E[All Users]
Example
# Monitor creative usage patterns
if (slash_commands.custom_count > 100 and
    usage.pattern == "notification"):
    # Many users built this - productize it
    roadmap.add("Built-in notifications")
Problem

Difficult to predict which features have real product-market fit before significant engineering investment.

Solution

Build hackable products, observe how power users repurpose features, then productize validated demand.

When to use
  • Building extensible platforms
  • Feature prioritization
  • New product exploration
Trade-offs

Pros

  • Behavior-validated demand
  • Reduced risk

Cons

  • Extension infra needed
  • Power users != mainstream
View Original →

Spectrum of Control / Blended Initiative

UX & Collaboration
Flow
Low Autonomy          Medium              High              Async
    │                    │                   │                  │
    ▼                    ▼                   ▼                  ▼
[Tab Complete] ──▶ [Cmd-K Edit] ──▶ [Agent Mode] ──▶ [Background Agent]
  (inline)        (region/file)    (multi-file)       (full PR)
flowchart LR A[Tab Completion] --> B[Command K] B --> C[Agent Feature] C --> D[Background Agent] A -.->|Low| A D -.->|High Autonomy| D
Example
# User chooses autonomy level:
cursor.tab()       # inline assist
cursor.cmd_k()     # edit region
cursor.agent()     # multi-file task
cursor.background()# async full PR
Problem

One-size-fits-all agent autonomy doesn't fit varying task complexity or user familiarity

Solution

Provide a spectrum of control modes from low (tab-complete) to high (background agent) that users can switch between

When to use
  • IDE/editor integrations
  • Varying task complexity within one session
  • Users with different comfort levels
Trade-offs

Pros

  • Flexible for all task sizes
  • User maintains desired control level

Cons

  • UX complexity with multiple modes
  • Learning curve for mode switching
View Original →

Team-Shared Agent Configuration

UX & Collaboration
Flow
                    ┌─────────────────────────┐
                    │   .claude/settings.json │
                    │   (version controlled)  │
                    └───────────┬─────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
         ▼                      ▼                      ▼
    [Dev A] ──▶ git pull ──▶ [Dev B] ──▶ git pull ──▶ [Dev C]
         │                      │                      │
         └──────────────────────┴──────────────────────┘
                    Same Config Everywhere
flowchart TD R[settings.json in Git] --> A[Dev A] R --> B[Dev B] R --> C[Dev C] A --> |git pull| R B --> |git pull| R C --> |git pull| R
Example
// .claude/settings.json
{
  "permissions": {
    "pre_allowed": ["git add", "npm test"],
    "blocked_paths": [".env", "secrets/"]
  },
  "hooks": { "pre_commit": "./run_tests.sh" }
}
Problem

Independent agent configs per developer cause inconsistent behavior, permission friction, and duplicated effort across the team.

Solution

Store agent configuration in version control as code, enabling team-wide sharing via git pull.
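
A loader sketch matching the JSON schema above; the enforcement helper is an assumption about how pre_allowed and blocked_paths might be applied at command time.

import json
from pathlib import Path

def load_settings(repo_root: str) -> dict:
    return json.loads((Path(repo_root) / ".claude/settings.json").read_text())

def is_allowed(command: str, settings: dict) -> bool:
    perms = settings["permissions"]
    if any(p in command for p in perms["blocked_paths"]):
        return False                                   # e.g. touches .env
    return any(command.startswith(a) for a in perms["pre_allowed"])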

When to use
  • Multiple team members use AI agents
  • Consistent agent behavior is critical
  • New members need quick onboarding
Trade-offs

Pros

  • Consistent team experience
  • Faster onboarding

Cons

  • Less individual flexibility
  • Config sprawl over time
View Original →

Agent-Assisted Scaffolding

UX & Collaboration
Flow
Developer ──▶ "Create user API" ──▶ Agent
                                        │
              ┌─────────────────────────┘
              ▼
         Generated Files:
         ├── routes/user.ts
         ├── controllers/user.ts
         ├── models/user.ts
         └── tests/user.test.ts
              │
              ▼
         Developer: Implement Core Logic
flowchart TD A[High-level Description] --> B[Agent Scaffolding] B --> C[Routes] B --> D[Controllers] B --> E[Models] B --> F[Tests] C & D & E & F --> G[Developer Implements Logic]
Example
agent.scaffold(
    description="User profile API endpoint",
    framework="fastapi",
    include=["routes", "models", "tests"]
)
# Agent generates structure
# Developer fills in business logic
Problem

Starting a new feature requires writing repetitive boilerplate and foundation code, which consumes significant time

Solution

Use AI to generate the initial structure, files, and boilerplate from a high-level description, then focus on core logic

When to use
  • Developing a new feature or module
  • Repetitive project setup tasks
  • Consistent structure is needed
Trade-offs

Pros

  • Fast project kick-off
  • Consistent initial structure

Cons

  • May not match existing conventions
  • Generated code needs review
View Original →

Seamless Background-to-Foreground Handoff

UX & Collaboration
Flow
User: "Refactor X" ──▶ [Background Agent]
                              │
                              ▼
                       Proposes PR (90%)
                              │
           ┌──────────────────┴──────────────────┐
           ▼                                     ▼
     100% Correct ──▶ Done           90% Correct ──▶ Take Over
                                                       │
                                                       ▼
                                          User + Foreground Tools
                                                       │
                                                       ▼
                                                  Finalized PR
flowchart TD A[User Request] --> B[Background Agent] B --> C[Proposed PR] C --> D{Review} D -->|100%| E[Done] D -->|90%| F[Take Over] F --> G[Foreground Edit] G --> E
Example
# Background agent completes task
pr = background_agent.work("Refactor module X")

# User reviews and takes control if needed
if user.review(pr) != "approved":
    # Seamless handoff with context
    foreground = pr.take_control()
    foreground.edit_with_ai_assist()
    foreground.finalize()
Problem

Background agents can handle complex tasks autonomously but may achieve only 90% correctness. A clunky handoff process to human control negates the automation benefits.

Solution

Design systems allowing seamless transition from background agent work to foreground human control, preserving context so users can refine the remaining 10% efficiently.

When to use
  • Background agents producing near-complete work
  • Tasks requiring human finesse for completion
  • Developer workflows with autonomous PR generation
Trade-offs

Pros

  • Leverages autonomous processing power
  • Retains human control for final touches
  • Preserves context across transition

Cons

  • Requires careful UX design
  • Context handoff complexity
  • May create workflow interruptions
View Original →

Verbose Reasoning Transparency

UX & Collaboration
Flow
User ──▶ Complex Task ──▶ Agent ──▶ [Standard Output]
                                         │
                               ┌─────────┘
                               │  Ctrl+R
                               ▼
                    ┌──────────────────────┐
                    │  VERBOSE VIEW        │
                    │  - Reasoning steps   │
                    │  - Tool selection    │
                    │  - Confidence scores │
                    │  - Raw tool outputs  │
                    └──────────────────────┘
sequenceDiagram User->>Agent: Task Agent-->>User: Standard output User->>UI: Ctrl+R UI-->>User: Reasoning steps UI-->>User: Tool rationale UI-->>User: Raw outputs
Example
# Verbose mode output example
{
    "interpretation": "User wants to refactor...",
    "tools_considered": ["grep", "ast-parse", "sed"],
    "tool_selected": "ast-parse",
    "reason": "Need semantic understanding",
    "confidence": 0.87
}
Problem

Complex agents behave like black boxes; users cannot understand why decisions were made or debug unexpected behavior.

Solution

Provide on-demand verbose mode (e.g., Ctrl+R) showing reasoning steps, tool selection rationale, and raw outputs.
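
A sketch of the underlying trace buffer: record reasoning events cheaply as they occur and render them only when the user toggles the verbose view. The field names mirror the JSON example above; the class itself is illustrative.

class ReasoningTrace:
    def __init__(self):
        self.events = []

    def record(self, **event) -> None:
        self.events.append(event)        # no rendering cost until requested

    def render(self) -> str:             # called on the verbose toggle
        return "\n".join(f"[{i}] {e}" for i, e in enumerate(self.events))

trace = ReasoningTrace()
trace.record(tool_selected="ast-parse",
             reason="Need semantic understanding",
             confidence=0.87)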

When to use
  • Debugging unexpected agent behavior
  • Building trust in agent decisions
  • Learning effective prompting
Trade-offs

Pros

  • Better debugging and trust
  • Helps users improve prompts

Cons

  • Information overload risk
  • May expose internal complexity
View Original →

Agent-Friendly Workflow Design

UX & Collaboration
Flow
Traditional ──▶ Agent-Friendly?
     │               │
     │     ┌─────────┴─────────┐
     │     │                   │
     ▼     ▼                   ▼
  Redesign:              Already OK
  ├── Clear Goals           │
  ├── Autonomy              │
  ├── Structured I/O        │
  └── Feedback Loops        │
          │                 │
          └────────┬────────┘
                   ▼
            Enhanced Performance
flowchart TD A[Traditional Workflow] --> B{Agent-Friendly?} B -->|No| C[Redesign] C --> D[Clear Goals] C --> E[Appropriate Autonomy] C --> F[Structured I/O] C --> G[Feedback Loops] D & E & F & G --> H[Optimized Workflow] B -->|Yes| H
Example
# Bad: Micromanaging
agent.do("Use React, then add useState...")

# Good: Goal-oriented with autonomy
agent.do(
    goal="Build login form",
    constraints=["use existing auth"],
    freedom="choose implementation"
)
Problem

Agents struggle when workflows are too rigid or when humans micromanage technical decisions

Solution

Design workflows with clear goals, appropriate autonomy, structured I/O, and iterative feedback loops

When to use
  • Integrating agents into existing processes
  • Agent performance is suboptimal
  • Building new human-AI workflows
Trade-offs

Pros

  • Maximizes agent capability
  • Better human-AI collaboration

Cons

  • Requires workflow redesign
  • May require building trust
View Original →