{"part1-setup.md": "# Part 1: Setup (Stop Fumbling With Installation)\n\n*From zero to working agent in under 5 minutes. Covers what the docs don't.*\n\n---\n\n## The Install\n\nOne command. That's it.\n\n### Linux / macOS / WSL2\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash\n```\n\n> **Security tip:** Piping scripts directly from the internet to bash executes them sight-unseen. If you prefer to inspect first:\n> ```bash\n> curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh -o install.sh\n> less install.sh   # Review the script\n> bash install.sh\n> ```\n\n> **Windows users:** Native Windows is not supported. Install [WSL2](https://learn.microsoft.com/en-us/windows/wsl/install) and run the command from inside WSL. It works perfectly.\n\n### What the Installer Does\n\nThe installer handles everything automatically:\n\n- Installs **uv** (fast Python package manager)\n- Installs **Python 3.11** via uv (no sudo needed)\n- Installs **Node.js v22** (for browser automation)\n- Installs **ripgrep** (fast file search) and **ffmpeg** (audio conversion)\n- Clones the Hermes repo\n- Sets up the virtual environment\n- Creates the global `hermes` command\n- Runs the setup wizard for LLM provider configuration\n\nThe only prerequisite is **Git**. Everything else is handled for you.\n\n### After Installation\n\n```bash\nsource ~/.bashrc   # or: source ~/.zshrc\nhermes             # Start chatting!\n```\n\n---\n\n## First-Run Configuration\n\nThe setup wizard (`hermes setup`) walks you through:\n\n### 1. Choose Your LLM Provider\n\n```bash\nhermes model\n```\n\nSupported providers:\n\n| Provider | Best For | Env Variable |\n|----------|----------|-------------|\n| Anthropic (Claude) | Highest quality, best at complex tasks | `ANTHROPIC_API_KEY` |\n| OpenAI (GPT-4.1/o3) | Strong tool use, fast | `OPENAI_API_KEY` |\n| OpenRouter | Access 100+ models from one key | `OPENROUTER_API_KEY` |\n| Cerebras | Fast inference, good for simple tasks | `CEREBRAS_API_KEY` |\n| Groq | Very fast, limited context | `GROQ_API_KEY` |\n| xAI (Grok) | Good balance of speed/quality | `XAI_API_KEY` |\n| Google (Gemini) | Huge context, cheap | `GEMINI_API_KEY` |\n\nYou can configure **multiple providers** with automatic fallback. If one goes down, Hermes switches to the next.\n\n### 2. Set Your API Keys\n\n```bash\nhermes auth\n```\n\nThis opens an interactive menu to add API keys for each provider. Keys are stored in `~/.hermes/.env` \u2014 never committed to git.\n\n> **Tip:** You can also set keys manually:\n> ```bash\n> echo \"ANTHROPIC_API_KEY=<your-key-here>\" >> ~/.hermes/.env\n> chmod 600 ~/.hermes/.env   # Restrict access to your user only\n> ```\n>\n> **Important:** Always run `chmod 600 ~/.hermes/.env` to prevent other users on the system from reading your API keys.\n\n### 3. Configure Toolsets\n\n```bash\nhermes tools\n```\n\nThis opens an interactive TUI to enable/disable tool categories:\n\n- **core** \u2014 File read/write, terminal, web search\n- **web** \u2014 Browser automation, web extraction\n- **browser** \u2014 Full browser control (requires Node.js)\n- **code** \u2014 Code execution sandbox\n- **delegate** \u2014 Sub-agent spawning for parallel work\n- **skills** \u2014 Skill discovery and creation\n- **memory** \u2014 Memory search and management\n\n> **Recommendation:** Enable `core`, `web`, `skills`, and `memory` at minimum. 
Add `browser` and `code` if you need automation or sandboxed execution.\n\n---\n\n## Key Config Options\n\nAfter initial setup, fine-tune with `hermes config set`:\n\n### Model Settings\n\n```bash\n# Set primary model\nhermes config set model anthropic/claude-sonnet-4-20250514\n\n# Set fallback model (used when primary is rate-limited)\nhermes config set fallback_models '[\"openrouter/anthropic/claude-sonnet-4-20250514\"]'\n```\n\n### Agent Behavior\n\n```bash\n# Max turns per conversation (default: 90)\nhermes config set agent.max_turns 90\n\n# Verbose mode: off, on, or full\nhermes config set agent.verbose off\n\n# Quiet mode (less terminal output)\nhermes config set agent.quiet_mode true\n```\n\n### Context Management\n\n```bash\n# Enable prompt caching (reduces cost on repeated context)\nhermes config set prompt_caching.enabled true\n\n# Context compression (auto-summarize old messages)\nhermes config set context_compression.enabled true\n```\n\n---\n\n## File Locations\n\nEverything lives under `~/.hermes/`:\n\n```\n~/.hermes/\n\u251c\u2500\u2500 config.yaml          # Main configuration\n\u251c\u2500\u2500 .env                 # API keys (never commit this)\n\u251c\u2500\u2500 SOUL.md             # Agent personality (injected every message)\n\u251c\u2500\u2500 memories/           # Long-term memory entries\n\u251c\u2500\u2500 skills/             # Skills (auto-discovered)\n\u251c\u2500\u2500 skins/              # CLI themes\n\u251c\u2500\u2500 audio_cache/        # TTS audio files\n\u251c\u2500\u2500 logs/               # Session logs\n\u2514\u2500\u2500 hermes-agent/       # Source code (git repo)\n```\n\n> **Important:** `SOUL.md` is injected into every message. Keep it under 1 KB. Every byte costs latency and tokens.\n\n---\n\n## Verify Your Setup\n\n```bash\n# Check everything is working\nhermes status\n\n# Quick test\nhermes chat -q \"Say hello and confirm you're working\"\n```\n\nExpected output: Hermes responds with a greeting, confirming the model connection, tool availability, and session initialization.\n\n---\n\n## Updating\n\n```bash\nhermes update\n```\n\nThis pulls the latest code, updates dependencies, migrates config, and restarts the gateway. Run it regularly \u2014 Hermes ships frequent improvements.\n\n---\n\n## What's Next\n\n- **Coming from OpenClaw?** \u2192 [Part 2: OpenClaw Migration](./part2-openclaw-migration.md)\n- **Want smarter memory?** \u2192 [Part 3: LightRAG Setup](./part3-lightrag-setup.md)\n- **Need mobile access?** \u2192 [Part 4: Telegram Setup](./part4-telegram-setup.md)\n- **Want the agent to self-improve?** \u2192 [Part 5: On-the-Fly Skills](./part5-creating-skills.md)\n", "part6-context-compression.md": "# Part 6: Context Compression (Don't Lose Your Context Silently)\n\n*Long sessions degrade. Context compression fixes this \u2014 but only if it works correctly.*\n\n---\n\n## The Problem\n\nHermes injects context every message: memory, skills, tool results, conversation history. In long sessions, this grows until you hit the context window limit and the agent freezes or starts forgetting.\n\nContext compression automatically summarizes older messages to keep the context lean. But there's a bug in the default implementation that can silently drop context.\n\n## The Bug\n\nIn `context_compressor.py`, when summarization fails (API timeout, model error, rate limit), the compressor **silently discards the messages it was trying to summarize** instead of preserving them. 
You lose context with no warning.\n\n**Symptoms:**\n- Agent suddenly \"forgets\" something it knew 20 messages ago\n- Long sessions degrade faster than expected\n- No error messages \u2014 it just quietly loses data\n\n## The Fix\n\nFind your `context_compressor.py`:\n\n```bash\nfind ~/.hermes -name \"context_compressor.py\" -type f\n```\n\nLook for the compression function. The bug is in the error handling around the summarization call. It should look something like:\n\n```python\n# BROKEN \u2014 silently drops context on failure\ntry:\n    summary = await summarize_messages(messages_to_compress)\n    compressed_context = summary\nexcept Exception:\n    compressed_context = \"\"  # THIS IS THE BUG \u2014 empty string = data lost\n```\n\nFix it by **aborting compression on failure** instead:\n\n```python\n# FIXED \u2014 preserves original context if compression fails\ntry:\n    summary = await summarize_messages(messages_to_compress)\n    compressed_context = summary\nexcept Exception as e:\n    logger.warning(f\"Context compression failed: {e}, preserving original context\")\n    return original_context  # Don't compress, don't lose data\n```\n\n**The rule:** If compression can't succeed, keep the uncompressed context. A slower response is better than a wrong one.\n\n## When Compression Triggers\n\n- Default: when context reaches ~80% of the model's window\n- Configurable in `~/.hermes/.env`:\n\n```bash\n# Percentage of context window to trigger compression (default: 80)\nCONTEXT_COMPRESSION_THRESHOLD=80\n\n# Minimum messages before compression activates (default: 20)\nCONTEXT_COMPRESSION_MIN_MESSAGES=20\n```\n\n## Best Practices\n\n- **Let it compress.** Don't set the threshold to 99% \u2014 compression needs headroom to work.\n- **Monitor long sessions.** If the agent starts forgetting things mid-conversation, check if compression silently failed.\n- **Restart fresh for critical work.** If you're doing something important, start a new session rather than running on a 100-message compressed context.\n- **Use `session_search` to recall.** If you lost context to compression, `session_search` can find it in past transcripts.\n\n---\n\n*This bug affects all Hermes versions before the fix. Patch it immediately if you run long sessions.*\n", "part7-memory-system.md": "# Part 7: The Memory System (Three Tiers That Actually Work)\n\n*Hermes has three memory systems. Most people only know about one.*\n\n---\n\n## The Three Tiers\n\n| Tool | What It Does | When It Fires | Cost |\n|------|-------------|---------------|------|\n| `memory` | Persistent facts across all sessions | User preferences, environment, lessons learned | Free (local) |\n| `session_search` | Search past conversation transcripts | \"What did we decide about X?\" or \"Remember when we...\" | Free (local) |\n| `skill_manage` | Procedural memory \u2014 reusable workflows | After fixing a bug, building something complex, or discovering a new approach | Free (local) |\n\nAll three are **local-first**. No API calls, no embedding costs. 
They use SQLite and full-text search.\n\n## Tier 1: memory (Persistent Facts)\n\nThe `memory` tool saves durable facts that get injected into every future session.\n\n**What to save:**\n- User preferences (\"Terp hates manual steps\")\n- Environment details (\"5090 PC at 192.168.1.67, port 11434\")\n- Tool quirks (\"PowerShell needs -Encoding utf8 for Unicode files\")\n- Stable conventions (\"Use OnlyTerp for GitHub repos\")\n\n**What NOT to save:**\n- Task progress (use session_search to recall)\n- Temporary state (TODO lists, current status)\n- Anything that changes frequently\n\n**Format:** Keep entries under 2000 chars total. Be compact. These get injected into every message.\n\n```python\n# Good\nmemory(action=\"add\", target=\"memory\", content=\"OpenClaw migrated. LightRAG: 4528 entities, float16 vectors (4096d). Telegram bot 8624585264, group -5216536760.\")\n\n# Bad \u2014 too verbose, task-specific\nmemory(action=\"add\", target=\"memory\", content=\"Today I worked on the lead gen pipeline. First I fixed the API key issue, then I updated the quality gate scoring to use a new algorithm, then I tested with 50 leads...\")\n```\n\n## Tier 2: session_search (Conversation Recall)\n\n`session_search` searches your entire conversation history across all past sessions.\n\n**Two modes:**\n\n```python\n# Browse recent sessions (no cost, instant)\nsession_search()\n\n# Search for specific topics (uses LLM to summarize)\nsession_search(query=\"hermes optimization guide github\")\nsession_search(query=\"LightRAG setup OR embedding model\")\n```\n\n**When to use it:**\n- User says \"we did this before\" or \"remember when\"\n- You suspect relevant cross-session context exists\n- You want to check if you've solved a similar problem before\n\n**Key insight:** session_search is your recency backup. memory is for facts that will still matter in 6 months. If a fact is only relevant to the current project phase, session_search is better than bloating memory.\n\n## Tier 3: skill_manage (Procedural Memory)\n\n`skill_manage` saves reusable workflows as skills. This is how Hermes learns.\n\n**When to create a skill:**\n- After a complex task (5+ tool calls)\n- After fixing a tricky error\n- After discovering a non-trivial workflow\n- When the user asks you to remember a procedure\n\n```python\n# Create a new skill\nskill_manage(\n    action=\"create\",\n    name=\"supabase-migrate\",\n    content=\"---\\ndescription: Run Supabase SQL migrations via Management API\\n---\\n\\n# Supabase Migration\\n\\n1. Read the SQL file from supabase/migrations/\\n2. 
Use Python http.client to POST to Management API...\",\n    category=\"devops\"\n)\n\n# Patch an existing skill when you find issues\nskill_manage(\n    action=\"patch\",\n    name=\"supabase-migrate\",\n    old_string=\"Use requests.post\",\n    new_string=\"Use http.client (requests has timeout issues with Supabase)\"\n)\n```\n\n**Key rules:**\n- Skills must have trigger conditions \u2014 when should this skill load?\n- Skills must have numbered steps \u2014 what exactly to do?\n- Skills must have pitfalls \u2014 what can go wrong?\n- Patch skills immediately when you find issues \u2014 don't wait to be asked\n\n## How They Work Together\n\n```\nUser asks a question\n    \u2193\nmemory injects persistent context (user prefs, environment)\n    \u2193\nsession_search recalls relevant past conversations (if needed)\n    \u2193\nskill_manage loads procedural knowledge (if triggered)\n    \u2193\nAgent has full context \u2192 better answer\n```\n\n**The hierarchy:** memory is always on. session_search is on-demand. skill_manage is triggered by task matching.\n\n## Anti-Patterns\n\n| Don't Do This | Do This Instead |\n|--------------|-----------------|\n| Save task progress to memory | Use session_search to recall |\n| Create a skill for a one-off task | Just do it, skip the skill |\n| Dump raw data into memory | Save compact, durable facts |\n| Search session_search for everything | Check memory first, it's free and instant |\n| Let skills go stale | Patch them immediately when outdated |\n\n---\n\n*Memory is what separates a stateless chatbot from an actual agent. Use all three tiers.*\n", "part8-subagent-patterns.md": "# Part 8: Subagent & Orchestrator Patterns (Stop Doing Everything Yourself)\n\n*One agent can't do everything well. Delegate.*\n\n---\n\n## The Core Idea\n\nHermes is the orchestrator. It decides what to do, then delegates execution to specialized subagents. Each subagent runs in isolation \u2014 own context, own tools, own session.\n\n**When to delegate:**\n- Reasoning-heavy tasks (debugging, code review, research)\n- Tasks that would flood your context with intermediate data\n- Parallel independent workstreams (research A and B simultaneously)\n\n**When NOT to delegate:**\n- Single tool calls (just call the tool directly)\n- Simple tasks that need 1-2 steps\n- Tasks needing user interaction (subagents can't use clarify)\n\n## delegate_task \u2014 The Main Tool\n\n```python\n# Single task\ndelegate_task(\n    goal=\"Debug why the API returns 403 on POST requests\",\n    context=\"File: src/api/client.py. Error started after adding auth headers. Token is valid.\",\n    toolsets=[\"terminal\", \"file\"]\n)\n\n# Parallel batch (up to 3)\ndelegate_task(\n    tasks=[\n        {\n            \"goal\": \"Research LightRAG alternatives for graph RAG\",\n            \"toolsets\": [\"web\"]\n        },\n        {\n            \"goal\": \"Benchmark current LightRAG search latency\",\n            \"context\": \"Path: ~/.hermes/skills/research/lightrag/\",\n            \"toolsets\": [\"terminal\"]\n        },\n        {\n            \"goal\": \"Check if our embedding model has a newer version\",\n            \"toolsets\": [\"web\"]\n        }\n    ]\n)\n```\n\n**Key details:**\n- Subagents have NO memory of your conversation. Pass everything via `context`.\n- Results come back as a summary. Intermediate tool calls never enter your context.\n- Each subagent gets its own terminal session.\n- Default max iterations: 50. 
Lower it for simple tasks (`max_iterations=10`).\n\n## The CEO/COO/Worker Pattern\n\n```\nCEO (you + Hermes main agent)\n  \u2502\n  \u251c\u2500\u2500 COO (delegate_task for planning/review)\n  \u2502     \u2514\u2500\u2500 Returns: strategy, plan, review notes\n  \u2502\n  \u2514\u2500\u2500 Workers (delegate_task for execution)\n        \u251c\u2500\u2500 Worker 1: Build feature A\n        \u251c\u2500\u2500 Worker 2: Build feature B\n        \u2514\u2500\u2500 Worker 3: Write tests\n```\n\n**CEO:** Makes decisions, assigns tasks, reviews results.\n**COO:** Researches, plans, reviews code. One subagent, reasoning-heavy.\n**Workers:** Execute specific tasks in parallel. Multiple subagents, action-heavy.\n\n## ACP Subagents (Claude Code, Codex)\n\nFor coding tasks, delegate to dedicated coding agents via ACP:\n\n```python\n# Claude Code\ndelegate_task(\n    goal=\"Implement the user settings page with React\",\n    context=\"Repo at /home/terp/my-app. Use existing component library in src/components/\",\n    acp_command=\"claude\",\n    acp_args=[\"--acp\", \"--stdio\", \"--model\", \"claude-sonnet-4-20250514\"]\n)\n\n# Codex\ndelegate_task(\n    goal=\"Refactor database layer to use connection pooling\",\n    context=\"File: src/db/connection.py. Currently opens new connection per query.\",\n    acp_command=\"codex\"\n)\n```\n\n**When to use ACP vs regular delegate_task:**\n- ACP agents (Claude Code, Codex) are better at coding \u2014 tool calling, file editing, running tests\n- Regular delegate_task is better for research, analysis, and multi-tool workflows\n- ACP agents are faster for single-file edits\n\n## SWE-1.6 via Windsurf Cascade\n\nFor complex coding tasks, use Windsurf's SWE-1.6:\n\n```python\nimport subprocess\nfrom pathlib import Path\n\n# Send a coding task to Windsurf Cascade\n# Requires Windsurf running with --remote-debugging-port=9222\n# Expand ~ explicitly; subprocess.run does not go through a shell\nscript = Path(\"~/.hermes/skills/autonomous-ai-agents/windsurf-cascade/scripts/cascade_send.py\").expanduser()\nsubprocess.run([\n    \"python\",\n    str(script),\n    \"Build a React dashboard with real-time WebSocket updates\"\n])\n```\n\n**Orchestrator pattern:** Hermes handles APIs, data, decisions. SWE-1.6 handles UI, components, bug fixes. Each does what it's best at.\n\n## Parallelization Rules\n\n| Scenario | Approach |\n|----------|----------|\n| 3 independent research tasks | Batch `delegate_task` with `tasks` array |\n| 1 complex coding task | ACP subagent (Claude Code or Codex) |\n| Multiple code changes in different files | SWE-1.6 via Cascade |\n| Single API call | Just call the tool, don't delegate |\n| Task needs user input | Do it yourself, can't delegate interactive work |\n\n## Common Mistakes\n\n| Mistake | Fix |\n|---------|-----|\n| Delegating a single tool call | Just call the tool directly |\n| Not passing enough context to subagent | Subagents know nothing \u2014 pass file paths, error messages, constraints |\n| Delegating sequential tasks in parallel | If task B depends on task A's output, run them sequentially |\n| Setting max_iterations too high | Simple tasks don't need 50 iterations \u2014 use 10-15 |\n| Forgetting subagents can't use clarify | If a task might need clarification, do it yourself |\n\n---\n\n## What's Next (April 2026 Additions)\n\nThe subagent system has grown rapidly. Continue with:\n\n- **[Part 18: Delegating to Coding Agents](./part18-coding-agents.md)** \u2014 the OpenClaw pattern (thread-bound Telegram topics \u2192 persistent Claude Code / Codex / Gemini CLI runtimes). 
Print-mode vs interactive, ACP-as-server, git branch isolation, routing rules.\n- **[Part 17: MCP Servers](./part17-mcp-servers.md)** \u2014 give subagents tools that stay in sync across Hermes, Claude Code, and Cursor.\n- **[Part 21: Remote Sandboxes](./part21-remote-sandboxes.md)** \u2014 run your subagents on Modal/Daytona/SSH so a $5 VPS can drive a beefy workspace.\n- **[Part 20: Observability](./part20-observability.md)** \u2014 trace every subagent call in Langfuse, with per-skill cost breakdown.\n\n---\n\n*The orchestrator pattern is how you scale. One brain, many hands.*\n", "part9-custom-models.md": "# Part 9: Custom Model Providers (Use Any Model You Want)\n\n*Hermes supports any OpenAI-compatible API, plus first-class native adapters for Nous Portal, xAI, Xiaomi MiMo, Kimi/Moonshot, z.ai/GLM, MiniMax, Arcee, Hugging Face, Cerebras, Groq, Fireworks, and Ollama. OAuth providers landing post-v0.10 add Gemini CLI (free tier: 1500 req/day), Qwen, and Claude Code Pro/Max. This is the up-to-date (April 17, 2026) cheat sheet.*\n\n> **What's new since v0.10.0** \u2014 [Gemini CLI OAuth inference provider](https://github.com/NousResearch/hermes-agent/pull/11270) (#11270), [Gemini TTS provider](https://github.com/NousResearch/hermes-agent/pull/10922), [multi-model FAL image gen](https://github.com/NousResearch/hermes-agent/pull/11265), [GLM 5.1 in OpenCode Go catalogs](https://github.com/NousResearch/hermes-agent/pull/11269), [Azure OpenAI GPT-5.x on chat/completions](https://github.com/NousResearch/hermes-agent/pull/10086), plus [TCP keepalives](https://github.com/NousResearch/hermes-agent/pull/11277) that detect dead provider connections before you notice the hang. All shipping on `main`, targeted for v0.11.\n\n---\n\n## Native Adapters vs Generic OpenAI-Compatible\n\nAs of v0.10.0 (April 2026), Hermes ships **native adapters** for a growing list of providers. Native adapters know about provider-specific features that a generic OpenAI-compatible wrapper can't:\n\n| Provider | Native adapter? | Notable feature |\n|----------|-----------------|-----------------|\n| **Nous Portal** | Yes | Auth via `hermes model` (no bare API key). Unlocks the [Tool Gateway](./part13-tool-gateway.md). |\n| **Anthropic** | Yes | Native prompt caching, extended thinking, `/fast` priority tier |\n| **OpenAI** | Yes | Native responses API, reasoning effort levels, `/fast` priority tier |\n| **xAI (Grok)** | **Yes, new in v0.10** | Native **live X/Twitter search** as a built-in tool |\n| **Xiaomi MiMo** | **Yes, new in v0.10** | Native reasoning modes (`low`/`medium`/`high`) exposed as config |\n| **Kimi / Moonshot** | Yes | 200K+ context, great for LightRAG entity extraction (see [Part 3](#part-3-lightrag--graph-rag-that-actually-works)) |\n| **z.ai / GLM** | Yes | **GLM 5.1** (added to OpenCode Go catalogs [#11269](https://github.com/NousResearch/hermes-agent/pull/11269)) \u2014 currently strongest open-weights model for tool use |\n| **Google Gemini (direct)** | Yes | 1M context; native prompt caching on Gemini 2.5 Pro |\n| **Google Gemini CLI (OAuth)** | **Yes, new post-v0.10** | OAuth via `gemini auth` \u2014 **1500 requests/day free tier**. 
[#11270](https://github.com/NousResearch/hermes-agent/pull/11270) |\n| **MiniMax** | Yes | M2.7 \u2014 balanced speed/quality; native streaming |\n| **Arcee** | Yes | AFM-4.5 function-calling specialist, cheap |\n| **Cerebras** | Yes | 2000+ tok/s inference |\n| **Groq** | Yes | Fast hosted Llama / Qwen |\n| **Qwen (OAuth)** | Yes | OAuth via portal-request flow, free-tier available |\n| **Fireworks** | Yes | Qwen3-Embedding-8B (recommended for LightRAG) |\n| **Azure OpenAI** | Yes | GPT-5.x now via `/chat/completions` (was `/responses` only) [#10086](https://github.com/NousResearch/hermes-agent/pull/10086) |\n| **Hugging Face** | Yes | Any TGI / TEI endpoint (self-hosted or Inference Endpoints) |\n| **OpenRouter** | Yes | Pass-through to 200+ models; respects native adapter quirks when downstream is one |\n| **Ollama** (local) | Generic | OpenAI-compatible, zero auth |\n| **Anything else** | Generic | Any OpenAI-compatible `base_url` |\n\nPick the native adapter when one exists \u2014 you get the provider-specific features for free. Fall back to the generic OpenAI-compatible path only for endpoints that don't have a native adapter yet.\n\n### Flagship Model Cheat Sheet (April 17, 2026)\n\nFor the \"which model should I pick right now?\" question, this is the current state of the world:\n\n| Model | Provider | Input / Output ($/MTok) | Context | Best for |\n|-------|----------|------------------------|---------|----------|\n| **Claude Sonnet 4.5** | Anthropic | $3 / $15 | 200K | Default for coding, refactor, multi-step reasoning |\n| **Claude Opus 4** | Anthropic | $15 / $75 | 200K | The hardest reasoning only; $15/MTok stings fast |\n| **Claude Mythos** (Cyber) | Anthropic | Invite-only | 200K | Security research \u2014 vulnerability discovery, malware triage |\n| **GPT-5.4** | OpenAI | $5 / $20 | 256K | Reasoning heavy-lift, agentic long chains |\n| **GPT-5.4-Cyber** | OpenAI | Trusted Access only | 256K | Defensive cybersec workflows, reverse engineering |\n| **GPT-5.4 Mini** | OpenAI | $0.60 / $4.80 | 256K | Cheap reasoning fallback |\n| **Gemini 2.5 Pro** | Google / OpenRouter | $1.25 / $10 | 1M | Long-context, whole-repo reads, research synthesis |\n| **Gemini 3 Flash Preview** | Google / OpenRouter | $0.50 / $3 | 1M | Fast agentic reasoning with 1M window |\n| **Gemini 2.5 Flash** | Google / OpenRouter | $0.30 / $2.50 | 1M | Classification, triage, bulk extraction |\n| **Kimi K2.5** | Moonshot | ~$0.15 / $2.50 | 200K | Best price/quality for coding in 2026 |\n| **GLM 5.1** | z.ai | ~$0.20 / $2 | 128K | Strongest open-weights tool use |\n| **xAI Grok 4** | xAI | $3 / $15 | 256K | Native live-X search; current-events questions |\n| **Xiaomi MiMo** | Xiaomi | $0.50 / $3 | 200K | Three-mode reasoning toggle (low/med/high) |\n| **MiniMax M2.7** | MiniMax | $10/mo flat | 256K | Flat-rate users doing bulk work |\n| **Cerebras Llama 3.3 70B** | Cerebras | $0.60 / $0.60 | 128K | 3000+ tok/s \u2014 interactive chat, fast classification |\n| **Local Nemotron 30B** | Ollama | Free | 128K | Privacy, offline, embedding, session search |\n\n> Prices are current per-provider retail as of April 17, 2026. Batch and prompt-caching discounts are not included \u2014 stack them via [Part 20](./part20-observability.md#rule-2-prompt-caching-is-free-money).\n\n---\n\n### Nous Portal \u2014 OAuth, Not an API Key\n\nNous Portal uses an OAuth flow via `hermes model` instead of a bare API key. After auth, credentials live in `~/.hermes/auth.json` (never in `.env`). 
Re-auth when it expires:\n\n```bash\nhermes model\n# Pick \"Nous Portal\" \u2192 complete the browser OAuth flow\n```\n\nIf you're on a paid subscription, the setup also offers to enable the [Tool Gateway](./part13-tool-gateway.md) \u2014 web search, image gen, TTS, and browser automation through your subscription, no extra keys needed.\n\n### Gemini CLI OAuth \u2014 Free 1500 req/day\n\nIf you have a Google account, skip the API key entirely and sign in with OAuth:\n\n```bash\nnpm install -g @google/gemini-cli\ngemini auth\nhermes model\n# Pick \"Gemini CLI (OAuth)\" \u2014 Hermes detects the logged-in session\n```\n\nHermes drives Gemini via the local CLI. You get 1500 requests/day on the free tier \u2014 plenty for exploration, classification, and Gemini's killer long-context reads. Merged in [#11270](https://github.com/NousResearch/hermes-agent/pull/11270) (April 16, 2026).\n\n### Gemini TTS \u2014 7th Voice Provider\n\nAs of [#10922](https://github.com/NousResearch/hermes-agent/issues/10922) (merged April 16), Gemini joins Edge, ElevenLabs, OpenAI, MiniMax, Mistral, and NeuTTS as a TTS backend:\n\n```yaml\ntts:\n  gemini:\n    model: gemini-2.5-flash-preview-tts\n    voice: Kore\n```\n\n`GEMINI_API_KEY` or `GOOGLE_API_KEY` is enough. Output comes back as PCM, wrapped in WAV natively (no extra deps), optionally converted to mp3/ogg via `ffmpeg`. Works for Telegram voice bubbles out of the box.\n\n---\n\n## config.yaml Structure\n\nModels are configured in `~/.hermes/config.yaml`:\n\n> **Security note:** Never put real API keys directly in `config.yaml`. Use environment variable references so keys stay in `~/.hermes/.env` (which should be `chmod 600` and never committed to git).\n\n```yaml\n# Default model\nmodel: claude-sonnet-4-20250514\nprovider: anthropic\n\n# Provider configurations\nproviders:\n  anthropic:\n    api_key: ${ANTHROPIC_API_KEY}\n\n  openai:\n    api_key: ${OPENAI_API_KEY}\n\n  xai:                                # Native adapter (v0.10+)\n    api_key: ${XAI_API_KEY}\n    live_search: true                 # Grok's live X/Twitter search\n\n  xiaomi:                             # Native adapter (v0.10+)\n    api_key: ${XIAOMI_API_KEY}\n    reasoning_mode: high              # low / medium / high\n\n  moonshot:                           # Kimi\n    api_key: ${MOONSHOT_API_KEY}\n\n  zai:                                # z.ai / GLM\n    api_key: ${ZAI_API_KEY}\n\n  minimax:\n    api_key: ${MINIMAX_API_KEY}\n\n  arcee:\n    api_key: ${ARCEE_API_KEY}\n\n  cerebras:\n    api_key: ${CEREBRAS_API_KEY}\n    base_url: https://api.cerebras.ai/v1\n\n  fireworks:\n    api_key: ${FIREWORKS_API_KEY}\n    base_url: https://api.fireworks.ai/inference/v1\n\n  local:\n    base_url: http://localhost:11434/v1\n    api_key: ollama  # Ollama doesn't require a real key\n```\n\n## Adding a Custom Provider\n\nAny provider that implements the OpenAI chat completions API works:\n\n```yaml\nproviders:\n  my-custom:\n    api_key: ${MY_CUSTOM_API_KEY}\n    base_url: https://api.your-provider.com/v1\n```\n\nAdd the actual key to your `.env` file:\n\n```bash\necho \"MY_CUSTOM_API_KEY=<your-key-here>\" >> ~/.hermes/.env\nchmod 600 ~/.hermes/.env\n```\n\nThen use it:\n\n```bash\nhermes --provider my-custom --model their-model-name\n```\n\n## Model Aliases (Quick Switching)\n\nAdd aliases to switch models without typing full names:\n\n```yaml\nmodel_aliases:\n  fast:\n    model: cerebras/llama-3.3-70b\n    provider: cerebras\n  smart:\n    model: claude-opus-4-20250514\n    provider: anthropic\n  local:\n  
  model: nemotron:latest\n    provider: local\n```\n\nUse in chat:\n\n```\n/model fast      # Switch to Cerebras Llama 70B\n/model smart     # Switch to Claude Opus\n/model local     # Switch to local Ollama model\n```\n\n## Provider Comparison (What We Actually Use)\n\n| Provider | Speed | Cost | Best For |\n|----------|-------|------|----------|\n| Cerebras | 3000+ tok/s | Cheap | Fast inference, bulk tasks, coding |\n| Anthropic | ~100 tok/s | Premium | Complex reasoning, long context |\n| OpenRouter | Varies | Varies | Model variety, fallback provider |\n| Fireworks | Fast | Cheap | Embeddings, specialized models |\n| Ollama (local) | Varies | Free | Privacy, offline, experimenting |\n\n**Our setup:** Cerebras for speed, Anthropic for quality, Ollama for local models and embeddings.\n\n## Routing Cheat Sheet by Task Type\n\nUse these as opinionated defaults, then tune with [Part 20's cost-routing playbook](./part20-observability.md#cost-routing-playbook-the-one-that-actually-saves-money):\n\n| Task | First choice | Fallback (cheaper) | Fallback (fastest) |\n|------|--------------|--------------------|--------------------|\n| Daily conversation | Claude Sonnet 4.5 | GLM 5.1 | Cerebras Llama 70B |\n| Coding delegation | Claude Code via Sonnet 4.5 | OpenCode + Kimi K2.5 | OpenCode + Cerebras |\n| Long-context reads (>200K) | Gemini 2.5 Pro | Gemini 2.5 Flash | \u2014 |\n| Classification / triage | Gemini 2.5 Flash | Cerebras Qwen3 32B | Arcee AFM-4.5 |\n| Reasoning (math, planning) | GPT-5.4 | Claude Opus 4 | GLM 5.1 |\n| Current events / live search | xAI Grok 4 | Gemini with grounding | \u2014 |\n| Embeddings (LightRAG) | Qwen3-Embedding-8B (Fireworks) | nomic-embed-text (Ollama) | OpenAI `text-embedding-3-small` |\n| TTS (Telegram voice) | OpenAI TTS via Tool Gateway | Gemini 2.5 Flash TTS | Edge TTS (free) |\n| Vision | Gemini 2.5 Flash | GPT-4o | Claude Sonnet 4.5 |\n\n---\n\n## Cerebras Gotchas\n\nCerebras is fast but has quirks:\n\n1. **No system prompt caching.** Every request re-sends the full system prompt. Keep it short.\n2. **Rate limits are per-minute, not per-request.** Batch carefully.\n3. **Some models don't support tool calling.** Check before using as the main agent model.\n4. **Streaming is fast but chunky.** Large responses come in big bursts, not smooth streams.\n\nConfig:\n\n```yaml\nproviders:\n  cerebras:\n    api_key: ${CEREBRAS_API_KEY}\n    base_url: https://api.cerebras.ai/v1\n    # Models: llama-3.3-70b, llama-4-scout-17b-16e-instruct, qwen-3-32b\n```\n\n## Local Models (Ollama)\n\nRun models locally for free inference:\n\n```yaml\nproviders:\n  local:\n    base_url: http://localhost:11434/v1\n    api_key: ollama\n```\n\n**Best local models for Hermes:**\n- **Nemotron 30B** \u2014 good all-around, fits in 24GB VRAM\n- **Qwen 2.5 32B** \u2014 strong reasoning, needs 24GB+\n- **Llama 3.3 70B Q4** \u2014 best quality, needs 40GB+ VRAM\n\n**For embeddings (free):**\n\n```yaml\nembedding:\n  provider: local\n  model: nomic-embed-text\n  base_url: http://localhost:11434\n```\n\n## Switching at Runtime\n\n```\n/model cerebras/llama-3.3-70b    # Full model path\n/model fast                       # Alias\n/model                            # Show current model\n```\n\n## Auxiliary Models (Task-Specific Models)\n\nHermes supports dedicated models for eight task types. 
Each can have its own provider, model, base_url, api_key, and timeout.\n\n| Task Type | What It Does | Default |\n|-----------|-------------|---------|\n| `vision` | Image analysis, screenshot understanding | auto |\n| `web_extract` | Summarizing scraped web pages | auto |\n| `compression` | Context compression (summarizing old messages) | auto |\n| `session_search` | Searching past conversation transcripts | auto |\n| `approval` | Deciding whether to auto-approve tool calls | auto |\n| `skills_hub` | Skill discovery and matching | auto |\n| `mcp` | MCP tool routing | auto |\n| `flush_memories` | Memory consolidation and cleanup | auto |\n\nWhen set to `\"auto\"` (default), Hermes walks a provider resolution chain: OpenRouter \u2192 Nous Portal \u2192 Custom endpoint \u2192 etc.\n\n**Configure in `~/.hermes/config.yaml`:**\n\n```yaml\nauxiliary_models:\n  # Use a fast cheap model for compression \u2014 it's just summarizing\n  compression:\n    provider: cerebras\n    model: llama-3.3-70b\n    timeout: 30\n\n  # Use a vision-capable model for image analysis\n  vision:\n    provider: openrouter\n    model: google/gemini-2.5-flash\n    timeout: 60\n\n  # Use local model for session search (free, frequent calls)\n  session_search:\n    provider: local\n    model: nemotron:latest\n    base_url: http://localhost:11434/v1\n    api_key: ollama\n\n  # Everything else stays on auto\n  web_extract: auto\n  approval: auto\n  skills_hub: auto\n  mcp: auto\n  flush_memories: auto\n```\n\n**Why bother:**\n- **Compression** runs on every long session. Using a cheap/fast model saves money without affecting quality (summarization doesn't need Opus).\n- **Vision** needs a multimodal model. If your main model doesn't do images, set this to one that does.\n- **Session search** is called frequently. A local model makes it free.\n- **Approval** controls auto-execution. A fast model here means less latency on every tool call.\n\n## Fallback Chain\n\nConfigure automatic fallback if the primary model fails:\n\n```yaml\nmodel_fallback:\n  - provider: cerebras\n    model: llama-3.3-70b\n  - provider: openrouter\n    model: anthropic/claude-sonnet-4\n  - provider: local\n    model: nemotron:latest\n```\n\nHermes tries each in order. If Cerebras is down, it falls back to OpenRouter, then local.\n\n---\n\n*Don't lock yourself into one provider. The best model is the one that's fast enough and cheap enough for the task at hand.*\n", "part20-observability.md": "# Part 20: Observability & Cost Control \u2014 Langfuse, Helicone, /usage, Routing Playbooks\n\n*You can't optimize what you can't see. Hermes tracks tokens, latency, and errors natively, but once you're running across CLI + Telegram + Discord + cron + coding-agent delegations, you want a real tracing stack. 
This part sets up Langfuse, Helicone, or OpenTelemetry \u2192 Phoenix with one config block, then gives you the cost-routing playbook that dropped our test deployment from $34 to $3 per feature implementation.*\n\n---\n\n## The Three-Level Stack\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502  Level 3 \u2014 Hosted tracing (Langfuse / Helicone / Phoenix)\u2502\n\u2502  Replayable traces, prompt versioning, evals            \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                            \u2191\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502  Level 2 \u2014 Hermes internals (/usage, /status, dashboard)\u2502\n\u2502  Token counts, rate-limit headers, per-session cost     \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                            \u2191\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502  Level 1 \u2014 Logs (~/.hermes/logs/*, `hermes logs tail`)  \u2502\n\u2502  Raw events, tool invocations, errors                   \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\nYou always have Level 1 and 2. 
Level 3 is the force multiplier once you're spending more than $50/mo on LLM calls.\n\n---\n\n## Level 1 + 2 \u2014 What Ships With Hermes\n\n### `/usage`\n\n```\n/usage                              # Current session\n/usage 7d                           # Rolling 7-day window\n/usage --by-provider                # Breakdown\n/usage --by-skill                   # Which skills burn tokens\n/usage --by-gateway                 # CLI vs Telegram vs Discord\n```\n\nAs of v0.9.0 this now includes **rate-limit headers** captured from each provider \u2014 you can see \"how close am I to the 5M/min ceiling\" without digging into logs.\n\n### Dashboard Analytics\n\nThe [Web Dashboard](./part12-web-dashboard.md) has an Analytics tab with:\n\n- Cost by day / week / month\n- Tokens in vs out (streaming-aware)\n- Per-skill utilization (which ones actually earn their token cost)\n- Tool call distribution (are you really using all those MCPs?)\n- Error rates per provider (for failover tuning)\n\n### `hermes logs`\n\n```bash\nhermes logs tail -f                 # Live tail, all gateways\nhermes logs search \"TokenLimit\"     # Grep\nhermes logs export --since 7d       # JSONL for offline analysis\n```\n\nCombine with `jq` or load into DuckDB for ad-hoc cost analysis:\n\n```bash\nhermes logs export --since 30d --format jsonl \\\n  | duckdb -c \"SELECT gateway, SUM(tokens_out) FROM read_json_auto('/dev/stdin') GROUP BY 1 ORDER BY 2 DESC\"\n```\n\n---\n\n## Level 3 \u2014 Langfuse (Recommended Default)\n\nLangfuse is the \"everything in one place\" option: tracing, prompt management, evals, self-hostable. If you're not sure where to start, start here.\n\n### Setup (Hosted Cloud)\n\n```yaml\n# ~/.hermes/config.yaml\nobservability:\n  langfuse:\n    enabled: true\n    host: https://cloud.langfuse.com\n    public_key: ${LANGFUSE_PUBLIC_KEY}\n    secret_key: ${LANGFUSE_SECRET_KEY}\n    sample_rate: 1.0                # Reduce for very high volume\n    traced_tools:                    # Which tool calls to capture\n      - terminal\n      - github\n      - claude-code\n      - gemini-cli\n    redact_payloads: true            # Redacts before sending (matches your security.secrets.patterns)\n```\n\nGet the keys from https://cloud.langfuse.com \u2192 Settings \u2192 API Keys. Free tier covers most individual users.\n\n### Self-Hosted Langfuse\n\nFor privacy or compliance, one-liner on a VPS with Docker:\n\n```bash\ncurl -fsSL https://langfuse.com/docker-compose.yml -o langfuse.yml\ndocker compose -f langfuse.yml up -d\n```\n\nPoint `host:` at your domain. Hermes sends OTLP over HTTPS, so Caddy with Let's Encrypt just works.\n\n### What You See\n\nEach Hermes turn becomes a trace. Each trace has spans for:\n\n- `agent.turn` (root)\n  - `llm.call` (with prompt, completion, tokens, cost, latency)\n  - `tool.call` (each tool with args, result, duration)\n    - nested `llm.call` for sampling-enabled MCP servers\n  - `memory.search` (queries and hits)\n  - `skill.load` (which skills got pulled in)\n\nReplay any turn, inspect the exact prompt, compare with previous runs, eval completions against datasets. This is how you find the turn that spent $4 on \"how should I name this variable\".\n\n---\n\n## Level 3 \u2014 Helicone (Gateway-First, Zero Code)\n\nHelicone is the \"swap the base URL and ship\" option. 
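\n\nTo see what that swap means concretely, here is a minimal standalone sketch using the `openai` Python SDK pointed at Helicone's OpenAI proxy (illustrative values; inside Hermes you only need the `providers:` block below):\n\n```python\nimport os\nfrom openai import OpenAI\n\n# Identical client code; only the base URL and one auth header change.\nclient = OpenAI(\n    api_key=os.environ[\"OPENAI_API_KEY\"],\n    base_url=\"https://oai.helicone.ai/v1\",\n    default_headers={\"Helicone-Auth\": f\"Bearer {os.environ['HELICONE_API_KEY']}\"},\n)\n\nresp = client.chat.completions.create(\n    model=\"gpt-4o-mini\",\n    messages=[{\"role\": \"user\", \"content\": \"ping\"}],\n)\nprint(resp.choices[0].message.content)  # the call now shows up in the Helicone dashboard\n```\n\n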
You don't add a tracing SDK \u2014 you route your LLM traffic through a proxy that observes it.\n\n```yaml\nproviders:\n  anthropic:\n    api_key: ${ANTHROPIC_API_KEY}\n    base_url: https://anthropic.helicone.ai\n    headers:\n      Helicone-Auth: Bearer ${HELICONE_API_KEY}\n      Helicone-Property-Session: ${HERMES_SESSION_ID}\n      Helicone-Property-Skill: ${HERMES_ACTIVE_SKILL}\n\n  openai:\n    api_key: ${OPENAI_API_KEY}\n    base_url: https://oai.helicone.ai/v1\n    headers:\n      Helicone-Auth: Bearer ${HELICONE_API_KEY}\n      Helicone-Cache-Enabled: \"true\"   # Automatic prompt caching\n```\n\nHermes passes session ID and skill name as Helicone custom properties, so you can filter traces by skill/session in the Helicone UI. Cache hits (identical prompts) are free \u2014 this alone cuts bills noticeably for repetitive skills.\n\nPick Helicone over Langfuse when:\n\n- You want zero code-level integration\n- You want provider-level prompt caching for free\n- You mostly care about cost + latency dashboards, not prompt management\n\n---\n\n## Level 3 \u2014 OpenTelemetry \u2192 Phoenix (Standards-First)\n\nIf you already run OpenTelemetry (Grafana, Datadog, Honeycomb), wire Hermes into your existing pipeline:\n\n```yaml\nobservability:\n  otel:\n    enabled: true\n    endpoint: https://otel.yourdomain.com:4318\n    protocol: http/protobuf\n    headers:\n      authorization: Bearer ${OTEL_TOKEN}\n    attributes:\n      service.name: hermes-prod\n      deployment.environment: production\n```\n\nHermes emits `gen_ai.*` spans following the [OpenInference](https://github.com/Arize-ai/openinference) conventions. Point them at [Arize Phoenix](https://phoenix.arize.com) (self-hosted or cloud) for an LLM-specific view; or at your existing Grafana/Tempo for a \"one pane of glass\" view.\n\n---\n\n## Cost Routing Playbook (The One That Actually Saves Money)\n\n### Rule 1: Route by Task Complexity, Not Default\n\nMost Hermes cost bloat comes from using Claude Opus / GPT-5 for tasks Kimi / GLM / MiniMax would handle identically. Set up a **task-aware default**:\n\n```yaml\nmodel_routing:\n  default:\n    model: claude-sonnet-4-20250514\n    provider: anthropic\n  routes:\n    - match: { intent: [classification, extraction, triage, sum_under_500_tokens] }\n      model: gemini-2.5-flash\n      provider: openrouter\n    - match: { intent: long_context, tokens_gte: 150000 }\n      model: gemini-2.5-pro\n      provider: openrouter\n    - match: { intent: [write_code, refactor, debug], complexity: medium }\n      model: glm-5.1\n      provider: zai\n    - match: { intent: [write_code, refactor, debug], complexity: high }\n      model: claude-sonnet-4-20250514\n      provider: anthropic\n    - match: { intent: [reasoning, math], complexity: high }\n      model: gpt-5.4\n      provider: openai\n```\n\nHermes classifies intent via a tiny prompt (~100 tokens) and routes accordingly. 
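\n\nAs a rough sketch of what that decision boils down to (illustrative Python only; the field names and the `pick_model` helper are assumptions, not the actual Hermes internals):\n\n```python\n# Illustrative routing sketch mirroring the config above; not the real Hermes code.\nROUTES = [\n    {\"intents\": {\"classification\", \"extraction\", \"triage\"}, \"provider\": \"openrouter\", \"model\": \"gemini-2.5-flash\"},\n    {\"intents\": {\"long_context\"}, \"tokens_gte\": 150_000, \"provider\": \"openrouter\", \"model\": \"gemini-2.5-pro\"},\n    {\"intents\": {\"write_code\", \"refactor\", \"debug\"}, \"complexity\": \"medium\", \"provider\": \"zai\", \"model\": \"glm-5.1\"},\n    {\"intents\": {\"write_code\", \"refactor\", \"debug\"}, \"complexity\": \"high\", \"provider\": \"anthropic\", \"model\": \"claude-sonnet-4-20250514\"},\n    {\"intents\": {\"reasoning\", \"math\"}, \"complexity\": \"high\", \"provider\": \"openai\", \"model\": \"gpt-5.4\"},\n]\n\ndef pick_model(intent: str, complexity: str, prompt_tokens: int) -> tuple[str, str]:\n    # First route whose intent, complexity, and token threshold all match wins.\n    for r in ROUTES:\n        if intent not in r[\"intents\"]:\n            continue\n        if r.get(\"complexity\") not in (None, complexity):\n            continue\n        if prompt_tokens < r.get(\"tokens_gte\", 0):\n            continue\n        return r[\"provider\"], r[\"model\"]\n    return \"anthropic\", \"claude-sonnet-4-20250514\"  # everything else pays for the default\n\n# pick_model(\"triage\", \"low\", 800) -> (\"openrouter\", \"gemini-2.5-flash\")\n```\n\n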
Empirically:\n\n| Scenario | Naive default (Sonnet 4.5) | Routed | Savings |\n|----------|----------------------------|--------|---------|\n| Feature implementation (100 calls) | ~$34 | ~$3 (mostly Kimi) | 91% |\n| Long-doc summarization (10 calls, 200K each) | ~$42 | ~$4 (Gemini 2.5 Pro) | 90% |\n| Daily classification triage | ~$18/day | ~$1/day (Flash) | 94% |\n\n### Rule 2: Prompt Caching Is Free Money\n\nEvery stable chunk (system prompt, skill, SOUL.md, memory digest) should be cached:\n\n```yaml\nprompt_caching:\n  enabled: true\n  providers: [anthropic, openai, helicone]\n  cache_system_prompt: true          # Biggest win\n  cache_skills: true\n  cache_memory_digest: true\n  min_cache_tokens: 1024             # Anthropic's minimum\n```\n\nAnthropic's prompt caching discount is ~90% on cached reads. For a 5K-token system prompt used 100 times a day, that's a real $2\u20135 a day saved.\n\n### Rule 3: Use Fast Mode Surgically\n\n[Fast Mode](./part14-fast-mode-watchers.md) (`/fast`) costs more per token but reduces queue latency. Use it for:\n\n- Interactive CLI sessions where you're watching the output\n- Telegram conversations where the user is waiting\n- Real-time voice flows\n\nDon't use it for:\n\n- Cron / scheduled tasks\n- Nightly analysis jobs\n- Long bulk operations\n\n```yaml\nfast_mode:\n  defaults:\n    cli: on\n    telegram: on\n    discord: on\n    cron: off\n    webhooks: off\n  user_override: true                # User can toggle with /fast\n```\n\n### Rule 4: Context Is the Real Cost \u2014 Use `/compress`\n\nMost sessions' 100th turn costs 10x the 10th turn. [`/compress <topic>`](./part14-fast-mode-watchers.md#compress-topic--guided-compression) plus the pluggable context engine can cap per-turn cost:\n\n```yaml\ncompression:\n  auto:\n    enabled: true\n    at_tokens: 48000                 # Compress when session exceeds this\n    preserve:\n      - last_n_turns: 10\n      - tool_results_matching: \"error|ERROR|failed\"\n    topics_from: active_skill         # Use active skill name as compression topic\n```\n\n### Rule 5: Alert on Cost Anomalies\n\n```yaml\nalerts:\n  cost_spike:\n    window: 1h\n    threshold_usd: 5                 # Alert if > $5 in an hour\n    channel: telegram_private\n  token_anomaly:\n    window: 10m\n    threshold_tokens_per_turn: 30000\n    channel: telegram_private\n```\n\nCatches runaway loops (a skill stuck in a retry tornado) and prompt injection attempts (attacker trying to burn your tokens).\n\n---\n\n## Eval-Driven Regression Prevention\n\nOnce you have Langfuse, add a dataset + evals for your critical paths:\n\n```bash\n# One-time setup\nhermes evals init\nhermes evals dataset create telegram-support-flows\nhermes evals dataset add telegram-support-flows ~/.hermes/traces/support/*.json\n\n# Run on every release\nhermes evals run telegram-support-flows --model claude-sonnet-4-20250514\nhermes evals run telegram-support-flows --model glm-5.1     # Check if cheaper model still passes\nhermes evals compare\n```\n\nThis is how you confidently swap a $10/Mtok model for a $0.30/Mtok one \u2014 empirically, not by vibes.\n\n---\n\n## What's Next\n\n- [Part 19: Security Playbook](./part19-security-playbook.md) \u2014 set cost alerts as an injection-detection signal\n- [Part 17: MCP Servers](./part17-mcp-servers.md) \u2014 MCP sampling costs show up in traces too\n- [Part 14: Fast Mode](./part14-fast-mode-watchers.md) \u2014 the fast-mode toggle referenced above\n- [Part 6: Context Compression](./part6-context-compression.md) \u2014 the 
compression system that backs Rule 4\n"}