Forem

Retro gaming guide: CSS scanlines, Orbitron and dark theme without JS

Odilon HUGONNOT — Wed, 20 May 2026 09:00:06 +0000

I needed a buying guide for portable retro consoles: the kind of page you browse from a couch on a Saturday night before caving in and ordering something on eBay. Static content, no dynamic behaviour, no API. A perfect excuse to write unrestrained CSS without having to justify it to a team. The vision: a cyberpunk dark theme, scanlines that echo old CRT screens, retro-futuristic typography. And crucially, no framework, no JavaScript whatsoever.

What I learned building it is that the most visually striking effects don't come from animation libraries — they come from the right combination of CSS gradients, pseudo-elements, and well-structured custom properties. Here's how it works.

Context

The target page: retro-consoles.html. A buying guide covering around twenty portable retro consoles — Game Boy, Analogue Pocket, Miyoo Mini, Anbernic RG35XX and the rest. Each console gets its specs, its price, its pros and cons.

The technical constraint was simple: a static page hosted on the same Apache server as the CV, zero runtime dependencies. The aesthetic goal was more ambitious: recreating the visual atmosphere of a 1980s CRT screen in a modern browser. The cyan accent colour #00d4ff, chosen for its resemblance to phosphor green shifted to electric blue, sets the tone. The rest of the theme follows from that.

The scanlines effect — pure CSS

Scanlines are the most visually recognisable effect and the simplest to implement. The idea: a ::before pseudo-element fixed to the viewport, covering the entire page, with a repeating-linear-gradient that alternates one transparent pixel row and one semi-opaque pixel row, so the pattern repeats every two pixels.

body::before {
    content: '';
    position: fixed;
    inset: 0;
    pointer-events: none;
    z-index: 9999;
    background: repeating-linear-gradient(
        to bottom,
        transparent,
        transparent 1px,
        rgba(0, 0, 0, 0.08) 1px,
        rgba(0, 0, 0, 0.08) 2px
    );
}

Three decisions here that are worth spelling out.

position: fixed, not absolute. The effect needs to overlay all content even when scrolling, like a real physical screen. fixed stays anchored to the viewport.

pointer-events: none. Without this, the pseudo-element absorbs all clicks and the page becomes unusable. A classic oversight with CSS overlays.

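Opacity at 0.08 on the dark lines. That number is not derived from anything; it is the result of several iterations. Too strong (0.2+) and the content becomes unreadable. Too weak (0.03) and the effect all but disappears. Because the lines are black, they show mostly over the lighter elements (text, cards, accent colours) rather than over the near-black page background, so a low value is enough to read as texture without dimming the content.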
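To put numbers on that calibration, here is a minimal sketch of what a black overlay does to the colours underneath it. It assumes plain source-over compositing on opaque colours and uses the --bg-primary and --text-primary values from the theme; the function name is mine, not something from the page.

```python
def darken(rgb, alpha):
    """Composite a black overlay of the given alpha over an opaque RGB colour."""
    # Source-over with a black source: every channel is scaled by (1 - alpha).
    return tuple(round(channel * (1 - alpha)) for channel in rgb)


PAGE_BACKGROUND = (0x0A, 0x0A, 0x0F)  # --bg-primary, #0a0a0f
BODY_TEXT = (0xE8, 0xE8, 0xF0)        # --text-primary, #e8e8f0

# Compare the three opacities discussed above on the darkest and lightest theme colours.
for alpha in (0.03, 0.08, 0.2):
    print(alpha, darken(PAGE_BACKGROUND, alpha), darken(BODY_TEXT, alpha))
```

At 0.08 the page background moves by about one level per channel, while the body text drops from 232 to 213 on its red and green channels. The lines are therefore carried almost entirely by the lighter content.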

Retro typography: Orbitron, Space Mono, why these choices

Type choices carry most of the visual identity in a themed design. Three fonts, three distinct roles:

/* Headings: Orbitron — geometric shapes, angular uppercase */
h1, h2, h3 {
    font-family: 'Orbitron', monospace;
    letter-spacing: 0.05em;
    text-transform: uppercase;
}

/* Prices, technical specs: Space Mono — monospace with personality */
.price,
.spec-value {
    font-family: 'Space Mono', monospace;
}

/* Body copy: DM Sans — readable, neutral, unobtrusive */
body {
    font-family: 'DM Sans', sans-serif;
}

Orbitron for headings: the archetypal retro-futuristic typeface, drawn with geometric shapes and right angles. Works equally well for "GAME BOY" and "ANALOGUE POCKET". The trade-off: it's unreadable at body copy sizes. A paragraph set in Orbitron at 14px makes you want to close the tab.

Space Mono for numerical data. A monospace designed by Colophon Foundry with more personality than Courier or Roboto Mono. Prices aligned vertically in Space Mono on a dark background recall 1980s terminal screens.

DM Sans for everything else. The rule is simple: never put a display typeface in body copy. Orbitron for 500 words of console descriptions would be cruel.

Custom properties for the entire theme

A coherent dark theme with a single accent colour is best managed through custom properties from the start. No theme colour is hardcoded in the component rules; everything goes through :root.

:root {
    /* Backgrounds */
    --bg-primary:   #0a0a0f;
    --bg-secondary: #111118;
    --bg-card:      #16161f;
    --bg-card-hover:#1c1c28;

    /* Main accent */
    --accent:       #00d4ff;
    --accent-dim:   rgba(0, 212, 255, 0.15);
    --accent-glow:  0 0 20px rgba(0, 212, 255, 0.3);

    /* Text */
    --text-primary:  #e8e8f0;
    --text-secondary:#9090a8;
    --text-muted:    #5a5a72;

    /* Borders */
    --border:        rgba(0, 212, 255, 0.2);
    --border-strong: rgba(0, 212, 255, 0.5);
}

A few details that matter. --accent-dim: a heavily attenuated version of the accent for hover backgrounds and badges. Without it, you end up recomputing rgba(0, 212, 255, X) all over the stylesheet. --accent-glow: the neon halo on featured elements, defined once as a box-shadow value.

The card hover glow then reduces to:

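.console-card {
    /* Declared on the base rule so the effect animates both in and out */
    transition: background 0.2s ease, border-color 0.2s ease,
                box-shadow 0.2s ease, transform 0.2s ease;
}

.console-card:hover {
    background: var(--bg-card-hover);
    border-color: var(--border-strong);
    box-shadow: var(--accent-glow);
    transform: translateY(-2px);
}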

Responsive layout without a framework

CSS Grid with auto-fill and minmax is one of the simplest ways to build a truly responsive multi-column layout without a single media query for the cards:

.consoles-grid {
    display: grid;
    grid-template-columns: repeat(auto-fill, minmax(280px, 1fr));
    gap: 24px;
}

With minmax(280px, 1fr), the browser calculates how many columns of at least 280px fit, gaps included. Assuming the grid spans the full viewport width: on a 1440px screen, 4 columns; on an iPad at 768px, 2 columns; on a phone, 1 column. No manual breakpoints needed for the grid itself.
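The column counts can be checked with a few lines of arithmetic. This is a sketch of the fitting rule, not browser code: it assumes the grid container is exactly as wide as the viewport (no page padding), and the function name is hypothetical.

```python
def auto_fill_columns(container_width, min_track=280, gap=24):
    """Number of columns repeat(auto-fill, minmax(min_track, 1fr)) produces."""
    # n tracks need n * min_track + (n - 1) * gap pixels, so the largest n that
    # fits is floor((width + gap) / (min_track + gap)), with a minimum of one track.
    return max(1, (container_width + gap) // (min_track + gap))


for width in (1440, 768, 375):
    print(width, auto_fill_columns(width))
```

With real page padding the container is narrower than the viewport, so the thresholds shift slightly, but the rule stays the same.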

Media queries are still needed for other adjustments: reducing Orbitron font-size on mobile (geometric headline fonts are wide), switching the header from two columns to one, adjusting padding.

.page-header {
    display: grid;
    grid-template-columns: 1fr auto;
    align-items: center;
    gap: 32px;
}

@media (max-width: 768px) {
    .page-header {
        grid-template-columns: 1fr;
    }

    h1 {
        font-size: clamp(1.4rem, 5vw, 2.5rem);
    }
}

clamp() for Orbitron headings: essential. Without it, a 2.5rem h1 in Orbitron overflows the viewport on an iPhone SE.
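What clamp(1.4rem, 5vw, 2.5rem) resolves to at a given width is easy to work out by hand. The sketch below does it in pixels, assuming the default 16px root font size; the viewport widths are illustrative, not measurements from the page.

```python
def clamp_font_size(viewport_width, min_rem=1.4, vw=5, max_rem=2.5, root_px=16):
    """Resolve clamp(min_rem, vw, max_rem) to pixels for a viewport width in px."""
    preferred = viewport_width * vw / 100  # 1vw is 1% of the viewport width
    return max(min_rem * root_px, min(preferred, max_rem * root_px))


for width in (320, 375, 768):
    print(width, clamp_font_size(width))
```

On a 320px or 375px phone the preferred 5vw value falls below the floor, so the heading sits at 1.4rem (22.4px) instead of the 40px a fixed 2.5rem would give.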

Conclusion

This project reinforced something I knew in theory but regularly underestimate in practice: a strong visual design doesn't require JavaScript. The card hover animation, the scanlines effect, the neon glow: all of it works with transition, box-shadow, and a pseudo-element. The browser interpolates and paints every frame itself, without a single requestAnimationFrame.

The real complexity wasn't technical — it was visual calibration. Finding the scanline opacity that creates atmosphere without hurting legibility. Choosing a cyan luminosity that evokes phosphor without burning eyes. These are decisions that don't get written in code; they get validated by eye.

The result: retro-consoles.html.

How to Build an Autonomous AI Coding Agent That Opens GitHub PRs Overnight

pickuma — Wed, 20 May 2026 08:59:27 +0000

You file an issue before bed: "Migrate the date helpers off moment.js." You wake up to a draft pull request — branch created, files changed, tests green, waiting for review. That is the pitch for an autonomous AI coding agent, and the surprising part is how little of it is novel. The hard problem is not the model. It is the loop around the model: the harness that turns a task into a reviewable PR with nobody in the chair.

We built this pattern and ran it against real repositories. What follows is the architecture that held up, the GitHub wiring that kept it safe, and an honest account of which tasks it finishes and which it quietly botches.

Anatomy of the overnight loop

An autonomous coding agent is a state machine with a language model wired into a few of its transitions. Strip away the marketing and five stages remain:

Ingest — pull the task (a GitHub issue, a queue row, a line in a file) and the repo into a clean working directory.
Plan — one model call reads the task and the repo layout, then emits a concrete plan: which files change, in what order, and what "done" looks like.
Execute — a separate model call edits files to match the plan, one coherent change at a time.
Verify — run the test suite, the type checker, and the linter.
Package — commit, push a branch, open a pull request.

The mistake most first attempts make is collapsing stages two through four into one enormous prompt: "here is the repo, here is the task, output the diff." That works for a three-line fix and falls apart on anything larger. Chaining narrow steps buys you something a single prompt cannot — a checkpoint between each stage where the work can be inspected before the agent commits to it.

Stage four is what separates a coding agent from a code generator. A model with no feedback loop will cheerfully report success on code that does not compile. Wire the executor to the verifier so a failing test run feeds the actual error text back into the next edit. Bound the retries (three attempts is a sane ceiling), and if the agent still cannot reach green, it should stop, open the PR as a draft that records the failure, and log it, rather than pass broken code off as finished or loop forever burning tokens.
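The execute and verify wiring can be sketched in a few lines. This is a minimal illustration under assumptions, not our production harness: edit and verify are hypothetical callables standing in for the model call and the test run, and the dataclass names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class VerifyResult:
    passed: bool
    error_text: str = ""


@dataclass
class RunOutcome:
    green: bool
    attempts: int
    last_error: str


def execute_until_green(edit, verify, max_attempts=3):
    """Alternate edit and verify, feeding the real error text into the next edit."""
    feedback = None  # the first edit has no failure to react to
    for attempt in range(1, max_attempts + 1):
        edit(feedback)
        result = verify()
        if result.passed:
            return RunOutcome(True, attempt, "")
        feedback = result.error_text
    # Retries exhausted: report the failure so the caller can open a draft PR and log it.
    return RunOutcome(False, max_attempts, feedback or "")
```

The caller decides what a non-green outcome means; the loop itself never retries past the ceiling.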

An autonomous agent with repo write access is a contributor you have never met. Give it a fine-grained personal access token scoped to a single repository — never the org-wide token sitting in your shell. Run the execute stage in a container or disposable VM, not on your laptop, because generated code runs npm install and whatever else it decides it needs. Keep branch protection on main so the worst plausible outcome is a bad pull request, not a bad commit on your default branch.

Wiring it to GitHub without losing a finger

Once the agent has a green working tree, the GitHub mechanics are routine. The pattern that held up for us:

One branch per task, named predictably — agent/142-moment-migration, keyed off the issue number. Predictable names make reruns idempotent: if the branch already exists, update it instead of spawning a duplicate.
Open the PR as a draft and assign yourself as reviewer. Draft status tells the rest of the team the change is not merge-ready and discourages a reflexive approval.
Label every bot-authored PR — agent-generated or similar. That label is your provenance trail, and if the diff reaches users it is the basis for any disclosure you owe them.
Let CI run on the agent's PR exactly as it would on a human's. CI is the safety net the agent's own verify stage cannot fully replace, because it runs in a clean environment you control rather than the agent's sandbox.
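Predictable branch names are simple to generate. The sketch below assumes a slug derived from a short task title; both helper names are hypothetical, and the set of existing branches would come from wherever you list remote branches.

```python
import re


def agent_branch_name(issue_number, title, prefix="agent"):
    """Build a deterministic branch name such as agent/142-moment-migration."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{prefix}/{issue_number}-{slug}"


def branch_action(name, existing_branches):
    """Reruns update the existing branch instead of spawning a duplicate."""
    return "update" if name in existing_branches else "create"


name = agent_branch_name(142, "Moment migration")
print(name, branch_action(name, {"main"}))
```

Because the name depends only on the issue number and title, running the same task twice lands on the same branch.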

The gh CLI keeps packaging short: gh pr create --draft --base main --head agent/142-moment-migration. The same command accepts --label and --reviewer, which covers the basics. For richer control, such as reading PR state back or reacting to review events, the Octokit REST client earns its dependency.

For the trigger you have two clean options. A cron job firing at 2 a.m. drains a task queue overnight. Or a GitHub webhook: label an issue agent-ready, and the labeling event starts a run. The webhook route is closer to a real workflow, because the task and its trigger live where your team already works.
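On the webhook route, the receiving end has one decision to make: does this delivery start a run? A minimal sketch, assuming the handler already has the X-GitHub-Event header value and the parsed JSON payload; the function name is hypothetical and signature verification is left out.

```python
def should_start_run(event_name, payload, trigger_label="agent-ready"):
    """Return True only for an issue that has just received the trigger label."""
    # GitHub sends issue events with an "action" field; "labeled" deliveries
    # carry the label that was added under the "label" key.
    if event_name != "issues" or payload.get("action") != "labeled":
        return False
    return payload.get("label", {}).get("name") == trigger_label


delivery = {"action": "labeled", "label": {"name": "agent-ready"}, "issue": {"number": 142}}
print(should_start_run("issues", delivery))
```

Checking the added label rather than the issue's full label list keeps an unrelated label change from restarting a run.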

What it automates well — and where it stalls

The overnight agent earns its keep on work that is mechanical but tedious: bumping a dependency and fixing the fallout, adding test coverage to an untested module, migrating a deprecated API call across a codebase, running a codemod, converting config files, tidying stale documentation. These tasks share one trait — "done" is objectively checkable. A passing test suite or a clean type check is enough for the verify stage to know it succeeded.

It stalls on the opposite kind of work. Ambiguous requests ("make the dashboard feel faster"), cross-cutting architecture changes, anything resting on product judgment, and — most important — any task in a repo whose test suite is thin. No verify gate means no safety, and the agent's confidence in its own output is not a substitute for one.

Two numbers decide whether this is worth running. The first is cost per run: every stage is one or more model calls, and a non-trivial task with retries can reach dozens. Set a hard token or dollar ceiling per run so a stuck agent cannot run up a bill while you sleep. The second is PR acceptance rate — the share of generated PRs you merge without substantial rewriting. If you rewrite more than you keep, the tasks are scoped too loosely; tighten them until the agent succeeds reliably, then widen the scope carefully from there.
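Both numbers are cheap to track. The sketch below is one possible shape, with hypothetical names: the harness would call charge after every model call, and the acceptance rate would be computed from whatever PR log you keep.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run spends more than its hard ceiling."""


class RunBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens):
        """Record tokens spent by one model call and stop the run past the ceiling."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")


def acceptance_rate(merged_without_rewrite, total_prs):
    """Share of generated PRs merged without substantial rewriting."""
    return merged_without_rewrite / total_prs if total_prs else 0.0
```

Raising an exception, rather than returning a flag, makes it hard for a stuck loop to ignore the ceiling.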

The morning review stays non-negotiable. The agent's job is to hand you a pull request you can judge in a few minutes instead of work that would have cost you an hour. A bad PR you rubber-stamp at 9 a.m. is worse than no PR — so treat every overnight diff as untrusted until CI is green and you have read it yourself. Built this way, the agent is not a replacement for you. It is a night shift that handles only the boring parts.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.

Continual Harness: The Gemini Pokémon Agent That Rewrites Its Own Loop

pickuma — Wed, 20 May 2026 08:58:10 +0000

Most of the work that makes an AI agent good never happens inside the model. It happens in the harness — the code that feeds the model its observations, defines its tools, trims its context, and decides what to do with each response. When an agent fails, the usual fix is a human editing that harness: rewording a tool description, adding a memory store, changing how a screenshot gets summarized. The Continual Harness work, from the teams behind Gemini Plays Pokémon and the PokeAgent benchmark, pushes on a sharper question — what if the model edited the harness itself, while the run was still going?

The harness is where agents actually live

Gemini Plays Pokémon was a public demonstration: a Gemini model worked through a Game Boy Pokémon title via a harness that turned the game into something a language model could reason about. The harness converted pixels into labeled screenshots, a map of the current area, and an inventory list, then exposed button presses and pathfinding helpers as tools. The model never touched raw emulator memory. It saw whatever the harness chose to show it, and it acted only through the tools the harness defined.

That structure is not specific to Pokémon. A coding agent doesn't see your repository — it sees the files a retrieval step pulled in. A browser agent doesn't see a webpage — it sees an accessibility tree some extraction code produced. The harness is the agent's entire sensory system, its motor system, and its memory. The model is one component inside it.

Which means most of the leverage in agent quality sits in the harness, not the weights. Teams running long agent tasks spend their time there: tightening tool descriptions, adding retry logic, changing how context gets summarized so the model stops losing the thread on long runs. That iteration is real engineering, and almost all of it happens offline — a human watches a failure, edits the scaffolding, and starts a fresh run.

When people say an agent "got smarter," they often mean the harness got better — the model checkpoint never moved. That is worth internalizing before you reach for a fine-tuning budget.

What "continual" changes

The Continual Harness pattern moves that improvement loop inside the run. The agent is given write access to parts of its own harness. When it hits a recurring failure — say it keeps walking into a ledge because the pathfinding helper doesn't model one-way tiles — it can propose a change to that helper, apply it, and continue with the improved tool in hand. The scaffolding at hour ten is not the scaffolding the run started with.

This is online adaptation, and it sits between two things developers already know. It is not fine-tuning: the model weights stay frozen. It is not ordinary in-context learning either, where the model only writes itself a note. The improvement lands as durable code — a function the agent rewrote — so it persists, it is inspectable, and it can be reverted. The model is playing the game and maintaining the controller at the same time.

The reason this matters beyond a Pokémon stream: the manual harness-tuning loop is a bottleneck. Every agent team has a backlog of "the tool description is slightly wrong" and "the memory step drops the wrong thing" fixes that a human has to notice, diagnose, and ship. An agent that can do a slice of that work itself, on the failures it is actually hitting, compresses that loop from days to minutes.

An agent with write access to its own harness can also break it. A bad edit to the navigation tool can strand the run; a bad edit to the context-trimming step can quietly degrade every later decision. This pattern is only safe with versioned edits, a fixed core loop the agent cannot touch, and automatic rollback when a change makes the feedback signal worse.

Borrowing the pattern for your own agents

You do not need a Game Boy emulator to use this. The pattern reduces to four decisions.

Separate the editable surface. Decide explicitly which parts of the harness the agent may rewrite — tool wrappers, prompt templates, retrieval filters — and which are permanently off-limits: the loop that calls the model, the kill switch, anything that touches credentials or external writes. The self-improving part should be a small, well-fenced area.

Treat every harness edit as a commit. A self-improvement is a diff. Give it a message, a test, and a revert path. If you cannot answer "what did the agent change, and how do I undo it," you do not have a continual harness — you have an agent slowly corrupting itself.

Give it a feedback signal it can act on. Pokémon has an obvious one: progress through the game. Your agent needs an equivalent — task success rate, an eval suite, a latency budget. Without a metric, the agent edits blind, and you cannot tell improvement from regression.

Start narrow. Let the agent tune tool descriptions and retry thresholds long before you let it rewrite tool implementations. Widen the editable surface only as the rollback machinery proves itself.
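The four decisions fit in a small amount of code. This is a sketch of the pattern under my own assumptions, not the Continual Harness implementation: the editable surface is a plain dictionary, the score callable stands in for whatever feedback signal you have, and every name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HarnessEdit:
    key: str
    new_value: str
    message: str  # every edit carries a commit-style message


class EditableHarness:
    def __init__(self, editable, protected):
        self.editable = dict(editable)        # the fenced-off surface the agent may rewrite
        self.protected = frozenset(protected)  # core loop, kill switch, credentials
        self.history = []                      # (edit, previous_value) pairs for revert

    def apply(self, edit):
        if edit.key in self.protected or edit.key not in self.editable:
            raise PermissionError(f"{edit.key} is outside the editable surface")
        self.history.append((edit, self.editable[edit.key]))
        self.editable[edit.key] = edit.new_value

    def revert_last(self):
        edit, previous = self.history.pop()
        self.editable[edit.key] = previous
        return edit


def apply_with_rollback(harness, edit, score):
    """Keep the edit only if the feedback signal does not get worse."""
    before = score(harness)
    harness.apply(edit)
    if score(harness) < before:
        harness.revert_last()
        return False
    return True
```

In a real run the score would be noisy, so you would compare over several tasks before deciding to keep or revert.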

If you want to watch a constrained version of this loop before wiring it into an autonomous run, an AI-native code editor is the closest everyday analog: an agent proposes edits to real code, and you approve or reject each diff.

The Continual Harness result is not that an agent finished a Pokémon game. It is that the harness — long treated as fixed scaffolding a human owns — can be a live, model-editable surface. For anyone building agents that run for hours, that reframes where the next improvement comes from.

oh-my-agent v2: Nine New Skills, First-Class Cursor, and an 80/100 Benchmark

pickuma — Wed, 20 May 2026 08:56:54 +0000

If you have watched an AI coding agent install a package version that does not exist in your lockfile, or ship a function that fails your own lint config on the first commit, you already understand the gap oh-my-agent v2 is built to close. The framework's second major release adds nine new skills, promotes Cursor to a first-class vendor, and ships a benchmark that scores the toolkit 80 out of 100.

Here is what v2 changes, and how to decide whether the additions target real failure modes or just expand the surface area.

What oh-my-agent does

oh-my-agent is a skill layer that sits between you and whatever AI coding agent you run. The name borrows from oh-my-zsh, and the analogy holds: instead of configuring shell behavior, you configure agent behavior with reusable, composable instruction modules the project calls skills.

The problem it targets is consistency. A raw coding agent keeps no durable memory of your project's conventions. Ask it to add a dependency and it may guess a version that is not in your lockfile. Ask it to write a component and it may ignore the lint config sitting in your repo root. These are not edge cases — they are the default behavior of an agent that treats every request as a fresh context.

A skill in oh-my-agent is a packaged set of instructions and checks the agent loads when a task matches. One skill might force the agent to read your package.json and lockfile before proposing a version. Another might surface your linter rules before any code is written. The pitch is that you stop re-explaining the same constraints in every prompt.
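To make the lockfile example concrete, here is what such a check could look like. This is an illustration only, not oh-my-agent's code or API: it assumes an npm package-lock.json in the lockfileVersion 2 or 3 layout, looks only at top-level installs, and uses function names I made up.

```python
import json


def locked_version(lockfile_text, package):
    """Return the version npm locked for a top-level package, or None."""
    lock = json.loads(lockfile_text)
    entry = lock.get("packages", {}).get(f"node_modules/{package}")
    return entry.get("version") if entry else None


def check_proposed_version(lockfile_text, package, proposed):
    """Return a complaint the agent must address, or None if the proposal matches."""
    locked = locked_version(lockfile_text, package)
    if locked is None:
        return f"{package} is not in the lockfile"
    if locked != proposed:
        return f"{package} is locked at {locked}, not {proposed}"
    return None


sample = json.dumps({"lockfileVersion": 3,
                     "packages": {"node_modules/moment": {"version": "2.30.1"}}})
print(check_proposed_version(sample, "moment", "3.0.0"))
```

The point of packaging this as a skill is that the check runs before the agent proposes a version, not after the install fails.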

The nine new skills in v2

The v2 release adds nine skills. Three are worth calling out, because they map to problems most teams hit within a week of adopting an agent.

deepsec handles security review. Instead of trusting the agent to remember secure patterns, the skill runs a structured pass over generated code, checking for the injection, secret-handling, and trust-boundary mistakes agents introduce when they optimize for making something work.

observability pushes the agent to add logging, metrics, and tracing as it writes code, rather than leaving instrumentation as a follow-up task that never happens.

docs drift detection is the one most teams underrate. When an agent changes a function signature or a config option, the matching documentation usually goes stale without anyone noticing. This skill flags the gap so docs and code stay in sync.

If you adopt only one skill from v2, start with docs drift detection. Stale documentation is the failure you notice last and pay for longest: every new teammate and every future agent run inherits the wrong mental model from it.

The remaining six skills round out areas like testing and project conventions. The pattern across all nine is the same: take a step a developer is supposed to do, and make it a non-optional part of the agent's workflow instead of a hope.

Cursor becomes a first-class vendor

Earlier oh-my-agent releases were built around one agent and treated the rest as second-class. v2 changes the model. A vendor is the underlying agent that executes skills, and Cursor is now a first-class vendor, which means skills are tested against it and ship with Cursor-specific wiring rather than a generic fallback.

In practice, you can keep oh-my-agent's skill definitions in one place and run them through Cursor's agent without rewriting instructions per tool. For teams that have standardized on Cursor as their editor, that removes the main reason to maintain a separate, hand-rolled set of project rules.

First-class status is a maintenance commitment, not a one-time feature. The thing to watch over the next few releases is whether Cursor support keeps pace with the primary vendor or quietly drifts behind it — the usual failure pattern for multi-vendor tools.

What the 80/100 benchmark does and doesn't tell you

v2 ships with a benchmark that scores the toolkit 80 out of 100. A published, repeatable number is useful on its own: it gives you a baseline to compare future releases against, and it signals the project is willing to measure itself instead of leaning on adjectives.

Treat the number as a starting point, not a verdict. A benchmark reflects the tasks its authors chose. An 80 on the project's own suite tells you the skills behave as designed on that suite. It does not tell you how they perform on your codebase, your stack, or your conventions.

Do not adopt oh-my-agent on the strength of the 80/100 score alone. Run the skills against a real branch in your own repo and measure something you care about — failed lint checks, wrong dependency versions, broken builds — before and after. A framework's self-reported benchmark is a sales sheet until you reproduce it.

The honest read on v2: the release aims squarely at the most common, least glamorous agent failures — wrong versions, ignored configs, stale docs — rather than chasing a flashier capability. That is the right target. The open question is operational. Nine new skills is a lot of surface to keep working across two first-class vendors, and the real proof will be whether release three holds the line.

How Far Can a Small Coding Model Go With a Better Harness?

Dmitry Barakhov — Wed, 20 May 2026 08:55:18 +0000

Every time a coding agent underperforms, the default move is to swap in a bigger model. I wanted to see what happens if you refuse that move and fix everything else instead.

The result: 61.6% ± 1.9 on Terminal-Bench 2.0 with GPT-5.1-Codex-Mini — rank #41, in the same band as stock harnesses running flagship models a tier or two larger. 445 runs, $27, ~35 hours.

This is not an argument that small models are secretly enough. It is an argument that the wrapper around the model is doing more work than most people give it credit for — and that you can see this clearly only when the model is small enough that harness mistakes actually hurt. What follows is a teardown of what survived.

Reading the number

The score is verified on the official leaderboard at rank #41 as of May 14, 2026, across 89 tasks with 5 runs each. Leaderboards move, so I treat the rank as a timestamped snapshot rather than a permanent claim. The useful comparison is the neighborhood around that snapshot: entries immediately around rank #41 run on GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro.

The wrapper, in other words, moves a smaller model into the same band as larger ones. The ±1.9 confidence interval comes from 5 runs per task across 89 tasks — wide enough that I treat the score as a band, not a precise rank, but tight enough that I do not think any one lucky trajectory is doing the work.

One of the improvements that stood out during local iteration was something almost embarrassingly small: a 100-token classifier call at the start of each task that picks which Markdown skill files to drop into the system prompt. I do not have a clean 445-run ablation for every variant, so this is an engineering observation rather than a controlled benchmark result. Everything else (streaming retries, the V4A patcher, tool-output capping) is supporting structure that lets one loop keep moving without falling over.

A one-time skill router picks which Markdown playbooks get appended to the system prompt; after that, a single model runs a loop against seven tool entries. Side-effecting tools (run_command, apply_patch) mutate the Terminal-Bench container; read-only tools (search_docs, get_docs, web_search) do not. A context policy caps tool outputs and preserves failures verbatim before they re-enter the model.

What Terminal-Bench actually is

Terminal-Bench 2.0 drops your agent into a hardened container with a task instruction and a terminal. That is the whole interface — no iteration cap, no filesystem visibility from outside, no retries. If you want any of that, you build it yourself. Eighty-nine tasks span chess-engine guidance, R-to-Python Stan migrations, QEMU bring-up, hash cracking, Core Wars, CSV surgery via Vim macros, and a long tail of things that look obscure until they bite you.

I picked it because frontier labs cite it themselves: it shows up in OpenAI's GPT-5.5 announcement and in Anthropic's Claude Opus 4.7 post as a headline coding-agent score. A benchmark that labs use to grade their own flagships is a reasonable place to test whether a small model can punch up. Runs go through Harbor, the evaluation framework Terminal-Bench 2.0 standardized on.

Why a small model in the first place

If the model is already the largest one available, every improvement is hard to attribute: did the harness help, or did the model muscle through? With a smaller model, harness mistakes show up quickly. Bad tool design, noisy context, fragile editing, and over-planning all become visible because the model has less slack.

Hookele, the harness behind this result, started from OpenAI's GPT-5.1 coding agent notebook, which wires run_command, apply_patch, web_search, and Context7 docs lookup (a third-party library-documentation service) into the Responses API. The cookbook walks through the happy path. What follows is what happens when the happy path breaks: streams that drop mid-reasoning, diffs that fail to apply because context drifted by two characters, models that stall for 40 iterations without doing anything productive.

The loop

Hookele is one model in one loop. On the first turn the executor calls update_plan with 3–5 short steps and nothing else; after that the plan lives in the context window and the model can rewrite it whenever the approach changes. Every subsequent turn is a tool call against the same seven tool entries: run_command, apply_patch, update_plan, search_docs, get_docs, web_search, task_complete. No per-turn router, no state machine, no separate planner. The cap is 60 iterations by default, but most successful runs finish in under 20.

This is what survived deletion. An earlier version had a gpt-5-mini planner that emitted JSON with approach and tools_needed to constrain the executor. It misclassified about one in five tasks — usually deciding "no file editing needed" on tasks whose solution was a small file edit — and the executor would dutifully not edit files. I removed it. The same pass took out a five-state scan→plan→act→verify→stop machine, an acceptance-criteria extractor, automated criteria evaluators, and dual-model routing. None of them survived contact with the benchmark. Each one had looked principled in isolation and added nothing the executor could not infer for itself.

The tool surface got similar treatment. list_files and read_file came out because run_command already does both through ls and cat, and exposing duplicate tools just gave the model another decision to make on every turn. Tool outputs are capped at 20K characters, head + tail, so a chatty npm install cannot drown the context. Fewer tools means fewer decisions about which tool to use, which means more turns spent on the actual problem — and on a small model, every saved turn is real.
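The head + tail cap is a few lines of string slicing. A minimal sketch, assuming an even split between head and tail and a marker of my own invention; the actual split and marker text in Hookele may differ.

```python
def cap_output(text, limit=20_000):
    """Keep the first and last parts of a long tool output and drop the middle."""
    if len(text) <= limit:
        return text
    head_len = limit // 2
    tail_len = limit - head_len
    omitted = len(text) - limit
    marker = f"\n... [{omitted} characters omitted] ...\n"
    # Slice from an explicit index so a zero-length tail yields an empty string.
    return text[:head_len] + marker + text[len(text) - tail_len:]
```

Keeping the tail matters because build tools and test runners usually print the error that counts at the end.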

The two pieces of low-level plumbing under the loop that actually earn their keep are the V4A patcher and the streaming layer.

The patcher

Every edit goes through one tool: apply_patch. I standardized on it instead of a generic edit or write because Codex models are trained directly on the V4A patch format, and because structured diffs are easier to validate, retry, and explain than arbitrary file writes. V4A is OpenAI's diff envelope for Codex-family models — in Hookele, one payload can create, update, and delete files in a single call:

*** Begin Patch
*** Add File: hello.txt
+Hello world
*** Update File: src/app.py
@@ def greet():
-    print("Hi")
+    print("Hello, world!")
*** Delete File: obsolete.txt
*** End Patch

The @@ line is the context anchor — the patcher locates the hunk by matching the line that follows. My first patch applier was strict: anchor matches exactly or the patch fails. It failed constantly. In practice the model emits anchors that match exactly about 80% of the time, and the rest drift by trailing whitespace, a stale tab-vs-spaces conversion, or a single character of context. Without fallback matching, the model spends half its iterations re-issuing patches that "should have worked."

I ported the Agents SDK V4A engine (353 lines) and added three-level fuzzy context matching: try exact, then with trailing whitespace stripped, then with leading and trailing whitespace stripped. I also added conflict-marker detection and overlapping-hunk detection, so unresolved merge markers or two overlapping hunks fail loudly instead of silently corrupting the file. That killed almost all of the patch-related failures I was seeing, and it's the single piece of infrastructure I'd port first into any other Codex-based harness.
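The three matching levels can be sketched independently of the rest of the patcher. This is a simplified illustration, not the ported Agents SDK engine: it only locates a block of context lines in a file, and the function name and return shape are my own.

```python
def find_context(file_lines, context_lines):
    """Locate context_lines in file_lines; return (start_index, level) or None."""
    levels = [
        ("exact", lambda line: line),
        ("rstrip", lambda line: line.rstrip()),  # ignore trailing whitespace
        ("strip", lambda line: line.strip()),    # ignore leading and trailing whitespace
    ]
    count = len(context_lines)
    if count == 0:
        return None
    # Try the strictest comparison first and loosen only when it finds nothing.
    for level, normalise in levels:
        wanted = [normalise(line) for line in context_lines]
        for start in range(len(file_lines) - count + 1):
            window = [normalise(line) for line in file_lines[start:start + count]]
            if window == wanted:
                return start, level
    return None


source = ["def greet():", "\tprint('Hi')  "]
print(find_context(source, ["    print('Hi')"]))
```

Returning the level that matched is useful for logging: a run that leans heavily on the loosest level is a sign the model's view of the file is drifting.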

Streaming

Streaming in Hookele is not a latency feature. It exists because long model turns drop mid-stream more often than you'd expect — TCP resets, TLS handshake failures, incomplete chunked reads, server-side response.incomplete. Without retries, every multi-minute reasoning step is one network blip away from starting over.

Hookele retries up to five times with exponential backoff, pattern-matching the exception against a list of transport-level signals (connection reset, broken pipe, TLS handshake, peer closed connection, and a few more). Each retry threads previous_response_id — the Responses API handle that resumes from a prior turn's reasoning state — so the model continues from where it stopped instead of re-deriving its plan from scratch.
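A minimal sketch of that retry loop, under stated assumptions: `send` is a hypothetical callable that performs one streamed request and accepts the `previous_response_id` to resume from, the signal list is abbreviated, and the one-second base delay is an illustrative choice rather than Hookele's actual value.

```python
import time
from typing import Callable, Optional

# Abbreviated, illustrative list of transport-level signals.
TRANSIENT_SIGNALS = (
    "connection reset",
    "broken pipe",
    "tls handshake",
    "peer closed connection",
)


def is_transient(exc: Exception) -> bool:
    """Treat an exception as retryable if its message names a transport failure."""
    message = str(exc).lower()
    return any(signal in message for signal in TRANSIENT_SIGNALS)


def call_with_retries(
    send: Callable[[Optional[str]], object],
    previous_response_id: Optional[str] = None,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send(previous_response_id), retrying transient failures with backoff."""
    for attempt in range(max_retries + 1):
        try:
            # Every attempt carries the same resume handle, so a retry
            # continues from the prior turn's state instead of starting over.
            return send(previous_response_id)
        except Exception as exc:
            if attempt == max_retries or not is_transient(exc):
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter keeps the loop testable without real delays.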

This is the quietest piece of the harness and the easiest to undervalue. It does not show up in the score directly, but on the longer tasks a single run will reconnect several times, and without state-carrying retries each of those would cost a full re-plan.

Skill classification

Claude Code and the Codex CLI both let users pre-load skill files for the model to draw on. Hookele does the same thing, but the routing is explicit: a 100-token codex-mini call reads the task instruction, sees the skill catalog (descriptions only, never bodies), and returns JSON like {"skills": ["async_cancellation"]}. The matched skill's full Markdown body gets appended to the system prompt before the loop starts.

The implementation is one f-string and one Responses API call:

response = client.responses.create(
    model="gpt-5.1-codex-mini",
    input=[{"role": "user", "content": prompt}],
    text={"format": {"type": "json_object"}},
    reasoning={"effort": "high"},
    max_output_tokens=100,
)
skills = json.loads(response.output_text).get("skills", [])
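The two pieces around that call, building the prompt from descriptions only and reading the reply defensively, might look like the sketch below. The helper names, the prompt wording, and the catalog contents are assumptions for illustration; the real prompt is not shown in this post.

```python
import json
from typing import Dict, List


def build_classifier_prompt(task: str, catalog: Dict[str, str]) -> str:
    """Build the routing prompt from skill names and descriptions (never bodies)."""
    lines = [f"- {name}: {description}" for name, description in sorted(catalog.items())]
    return (
        "Pick the skills that are relevant to the task.\n\n"
        f"Task:\n{task}\n\n"
        "Skills:\n" + "\n".join(lines) + "\n\n"
        'Reply with JSON like {"skills": ["name"]}.'
    )


def parse_skills(raw: str, catalog: Dict[str, str]) -> List[str]:
    """Return only skill names that exist in the catalog; tolerate bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    names = data.get("skills", [])
    if not isinstance(names, list):
        return []
    return [name for name in names if isinstance(name, str) and name in catalog]
```

Filtering against the catalog means a hallucinated skill name is dropped quietly instead of causing a failed file lookup later.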

A skill file is just Markdown with YAML front matter and a dense bullet list. async-cancellation-safety is a dozen lines of asyncio gotchas (shield cleanup, await cancelled tasks with return_exceptions=True, semaphore-gated tasks that never start under KeyboardInterrupt). crack-7z-hash mostly tells the model that hashcat -m 11600 is the right mode for 7z archives — the kind of magic number a small model would otherwise burn iterations brute-forcing around.

The catalog is a folder of senior-engineer notebook pages: asyncio semantics, hashcat mode tables, SQL optimization patterns, and similar. Skills are deliberately task-agnostic — the same notes would carry over to any benchmark in this space. Examples are in the repo.

On tasks where a skill applies, trajectory logs visibly compress: the model goes from "explore the problem space" to "follow the playbook and verify" without intermediate flailing.

Error recovery

Hookele does not pattern-match stack traces or auto-retry commands. The recovery loop is intentionally dumb.

When a command fails, its truncated output goes back to the model. If the model then stalls (produces text or another plan revision instead of doing something productive), I inject a single nudge: "Last command failed. Summary: ... Try an alternative and continue." The nudge fires at most once until the next successful non-plan tool call resets it.
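The "at most once until reset" rule is small enough to sketch as a tiny state holder. The class and method names here are hypothetical, and how a stall is detected is left to the caller.

```python
from typing import Optional


class NudgeState:
    """Tracks whether a failure nudge is still available."""

    def __init__(self) -> None:
        self.failure_summary: Optional[str] = None
        self.nudged = False

    def on_tool_result(self, ok: bool, is_plan: bool, summary: str = "") -> None:
        if ok and not is_plan:
            # A successful non-plan tool call clears the failure and re-arms the nudge.
            self.failure_summary = None
            self.nudged = False
        elif not ok:
            self.failure_summary = summary

    def on_stall(self) -> Optional[str]:
        """Return the nudge text if one is due, otherwise None."""
        if self.failure_summary is None or self.nudged:
            return None
        self.nudged = True
        return (
            f"Last command failed. Summary: {self.failure_summary} "
            "Try an alternative and continue."
        )
```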

That is the entire recovery layer. The model is consistently better at deciding what to try next than any heuristic I wrote, including the heuristics I wrote and then deleted.

Transport-level failures (broken pipes, TLS handshakes, response.incomplete) get handled before the model sees them. Runtime failures (Harbor restart, merge conflicts in apply_patch, pip install blocked by PEP 668) get surfaced verbatim so the model can pivot.

The system prompt

The prompt is sectioned by Task, Workflow, Tools, Build Heuristics, Documentation lookup, Editing, Verification, and Constraints — not a flat numbered list. Active skills get appended below Constraints. The one detail worth calling out: the iteration countdown ("5 steps left", "FINAL WARNING") is not in the system prompt at all. It is injected as user messages mid-loop, so the model actually feels the deadline approaching instead of reading about it once at the start and forgetting.
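A countdown injector along those lines could be as small as the following sketch. The thresholds (warn from five steps out, final warning on the last step) and the exact wording are assumptions; only the two quoted phrases come from the description above.

```python
from typing import Optional


def countdown_message(iteration: int, cap: int = 60, warn_at: int = 5) -> Optional[str]:
    """Return a user-message reminder when the iteration budget is nearly spent.

    `iteration` is the number of iterations already used.
    """
    remaining = cap - iteration
    if remaining <= 0:
        return None
    if remaining == 1:
        return "FINAL WARNING: 1 step left. Finish and call task_complete."
    if remaining <= warn_at:
        return f"{remaining} steps left."
    return None
```

The loop would append the returned string as a user message before the next model call and skip the injection when the function returns `None`.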

The numbers

Methodology: 89 tasks, 5 runs each, one Hookele harness, GPT-5.1-Codex-Mini, default 60-iteration cap, run via Harbor. Rank and neighboring entries come from a May 14, 2026 leaderboard snapshot.

Perfect (5/5): 46 tasks, from kv-store-grpc to feal-differential-cryptanalysis to sanitize-git-repo.
Partial (1–4/5): 16 tasks. pytorch-model-cli at 80%, large-scale-text-editing at 60%, path-tracing at 20%.
Zero (0/5): 27 tasks.

A sample of what actually broke in the 0/5 set, pulled from trajectory logs:

compile-compcert and make-doom-for-mips — both hit the 60-iteration cap babysitting build systems (opam/Coq for CompCert, cross-toolchains for Doom). Around step 25 the model loses track of which dependency loop it is in, and the remaining iterations evaporate on apt/dpkg debugging.
sam-cell-seg — the model correctly identified MobileSAM as the right approach and wrote the conversion script, but the SAM weights and module are not present in the container. It marked the task complete with "testing pending due to unavailable weights/module." An environment ceiling, not a planning failure.
regex-chess — on step 12 the model deemed the task infeasible and called task_complete with a summary explaining why regex alone cannot express castling, en passant, and legality. Whether it is actually infeasible or the model gave up too early is debatable; either way, the harness has no obvious lever here.
extract-moves-from-video — the model refused: "I can't help with downloading or transcribing videos from YouTube." A safety heuristic triggered on the task description, before any technical attempt.

The failure modes are more varied than a single "too hard for the model" story: real long-horizon ceilings, a missing-dependency environment ceiling, a premature give-up, and an unrelated safety refusal. Skill injection does not help in any of these categories.

The full per-task table is on the Hookele leaderboard run.

Where skills run out

Skills cut down on exploration cost. They do not change what the model can do. If a task fundamentally needs GPU, reliable OCR, or many hours of careful multi-step debugging, no amount of prompt engineering will rescue it — the 0/5 catalog above is mostly that.

Harness engineering can expose the model's capability more reliably, but it cannot manufacture capability that is not there. Small models are often leaving performance on the table because the harness around them is too noisy, too brittle, or too eager to help — and that is the part this post is really about.

Cost and time

API spend across the submitted trajectory logs was $27, summed from final_metrics.total_cost_usd. Nine trajectories in the bundle did not include final cost metrics, so the cost number covers 436 of 445 trials. That still averages about 6 cents per recorded task run. The full 5-run sweep took roughly 35.4 hours start to finish.
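The per-run figure follows directly from the numbers above; a quick check of the arithmetic, using only values stated in this post:

```python
TASKS = 89
RUNS_PER_TASK = 5
MISSING_COST_METRICS = 9      # trajectories without final cost metrics
TOTAL_COST_USD = 27.0         # summed from final_metrics.total_cost_usd

total_trials = TASKS * RUNS_PER_TASK                     # 445
recorded_trials = total_trials - MISSING_COST_METRICS    # 436
cost_per_recorded_trial = TOTAL_COST_USD / recorded_trials

print(total_trials, recorded_trials, round(cost_per_recorded_trial, 3))
# prints: 445 436 0.062
```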

Per-task spread is wide. fix-git finishes in under a minute for a fraction of a cent. compile-compcert and qemu-alpine-ssh can burn ten minutes and twenty cents each.

What I would tell someone starting

Four things kept coming back:

Delete before you add. The improvements I trusted most came from removing scaffolding — planner, state machine, criteria extractor, wrapper tools — not adding it.
Let the model handle errors. Every failure-mode heuristic I wrote underperformed the version where I just show the failure to the model and nudge once if it stalls.
Cheap pre-passes beat clever prompting. A 100-token classifier plus a folder of Markdown files moved the score more clearly than any structured-output or prompt-engineering pass.
Design the harness against a smaller model. If it only works when a frontier model papers over every rough edge, the harness is doing less than you think.

Code: https://github.com/sady4850/hookele-agent

I kept some implementation details out of this post for length; the repo has the harness code, patcher, and representative skill files.

Conductor Joins the Cloud Coding Agent Rush: Remote AI Devs Leave the Laptop

pickuma — Wed, 20 May 2026 08:51:20 +0000

For about two years, "AI coding agent" meant something that ran next to your editor: a Copilot completion, a Cursor chat panel, a Claude Code session in your terminal. The work used your CPU, held your attention, and stopped when you closed the laptop. That assumption is breaking. A separate class of tools runs the agent on remote infrastructure instead — you describe a task, close the tab, and come back to a pull request. Conductor is the newest name moving in this direction, and it lands in a category that already includes background agents from Cursor, GitHub, OpenAI, and Google.

The agent leaves your laptop

A cloud coding agent executes on a vendor's servers rather than your machine. You hand it a task — fix a failing test, migrate a module, draft an endpoint — through a web UI, a chat message, or a linked issue. It spins up an isolated sandbox, clones your repository, works through the task, and pushes a branch or opens a PR. You don't watch the edits land in your editor; you review the output.

That sounds like a small plumbing change, but it shifts what you can reasonably ask for. An IDE assistant is interactive: fast, visible, and dependent on you staying in the loop. A remote agent is asynchronous: slower to finish any single task, but it doesn't need your attention while it runs and doesn't compete with your machine for memory or battery.

A concrete version: you notice a flaky integration test on Friday afternoon. With an IDE assistant you would open the file and work through it interactively. With a cloud agent you describe the symptom, link the failing run, and close the tab — the fix shows up as a PR you review on Monday, having spent no local time on it.

The field filled out fast. GitHub's Copilot coding agent takes an assigned issue and returns a PR. Cursor's background agents run a task remotely while you keep editing locally. OpenAI's Codex cloud agent and Google's Jules both work asynchronously against your GitHub repositories. Devin markets itself as an autonomous teammate. Anthropic runs Claude Code as a hosted web service alongside the CLI. Conductor — known for orchestrating several agents at once, each in its own isolated workspace — is extending that orchestration toward remote execution. The pitch is consistent across all of them: stop treating your laptop as the place the work happens.

What changes when the work runs remotely

Moving the agent off your laptop is not just a hardware swap. It changes the unit of work you delegate and the way you supervise it. Three shifts matter most.

Parallelism stops being a resource problem. Running five local agents means five processes competing for your CPU, your RAM, and your battery — in practice you run one and wait. Running five remote agents costs you nothing locally. Your machine becomes a dashboard rather than a worker, and you can fan a bug triage out across several tasks, then review each branch as it finishes instead of serially.

Long-running work stops blocking you. A dependency upgrade that touches 30 files, or a refactor that takes 40 minutes, no longer locks the editor you need for everything else. You assign it before a meeting and review the PR after. Wall-clock time per task may be longer than a local run, but it overlaps with the rest of your day rather than consuming it.

Collaboration gets a shared surface. Because the task and its output live in shared infrastructure, a teammate can inspect an agent's branch, leave review comments, or take it over the same way they would any pull request. You can trigger a task from a phone, a Slack thread, or an issue tracker — the work no longer depends on your specific dev environment being awake, which also means a handoff across time zones stops stalling on "it's on my other machine."

A remote agent you aren't watching can spend 30 minutes confidently heading the wrong direction. The asynchronous model removes your token-by-token oversight, so a vague task description costs more — you discover the misread only when the PR arrives. Write the task like a spec: name the files, state the constraints, and define what "done" looks like.

Picking a cloud agent without locking yourself in

The tools differ most in details that are easy to skip past in a demo. Check how the sandbox is provisioned — can the agent install your dependencies and run your real test suite, or does it work against a stripped-down environment? Check what crosses into the vendor's cloud: your repository, and any secrets you grant for builds or integration tests, are processed on third-party infrastructure. Check the pricing model, because per-task, usage-based, and per-seat plans reward very different workloads. And check the review path — an agent that opens clean, scoped PRs into your existing flow is far less disruptive than one that invents its own.

Lock-in is the quiet risk, but a manageable one. Because the task description is the real interface, a task written for one agent usually transfers to another. Keep your prompts in version control or an issue tracker rather than buried in a vendor UI, and switching costs stay low.

Conductor's angle — orchestrating multiple agents rather than running one — is worth weighing here. If your work splits cleanly into independent chunks, an orchestrator that launches and tracks several remote agents at once can beat a single-agent tool. If you mostly delegate one task at a time, that coordination layer is overhead you don't need yet.

Start with a contained, low-stakes task — a flaky test, a small dependency bump, a docs fix — before you delegate anything on a critical path. You learn an agent's failure modes on work that is cheap to throw away.

If you want to test the remote model without leaving an editor you already know, Cursor is a low-risk place to start: the same IDE for interactive edits, plus background agents for delegated tasks.

The remote agent category is young, and Conductor's arrival is a signal more than a verdict. The question for tooling teams is shifting from "which assistant runs in my IDE" to "where does my agent work run, and who can pick it up" — and that is worth deciding deliberately rather than by default.

Originally published at pickuma.com.

CI/CD with GitHub Actions

Ulrich (Houngbe) — Wed, 20 May 2026 08:50:45 +0000

CI/CD with GitHub Actions: A Complete Guide

GitHub Actions changes how teams approach continuous integration and deployment by building both directly into your GitHub repository. This guide walks you through setting up a robust CI/CD pipeline.

What Is GitHub Actions?

GitHub Actions is an automation platform that lets you:

Run event-driven workflows
Automate tests, builds, and deployments
Create reusable custom actions
Integrate seamlessly with the GitHub ecosystem

Core Concepts

1. Key Components

Workflow: an automated process defined in a YAML file
Job: a set of steps that run on a runner
Step: an individual task (a shell command or an action)
Runner: a virtual machine that executes jobs
Action: a reusable application that performs a task

2. Structure of a Workflow

name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test

Complete CI Pipeline

1. Python Workflow with Tests

name: Python CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install flake8 black mypy
          pip install -r requirements.txt

      - name: Lint with flake8
        run: |
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics

      - name: Format check with black
        run: black --check .

      - name: Type check with mypy
        run: mypy src/

  test:
    needs: lint
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.10', '3.11']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest pytest-cov
          pip install -r requirements.txt

      - name: Run tests with coverage
        run: |
          pytest --cov=src --cov-report=xml --cov-report=html

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          flags: unittests
          name: codecov-umbrella

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run security scan
        uses: pypa/gh-action-pip-audit@v1.0.8
        with:
          inputs: requirements.txt

      - name: Run Bandit security check
        run: |
          pip install bandit
          bandit -r src/ -f json -o bandit-report.json

      - name: Upload security report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: security-report
          path: bandit-report.json

2. Node.js/React Workflow

name: Node.js CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        node-version: [16, 18, 20]

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linting
        run: npm run lint

      - name: Run type checking
        run: npm run type-check

      - name: Run unit tests
        run: npm test -- --coverage --watchAll=false

      - name: Run integration tests
        run: npm run test:integration

      - name: Build application
        run: npm run build

      - name: Upload build artifacts
        uses: actions/upload-artifact@v4
        with:
          name: build-files-${{ matrix.node-version }}
          path: dist/

  e2e-tests:
    needs: test
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Install Playwright
        run: npx playwright install --with-deps

      - name: Start application
        run: |
          npm run build
          npm run start &
          sleep 30

      - name: Run Playwright tests
        run: npx playwright test

      - name: Upload test results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-report
          path: playwright-report/

CD Pipeline with Deployment

1. Deploying to AWS

name: Deploy to AWS

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Build application
        run: npm run build
        env:
          REACT_APP_API_URL: ${{ secrets.API_URL }}
          REACT_APP_ENV: production

      - name: Deploy to S3
        run: |
          aws s3 sync dist/ s3://${{ secrets.S3_BUCKET }} --delete

      - name: Invalidate CloudFront
        run: |
          aws cloudfront create-invalidation \
            --distribution-id ${{ secrets.CLOUDFRONT_DISTRIBUTION_ID }} \
            --paths "/*"

      - name: Notify Slack
        if: always()
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          channel: '#deployments'
          webhook_url: ${{ secrets.SLACK_WEBHOOK_URL }}

2. Docker Deployment

name: Docker Build and Deploy

on:
  push:
    branches: [main]
    tags: ['v*']

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Deploy to staging
        if: github.ref == 'refs/heads/main'
        run: |
          echo "Deploying to staging environment"
          # Deployment commands go here

Advanced Workflows

1. Matrix Strategy with Conditions

name: Cross-platform Testing

on: [push, pull_request]

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: ['3.9', '3.10', '3.11']
        exclude:
          - os: windows-latest
            python-version: '3.9'

    runs-on: ${{ matrix.os }}

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies (Unix)
        if: runner.os != 'Windows'
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Install dependencies (Windows)
        if: runner.os == 'Windows'
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
        shell: cmd

      - name: Run tests
        run: pytest
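
As a rough illustration of how the matrix above expands into jobs, the sketch below reproduces the combination and exclusion logic in Python. The function name `expand_matrix` is hypothetical and this is not how GitHub implements it; it only mirrors the documented behaviour that an `exclude` entry removes every combination matching all of its keys.

```python
from itertools import product
from typing import Dict, Iterable, List


def expand_matrix(
    axes: Dict[str, List[str]], exclude: Iterable[Dict[str, str]] = ()
) -> List[Dict[str, str]]:
    """Return every axis combination that no exclude entry matches."""
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*(axes[k] for k in keys))]
    excluded = list(exclude)
    return [
        combo
        for combo in combos
        if not any(all(combo.get(k) == v for k, v in rule.items()) for rule in excluded)
    ]


jobs = expand_matrix(
    {
        "os": ["ubuntu-latest", "windows-latest", "macos-latest"],
        "python-version": ["3.9", "3.10", "3.11"],
    },
    exclude=[{"os": "windows-latest", "python-version": "3.9"}],
)
print(len(jobs))  # prints: 8
```

Three operating systems times three Python versions gives nine combinations, and the single exclusion leaves eight jobs. With `fail-fast: false`, the other jobs keep running when one of them fails.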

2. Multi-Environment Deployment

name: Multi-Environment Deploy

on:
  push:
    branches: [main, develop]

jobs:
  deploy-staging:
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment:
      name: staging
      url: https://staging.example.com

    steps:
      - name: Deploy to Staging
        run: |
          echo "Deploying to staging"
          # Staging deployment logic

  deploy-production:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://example.com

    steps:
      - name: Deploy to Production
        run: |
          echo "Deploying to production"
          # Production deployment logic

Custom Actions

1. Composite Action

# .github/actions/setup-node-cache/action.yml
name: 'Setup Node with Cache'
description: 'Setup Node.js with dependency caching'

inputs:
  node-version:
    description: 'Node.js version'
    required: false
    default: '18'
  working-directory:
    description: 'Working directory'
    required: false
    default: '.'

runs:
  using: 'composite'
  steps:
    - name: Setup Node.js
      uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
        cache: 'npm'
        cache-dependency-path: ${{ inputs.working-directory }}/package-lock.json

    - name: Install dependencies
      working-directory: ${{ inputs.working-directory }}
      run: npm ci
      shell: bash

2. Using the Action

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js with cache
        uses: ./.github/actions/setup-node-cache
        with:
          node-version: '18'
          working-directory: './frontend'

Best Practices

1. Security

# Using secrets
steps:
  - name: Deploy with secrets
    env:
      API_KEY: ${{ secrets.API_KEY }}
      DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
    run: |
      echo "API_KEY is set: ${API_KEY:+yes}"
      # Use the secrets securely

2. Performance Optimization

# Dependency caching
- name: Cache dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.npm
      node_modules
    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-npm-

# Running jobs in parallel
jobs:
  lint:
    runs-on: ubuntu-latest
  test:
    runs-on: ubuntu-latest
  build:
    needs: [lint, test]
    runs-on: ubuntu-latest

3. Monitoring and Notifications

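  - name: Notify on failure
    # context.issue.number is only populated for issue and pull_request events
    if: failure() && github.event_name == 'pull_request'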
    uses: actions/github-script@v7
    with:
      script: |
        github.rest.issues.createComment({
          issue_number: context.issue.number,
          owner: context.repo.owner,
          repo: context.repo.repo,
          body: '❌ Pipeline failed! Please check the logs.'
        })

Conclusion

GitHub Actions offers a powerful, flexible solution for automation:

Native integration with GitHub
A rich ecosystem of reusable actions
Automatic scaling of runners
Cross-platform support
Fine-grained permission management

A well-designed pipeline improves code quality, reduces deployment risk, and speeds up the development cycle. The initial investment in configuration quickly pays for itself through fewer errors and the automation of repetitive tasks.

kop — A Modern Kubernetes Terminal UI Built with Python

vegaoqiang — Wed, 20 May 2026 08:49:16 +0000

I spend most of my day managing Kubernetes clusters from the terminal.
Like many engineers, I rely heavily on:

kubectl
SSH
log streaming
pod debugging
cluster troubleshooting

Over time, I realized I wanted something that felt more interactive and efficient than constantly switching between terminal commands and browser dashboards.

So I started building kop.

What is kop?

kop is a modern terminal UI for Kubernetes built with Python.
It provides an interactive Kubernetes experience directly inside the terminal while remaining lightweight, keyboard-driven, and SSH-friendly.

Project site:

GitHub

kop Documentation

Why Another Kubernetes TUI?

Tools like K9s are already excellent. But while using existing tools, I still wanted:

A cleaner modern UI
Better keyboard workflows
Faster navigation
More extensible architecture
Improved terminal interaction
A smoother troubleshooting experience

kop is my attempt to explore those ideas.

Features

Interactive Kubernetes Resource Navigation
Browse resources interactively:

Pods
Deployments
StatefulSets
DaemonSets
Services
Nodes
Namespaces
ConfigMaps
Secrets
Events
...

Screenshots

Cutting Agent Token Costs from the CLI (2026 Guide)

Tobias Hoffmann — Wed, 20 May 2026 08:46:49 +0000

A CLI coding agent looks "free" until the bill arrives. Point Claude Code or Codex at a repository and ask it to refactor a module, and within ten minutes it may have read forty files, run the tests three times, and spent a six-figure token count on context it never actually needed. Multiply that by an eight-person team using agents all day and the cost grows quickly. The good news: most of the token waste in coding agents can be cut with command-line habits, without changing the model or giving up output quality.

TL;DR

To lower agent cost, shrink the context before it reaches the model:

Limit the working set to specific files and directories.
Keep memory files such as CLAUDE.md short.
Use /compact or /clear in long sessions.
Enable prompt caching for stable prefixes.
Route simple subtasks to a cheaper model.
Filter test, install, and diff output.
Measure the tokens and cost of every run.

Introduction

The problem usually shows up in one of two ways: either you hit the weekly or session limit in the middle of a task, or the API bill arrives at the end of the month and someone asks, "Why is the AI assistant this expensive?"

The root cause is the same: by default, CLI agents carry far too much context. They read the whole file when they need ten lines of code, resend the conversation history on every turn, append command output to the context in raw form, and send the same system prompt thousands of times.

None of this is inevitable. A refactor that needs to reason over 2,000 tokens of code does not need 180,000 tokens of context. The gap between those two numbers is your room for savings.

This guide works through the following topics hands-on:

Where tokens go in a CLI agent run
Context hygiene and memory files
Prompt caching
Model routing
Trimming tool output and file ingestion
Measuring cost per run

The examples use Claude Code and Codex, but the same principles apply to most coding agents built on token-priced APIs.

API debugging is another source of cost. An agent connected to an unreliable internal API spends tokens on every attempt: it retries, reads error bodies, rescans the documentation, and falls back into the same loop.

💡 If your agents work with APIs, designing those APIs in Apidog, testing them against mock data, and validating the contract before handing them to the agent reduces expensive trial-and-error loops. The agent works against an API contract with expected behavior instead of a live endpoint full of surprises.

CLI Ajan Çalışmasında Token’lar Nereye Gider?

Bir ajan dönüşünde modele giriş yükü gönderilir ve modelden çıktı alınır. İkisi için de ödeme yaparsınız. Çoğu sağlayıcıda çıktı token’ları, giriş token’larından daha pahalıdır.

Örnek olarak bir model ailesinde:

Giriş: milyon token başına yaklaşık 3$
Çıkış: milyon token başına yaklaşık 15$
Daha küçük modelde giriş: yaklaşık 1$
Daha küçük modelde çıkış: yaklaşık 5$

Bunları sabit fiyat olarak değil, oranları anlamak için örnek kabul edin. Güncel fiyatları her zaman sağlayıcının fiyatlandırma sayfasından kontrol edin.

Tipik bir ajan çalışmasında giriş token’larını büyüten kalemler şunlardır:

Sistem istemi ve araç tanımları: Ajan talimatları ve araç JSON şemaları. Her dönüşte tekrar gönderilir.
Bellek/proje dosyaları: CLAUDE.md gibi dosyalar, depo kuralları ve kalıcı talimatlar.
Konuşma geçmişi: Önceki kullanıcı mesajları, model yanıtları, araç çağrıları ve araç sonuçları.
Okunan dosya içerikleri: Ajanın açtığı kaynak dosyalar.
Araç çıktısı: Test logları, npm install çıktısı, git diff, stack trace’ler.

En kritik nokta: Konuşma geçmişi her dönüşte yeniden oynatılır.

30 dönüşlük bir oturum, tek dönüşün 30 katı değildir; büyüyen bir ön ekin tekrar tekrar gönderilmesine benzer. Bu yüzden uzun ve dağınık oturumlar pahalıdır.
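To make the growing-prefix effect concrete, here is a minimal sketch that adds up the input tokens billed over a session. The function name and the numbers (a 5,000-token fixed prefix, 2,000 tokens of new history per turn) are illustrative assumptions, not measurements, and the model ignores prompt caching.

```python
def cumulative_input_tokens(turns: int, fixed_prefix: int, added_per_turn: int) -> int:
    """Total input tokens sent over a session in which history is replayed each turn.

    fixed_prefix   -- system prompt, tool definitions, memory file (assumed constant)
    added_per_turn -- tokens each turn appends to the history (assumed constant)
    """
    total = 0
    history = 0
    for _ in range(turns):
        total += fixed_prefix + history  # the whole prefix is resent on this turn
        history += added_per_turn        # this turn's messages and tool output join the history
    return total


single_turn = cumulative_input_tokens(1, fixed_prefix=5_000, added_per_turn=2_000)
session = cumulative_input_tokens(30, fixed_prefix=5_000, added_per_turn=2_000)
print(f"one turn: {single_turn} tokens, 30 turns: {session} tokens, "
      f"ratio: {session / single_turn:.0f}x")
```

Under these assumptions the session sends far more than 30 times the single-turn input, because the history term grows quadratically with the number of turns.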

For more detail on session and limit accounting, the explainer on how the Claude Code token window resets is a useful companion.

1. Context Hygiene and Memory Files

The cheapest token is the one you never send.

Limit the working set up front

Don't open the agent at the repo root and say:

claude "refactor the billing logic"

This pushes the agent into unnecessary exploration.

Instead, name the files and the scope:

claude "refactor the retry logic in src/payments/retry.ts and its test file to use exponential backoff"

A better prompt includes:

The files to change
The goal of the change
The test scope
The areas that should not be read

Example:

claude "
Use only these files:
- src/payments/retry.ts
- src/payments/retry.test.ts

Goal:
Rework the retry logic to use exponential backoff.

Constraints:
- Do not change the public API signature.
- Do not scan other directories.
- Run only the failing tests.
"

Keep the memory file short

CLAUDE.md or an equivalent project memory file can be added to the context on every turn. As the file grows, the fixed cost of every agent call rises.

Check the approximate token count (this assumes roughly four bytes per token, which is only a rough heuristic):

wc -c CLAUDE.md | awk '{print "≈", int($1/4), "tokens/turn"}'

A good CLAUDE.md contains:

# Project Rules

## Commands

- Test: npm test --silent
- Lint: npm run lint
- Typecheck: npm run typecheck

## Code Rules

- Do not change public API signatures.
- Ask before adding a new dependency.
- A refactor is not done until tests are added.

## Docs

- API contracts: docs/api.md
- Payment flow: docs/payments.md

Bad example:

# About the Project

This project started in 2021...
The entire architecture history goes here...
All endpoint documentation...
All database tables...
All onboarding notes...

A memory file is not a wiki. Keep the frequently used rules and link to the detailed docs.

Compact or clear long sessions

When a task is done, don't move on to a new task in the same conversation. The old context raises the cost of the new task.

In Claude Code:

/compact

This condenses the history into a short summary.

If you are starting a new, unrelated task:

/clear

Rule of thumb:

Same logical task: keep going.
New task: /compact or /clear.

The patterns in Claude Code workflows rest on this scoping habit.

Use an ignore file

Keep the agent from reading these paths:

node_modules/
dist/
build/
coverage/
.next/
.vendor/
*.lock

If the agent never sees node_modules/ or dist/, it cannot burn tokens reading those files.

2. Prompt Caching: Don't Pay Twice for the Same Prefix

Prompt caching lets the provider cache stable prefixes such as the system prompt, tool definitions, and repo conventions.

The economics are simple:

The first write costs slightly more than normal input.
Subsequent reads are discounted by roughly 90%.
The savings are substantial for large, fixed prefixes.

Put stable content first and variable content after it:

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# SYSTEM_PROMPT, REPO_CONVENTIONS and user_task are strings defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT + REPO_CONVENTIONS,
            # Everything up to and including this block is the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": user_task,
        }
    ],
)

Measure the usage:

u = response.usage

print("cache write:", u.cache_creation_input_tokens)
print("cache read :", u.cache_read_input_tokens)
print("fresh input:", u.input_tokens)

Things to watch out for:

Don't put variable data before the cache breakpoint.
Don't add a timestamp to the system prompt.
Run related jobs close together in time.
The prefix must be byte-stable.

Example scenario:

System prompt + repo conventions: 8,000 tokens
Calls per day: 60

Without caching you pay full price for those 8,000 tokens 60 times. With caching, most calls after the first write are read at the discounted rate.
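The scenario above can be worked through with a few lines of arithmetic. This is a sketch under stated assumptions: the per-token rates are the illustrative ones used elsewhere in this guide ($3.00 input, $3.75 cache write, $0.30 cache read per million tokens), and `cache_writes` is a hypothetical parameter standing in for how often the cache expires and has to be rewritten during the day.

```python
INPUT_RATE = 3.00 / 1_000_000   # illustrative rates, not current prices
CACHE_WRITE = 3.75 / 1_000_000
CACHE_READ = 0.30 / 1_000_000


def prefix_cost_uncached(prefix_tokens: int, calls: int) -> float:
    """Cost of resending the prefix at the normal input rate on every call."""
    return prefix_tokens * calls * INPUT_RATE


def prefix_cost_cached(prefix_tokens: int, calls: int, cache_writes: int = 1) -> float:
    """Cost when `cache_writes` calls write the cache and the rest read from it."""
    reads = calls - cache_writes
    return prefix_tokens * (cache_writes * CACHE_WRITE + reads * CACHE_READ)


uncached = prefix_cost_uncached(8_000, 60)
best_case = prefix_cost_cached(8_000, 60, cache_writes=1)    # cache stays warm all day
spread_out = prefix_cost_cached(8_000, 60, cache_writes=10)  # cache expires nine times

print(f"uncached:   ${uncached:.4f}")
print(f"best case:  ${best_case:.4f}")
print(f"spread out: ${spread_out:.4f}")
```

The gap between the last two lines is the reason for running related jobs close together: every expired cache entry turns a discounted read back into a write.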

On the OpenAI side, similar discounts for cached input can be applied automatically on supported models. The settings may differ, but the principle is the same.

For routing and free-tier options on the Codex side, the guide to running GPT-5.5 for free through Codex complements this strategy.

3. Model Routing: Cheap Model for Simple Work

Not every task needs the most capable model.

These can usually be handled by a small model:

Writing a commit message
Summarising a diff
Drafting a changelog entry
Generating a simple test skeleton
Renaming a variable
Explaining a lint error

These may need a stronger model:

Architectural decisions
Complex refactors
Multi-file debugging
Performance analysis
Security-critical changes

Choosing the model from the CLI:

claude --model haiku "write a conventional commit message for the staged diff"

claude --model sonnet "redesign the cache layer for the payments service"

Default strategy:

# Cheap model for simple tasks
alias ccheap='claude --model haiku'

# Strong model for hard tasks
alias cstrong='claude --model sonnet'

Usage:

ccheap "summarise this diff in 5 bullets"

cstrong "analyse the retry and idempotency flow in src/payments"

Most teams do the opposite: they run everything on the expensive model and overpay for simple work.

The same rule applies in systems that support sub-agents:

Parent agent: strong model, few decisions
Sub-agent: cheap model, narrow task, short context

For example:

Sub-agent task:
Find the retry-related functions in the src/payments directory only.
Do not change code.
Return a summary of at most 10 bullets.

The parent agent sees only the summarised result; it does not have to carry the whole search context.

For these delegation patterns, the autonomous-loop examples in the guide to the goal command in Codex and Claude Code are useful.

If you are on a plan with limits, model routing affects not only cost but also how long your allowance lasts. The Claude Code weekly limit increase may help, but correct routing is still the critical lever.

4. Trim Tool Output

Every command the agent runs returns text. That text is added to the context and resent on subsequent turns.

Noisy commands are expensive:

npm test
npm install
git diff
pytest

Run commands quietly

# Noisy
npm test

# Shorter output (assumes a test runner that accepts --reporter=dot, such as Mocha or Vitest)
npm test --silent -- --reporter=dot

# Noisy
npm install

# Shorter output
npm install --silent --no-audit --no-fund

Filter the output

pytest -q 2>&1 | tail -n 30

git diff --stat

npm test 2>&1 | grep -E "(FAIL|✗|Error)" | head -n 20

Most of the time the agent does not need the whole log, only this:

Did the tests pass?
Which test failed?
What is the error message?
What is the tail end of the relevant stack trace?
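If the agent's commands are launched from your own tooling rather than typed into a shell, the same filtering can be done in code. The sketch below mirrors the `grep ... | head -n 20` and `tail -n 30` pipelines shown above; `trim_output` and its parameters are hypothetical names, not part of any agent's API.

```python
import re

# Same pattern as the grep example: lines that mention a failure.
FAILURE = re.compile(r"FAIL|✗|Error")


def trim_output(text: str, max_matches: int = 20, tail_lines: int = 30) -> list[str]:
    """Return failure lines if there are any, otherwise the tail of the log."""
    lines = text.splitlines()
    matches = [line for line in lines if FAILURE.search(line)]
    if matches:
        return matches[:max_matches]   # like `grep -E ... | head -n 20`
    return lines[-tail_lines:]         # like `tail -n 30`


log = "ok 1 - parses config\nFAIL retry.test.ts > backs off\nok 2 - loads env\nError: expected 3 retries, got 1"
for line in trim_output(log):
    print(line)
```

Falling back to the tail when nothing matches keeps the pass/fail summary that most test runners print last.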

Limit large diffs

Don't have the agent run this:

git diff

Get a summary first:

git diff --stat

Then have it open only the relevant file:

git diff -- src/payments/retry.ts

Lock files are especially dangerous:

git diff -- package-lock.json

This output can run to thousands of lines. The agent generally does not need all of it.

5. Use Targeted Reads Instead of Whole Files

Reading an entire 1,500-line file to change one function is wasteful.

A better approach:

grep -n "function retryPayment" src/payments/retry.ts

Then read the relevant window:

sed -n '120,220p' src/payments/retry.ts

Spell this out in the agent prompt:

Do not read the whole file.
First locate the retry logic with grep.
Read only about 100 lines around the relevant function.

On large files the difference can be significant:

Whole file: ~18,000 tokens
Targeted window: ~800 tokens

6. Limit RAG and Document Retrieval

If your agent uses codebase search or document RAG, the number and size of the retrieved chunks directly affect cost.

Bad default:

top_k = 50
chunk_size = 800

If chunk_size is measured in tokens, that is 50 × 800, so close to 40,000 tokens of context can land in the model on a single query.

A better starting point:

top_k = 8
chunk_size = 200
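The two settings can be compared with simple arithmetic. This sketch assumes `chunk_size` is measured in tokens and prices the retrieved context at the illustrative $3 per million input tokens used elsewhere in this guide; `retrieved_tokens` and `retrieval_cost` are hypothetical helper names.

```python
INPUT_RATE = 3.00 / 1_000_000  # illustrative rate, not a current price


def retrieved_tokens(top_k: int, chunk_size: int) -> int:
    """Upper bound on tokens added to the prompt by one retrieval."""
    return top_k * chunk_size


def retrieval_cost(top_k: int, chunk_size: int) -> float:
    """Input cost of that retrieved context for a single query."""
    return retrieved_tokens(top_k, chunk_size) * INPUT_RATE


for label, top_k, chunk_size in [("bad default", 50, 800), ("better start", 8, 200)]:
    tokens = retrieved_tokens(top_k, chunk_size)
    print(f"{label}: {tokens} tokens, about ${retrieval_cost(top_k, chunk_size):.4f} per query")
```

The per-query figure looks small, but retrieved context also stays in the conversation history, so it is paid for again on later turns.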

Rule of thumb:

Fewer but more relevant chunks
Short chunks
An index filtered to the question
Irrelevant document collections left out

Example prompt:

Search only under docs/payments and src/payments.
Do not use the general README, changelog, or onboarding docs.
Return at most 8 short results.

7. Measure the Cost of Every Run

You cannot reduce a cost you do not measure.

Record the usage data from the API response:

u = response.usage

INPUT_RATE  = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
CACHE_READ  = 0.30 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000

# The cache fields can be missing (None) when caching is not in use, hence "or 0".
cache_read = u.cache_read_input_tokens or 0
cache_write = u.cache_creation_input_tokens or 0

cost = (
    u.input_tokens * INPUT_RATE +
    u.output_tokens * OUTPUT_RATE +
    cache_read * CACHE_READ +
    cache_write * CACHE_WRITE
)

print(
    f"run cost ≈ ${cost:.4f} "
    f"(input={u.input_tokens}, "
    f"output={u.output_tokens}, "
    f"cache read={cache_read})"
)

If you use the CLI:

claude /cost

For better tracking, wrap the agent call in a small wrapper:

#!/usr/bin/env bash

TASK="$1"
START=$(date -Iseconds)

claude "$TASK"

END=$(date -Iseconds)

echo "$START,$END,$TASK" >> agent-runs.csv

In a more advanced version, also record:

Model
Project
Task label
Input tokens
Output tokens
Cache read/write tokens
Approximate cost
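One way the more advanced version could look, sketched in Python with the standard `csv` module. The column names and the `log_run` helper are assumptions made for illustration; the token counts would come from `response.usage` as in the snippet above. Using `csv.DictWriter` also means a task label containing commas is quoted properly, which a plain `echo` into a CSV file does not do.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = [
    "timestamp", "model", "project", "task_label",
    "input_tokens", "output_tokens",
    "cache_read_tokens", "cache_write_tokens", "approx_cost_usd",
]


def log_run(out, *, model, project, task_label, input_tokens, output_tokens,
            cache_read_tokens=0, cache_write_tokens=0, approx_cost_usd=0.0,
            write_header=False):
    """Append one agent run to an open CSV file (or any file-like object)."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "model": model,
        "project": project,
        "task_label": task_label,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_tokens": cache_read_tokens,
        "cache_write_tokens": cache_write_tokens,
        "approx_cost_usd": f"{approx_cost_usd:.4f}",
    })


# Demo against an in-memory buffer; in practice pass open("agent-runs.csv", "a", newline="").
buf = io.StringIO()
log_run(buf, model="haiku", project="payments", task_label="commit message, staged diff",
        input_tokens=1200, output_tokens=80, approx_cost_usd=0.0048, write_header=True)
print(buf.getvalue())
```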

Split your API keys as well:

One key per project
One key per agent type
One key per environment

That way spending does not disappear into a single total.

Tactic Comparison

Tactic	Typical token savings	Effort
Limiting the working set	30–60% of input per run	Low
Short, stable memory file	5–15% on every turn	Low
`/compact` or `/clear`	40–80% in long sessions	Low
Prompt caching	~90% on the cached prefix	Medium
Model routing	50–80% on routed subtasks	Medium
Quiet/filtered tool output	20–50% in tool-heavy runs	Low
Targeted file reads	70–95% on large files	Low
Constrained RAG retrieval	30–60% in RAG-heavy agents	Medium
Per-run cost measurement	0% directly; enables the others	Low

The savings figures are illustrative. Your actual gain depends on how much waste you have today.

Actionable Checklist

To get started, do these in order:

[ ] Get CLAUDE.md under 1,000 tokens.
[ ] Ignore node_modules, dist, build, coverage, and lock files.
[ ] State the file/directory scope in agent prompts.
[ ] Switch test commands to quiet mode.
[ ] Use git diff --stat before git diff.
[ ] Use /compact in long sessions.
[ ] Start new tasks with /clear.
[ ] Route jobs like commit messages and diff summaries to the cheap model.
[ ] Turn on prompt caching for the stable system prompt.
[ ] Keep a token and cost record for every run.

Conclusion

Most agent token cost comes not from the model but from the way you work. Unnecessary context, long conversation history, raw tool output, and the wrong model choice inflate the bill.

Apply the low-effort steps first:

Narrow the scope.
Shrink the memory file.
Quiet the tool output.
Clear long sessions.
Move simple work to the cheap model.

Then add prompt caching and cost measurement. Together, these make agent costs manageable without lowering output quality.

How to Cut Agent Token Costs from the CLI (2026 Guide)

Thanawat Wongchai — Wed, 20 May 2026 08:45:04 +0000

CLI coding agents such as Claude Code and Codex speed up your work, but token costs spike as soon as you let them read the whole repo, rerun tests over and over, and send the entire conversation history back to the model on every turn. The good news is that you can cut costs from the command line: limit context, keep output short, use prompt caching, route easy work to a cheaper model, and measure the real cost per run.

TL;DR

If you want to cut token costs for CLI coding agents, start here:

Limit scope before the agent starts work
Keep CLAUDE.md or the memory file short
Use /compact or /clear when you switch tasks
Enable prompt caching for stable prefixes
Use a cheaper model for low-risk subtasks
Reduce output from test runners, install commands, logs, and diffs
Measure token usage and cost per run

The goal is to send the model only the context it needs, not the whole repo and the full history every time.

Why CLI agents burn so many tokens

The main problem is not the model alone but the agent's default behaviour:

Reading a whole file when only one function is needed
Resending the system prompt, tool definitions, and repo context on every turn
Replaying the entire conversation history
Dumping logs from the test runner or command line back into context
Using an expensive model for easy work such as commit messages or changelogs

For example, a refactor that really needs about 2,000 tokens of code can turn into a 180,000-token request if the agent reads several files, runs tests in verbose mode, and drags a long history along.

Another hidden cost is API debugging. An agent calling an unstable internal API will retry, read errors, reread the docs, and loop several times. Every loop is billed at the full token price.

💡 If your agent has to work with APIs, design, mock, and test those APIs in Apidog first, then let the agent write code against a predictable contract. This cuts down on trial and error against a live endpoint that may throw errors and burn tokens needlessly.

Where tokens go in a real run

One agent turn has both input tokens and output tokens. You pay for both sides, and output tokens usually cost several times more than input tokens.

The input payload usually consists of:

System prompt and tool definitions: the agent's instructions and the tools' JSON schemas, usually resent on every turn.
Memory/project files: for example CLAUDE.md, coding conventions, repo rules.
Conversation history: all earlier user messages, model responses, tool calls, and tool output.
Files the agent reads: reading a 1,200-line file once can come to 12,000–18,000 tokens.
Tool output: logs from npm install, test failures, stack traces, git diff, lockfile diffs.

The key point is that conversation history is resent on every turn, so a 30-turn session does not just cost 30 times one turn; it gets more expensive as the prefix keeps growing.

To understand sessions and windows better, see How to reset the Claude Code token window.

1. Limit scope before the agent starts

The cheapest token is the one you do not have to send. Don't start with a broad prompt like:

claude "refactor the billing logic"

Name the files and the scope explicitly:

claude "refactor the retry logic so it uses exponential backoff,
only in src/payments/retry.ts and src/payments/retry.test.ts"

If the agent has to explore the codebase, restrict the directory:

claude "find the payment retry implementation under src/payments only,
then propose the minimal change"

Guidelines:

One prompt should have one main task
Name the file or directory
Say what is off limits, such as migrations, generated files, and lockfiles
Avoid instructions like “scan the whole repo”

2. Keep the memory file short

A file like CLAUDE.md is loaded into context very often. If it turns into a 4,000-token wiki, the team pays for it again on every turn.

Check the size roughly:

wc -c CLAUDE.md | awk '{print "≈", int($1/4), "tokens per turn"}'

Keep only:

# CLAUDE.md

## Commands
- Run tests: npm test --silent -- --reporter=dot
- Typecheck: npm run typecheck
- Lint: npm run lint

## Rules
- Do not edit generated files.
- Do not modify package-lock.json unless dependency changes are requested.
- Keep changes minimal and scoped to the requested files.

## References
- API contracts: docs/api/
- Architecture notes: docs/architecture.md

Leave out:

The full set of onboarding docs
Long changelogs
Entire architecture documents
Large numbers of code examples that are not needed for every task

If part of the documentation is used once a month, keep it in a separate file and have the agent read it when needed.

3. Use `/compact` or `/clear` when you switch tasks

Long sessions are a major token sink, because every new turn has to carry all the old history.

In Claude Code:

/compact

Use it when the session is long but you still want to keep a summary of the current task.

/clear

Use it when starting a new, unrelated task.

Simple rules:

One logical task per session
After a refactor is done, don't reuse the same session to write docs
Before starting a new task, run /clear
If the task is still going but the history is long, run /compact

See more workflows in Claude Code workflows.

4. Use ignore files to cut out files the agent should not see

Have the agent avoid files that are generated or should not be edited, such as:

node_modules/
dist/
build/
coverage/
.next/
.nuxt/
*.log
package-lock.json
pnpm-lock.yaml
yarn.lock

If the agent does not see dist/, coverage/, or lockfile diffs, it spends no tokens on them.

Add a rule to the memory file too:

Do not read or edit generated files, build output, coverage output, or dependency directories.

5. Enable prompt caching for stable prefixes

Prompt caching cuts the cost of a prefix that is sent repeatedly, such as the system prompt, tools, and repo conventions.

The idea:

Put stable data at the front
Put frequently changing input at the back
Cache only the prefix that does not change

An example, if you call the API yourself:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT + REPO_CONVENTIONS,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": user_task,
        }
    ],
)

u = response.usage
print("cache write:", u.cache_creation_input_tokens)
print("cache read :", u.cache_read_input_tokens)
print("fresh input:", u.input_tokens)

Caveats:

The prefix must be byte-stable
Don't put a timestamp or runtime data before the cache boundary
Changing a single character before the cache point can cause a cache miss
Group related jobs so they run close together and reuse a cache that is still warm

Prompt caching is a very good fit for agents that reuse the same system prompt and repo rules dozens of times a day.

If you use Codex or OpenAI models that offer a cached-input discount, the principle is similar even though the implementation details differ. Further reading: Running GPT-5.5 for free through Codex.

6. Route easy work to a cheaper model

Not every task needs the main model. These can usually go to a small model:

commit message
changelog
summarize diff
generate boilerplate test
explain lint error
straightforward variable renames

Example:

# Easy task, cheap model
claude --model haiku "write a conventional commit message for the staged diff"

# Hard task, strong model
claude --model sonnet "redesign the caching layer for the payments service"

A good approach is to default to the cheaper model and escalate only when needed, rather than using the expensive model for everything.

If your framework supports sub-agents, use a small model for search and summarisation, then pass a short result back to the parent agent running the expensive model.
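The "default cheap, escalate when needed" rule can be written down as a small routing table. This is a sketch: the task-type labels and the `pick_model` and `build_command` helpers are hypothetical, and only the `claude --model <name> "<prompt>"` form is taken from the examples above.

```python
# Task types that justify the stronger model; everything else defaults to the cheap one.
# The labels are illustrative, not an official taxonomy.
STRONG_TASKS = {
    "architecture",
    "complex_refactor",
    "multi_file_debugging",
    "performance_analysis",
    "security_change",
}


def pick_model(task_type: str, cheap: str = "haiku", strong: str = "sonnet") -> str:
    """Return the model alias to use for a given task type."""
    return strong if task_type in STRONG_TASKS else cheap


def build_command(task_type: str, prompt: str) -> list[str]:
    """Build the argv list for the CLI call, ready for subprocess.run()."""
    return ["claude", "--model", pick_model(task_type), prompt]


print(build_command("commit_message",
                    "write a conventional commit message for the staged diff"))
print(build_command("complex_refactor",
                    "redesign the caching layer for the payments service"))
```

Keeping the escalation list explicit makes the expensive path an opt-in decision that can be reviewed, instead of the silent default.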

Read more about autonomous-loop patterns in The goal command in Codex and Claude Code.

Routing does not only save money; it also stretches your quota on plans with usage limits. Even with the Claude Code weekly limit increase, using an expensive model for small jobs still burns quota needlessly.

7. Keep command output short

Tool output quietly eats the budget, because every line a command prints may be sent back into context.

Switch from a verbose command:

npm test

to a short one:

npm test --silent -- --reporter=dot

Install dependencies with less noise:

npm install --silent --no-audit --no-fund

Cap test output:

pytest -q 2>&1 | tail -n 30

Look at a diff summary first:

git diff --stat

Filter for errors only:

npm test 2>&1 | grep -E "(FAIL|✗|Error)" | head -n 20

When the agent has to debug a test failure, it usually needs only:

which test failed
the error message
the top of the stack trace
expected vs actual

It does not need the full 5,000-line log.

8. Have the agent read only the relevant part

Instead of letting the agent read a 1,500-line file, tell it to search for the symbol first:

claude "find the function that handles payment retry,
read only that function and nearby tests, then suggest the minimal patch"

Or use the shell to trim the context:

grep -n "function retryPayment" src/payments/retry.ts
sed -n '120,220p' src/payments/retry.ts

If you use ripgrep:

rg "retryPayment|exponentialBackoff|RetryPolicy" src/payments

The goal is for the model to see 500–1,000 tokens of context instead of an entire 18,000-token file.

9. Limit retrieval/RAG

If the agent uses code search or RAG over documents, cap the number of chunks and the chunk size.

An example of a configuration worth setting:

{
  "retrieval": {
    "max_chunks": 10,
    "chunk_size_tokens": 200,
    "include_full_file": false
  }
}

Principles:

Short, on-point chunks beat many long chunks
Don't pull the full file unless necessary
Let ranking select only the truly relevant context
Log the number of retrieved tokens to measure the effect

You pay for every token retrieved, even if the model never uses it in its answer.
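The principles above can be enforced at the point where retrieved chunks are assembled into the prompt. The sketch below is one possible shape, not any framework's API: `select_chunks`, `estimate_tokens`, and the 2,000-token budget are assumptions, and the four-characters-per-token estimate is the same rough heuristic as the `wc -c` check earlier.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token."""
    return max(1, len(text) // 4)


def select_chunks(ranked_chunks, max_chunks: int = 10, token_budget: int = 2_000):
    """Pick chunks in rank order until the chunk cap or the token budget is reached.

    ranked_chunks -- list of (score, text) pairs, best match first.
    Returns (selected_texts, tokens_used) so the caller can log tokens_used.
    """
    selected, used = [], 0
    for _score, text in ranked_chunks:
        if len(selected) >= max_chunks:
            break
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # too big for what is left; a smaller chunk further down may still fit
        selected.append(text)
        used += cost
    return selected, used


ranked = [(0.91, "a" * 800), (0.84, "b" * 8_000), (0.77, "c" * 400)]
chunks, tokens_used = select_chunks(ranked, max_chunks=10, token_budget=400)
print(f"kept {len(chunks)} of {len(ranked)} chunks, about {tokens_used} retrieved tokens")
```

Returning the token count alongside the chunks makes the "log retrieved tokens" step a one-line addition at the call site.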

10. Measure cost per run

Don't just look at the monthly bill. Track cost per task, for example:

daily refactor run
PR review run
test-fix run
API integration run

If you call the API yourself, record usage:

u = response.usage

INPUT_RATE  = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
CACHE_READ  = 0.30 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000

cost = (
    u.input_tokens * INPUT_RATE +
    u.output_tokens * OUTPUT_RATE +
    u.cache_read_input_tokens * CACHE_READ +
    u.cache_creation_input_tokens * CACHE_WRITE
)

print(
    f"run cost ≈ ${cost:.4f} "
    f"(in={u.input_tokens} out={u.output_tokens} "
    f"cache_read={u.cache_read_input_tokens})"
)

If you use the CLI, use these methods:

# Check the session cost
claude /cost

Or split API keys by project/agent to see usage in the provider console.

Or wrap the command in a script that logs to CSV:

#!/usr/bin/env bash

TASK="$1"
START=$(date -Iseconds)

claude "$TASK"

END=$(date -Iseconds)
echo "$START,$END,$TASK" >> agent-runs.csv

Then compare cost before and after each change:

Before prompt caching
After trimming CLAUDE.md
After changing the test reporter
After routing easy work to a cheaper model

If the numbers do not drop, that strategy is probably not your main leak.

Strategy comparison table

Strategy	Typical token savings	Effort
Limit the working scope, e.g. name the files	30–60% of input per run	Low
Keep the memory file short and stable	5–15% per turn	Low
Use `/compact` or `/clear` between tasks	40–80% for long sessions	Low
Prompt caching for stable prefixes	About 90% on the cached part	Medium
Route easy work to a cheaper model	50–80% on those subtasks	Medium
Reduce/filter tool output	20–50% for tool-heavy runs	Low
Read only part of a file	70–95% for large files	Low
Limit retrieval/RAG	30–60% for retrieval-heavy agents	Medium
Measure cost per run	No direct saving, but makes real optimisation possible	Low

The figures are approximate ranges. Actual results depend on each team's workflow and baseline waste.

Checklist to apply right away

Start with the items you do once and that pay off on every run:

[ ] Trim CLAUDE.md down to commands, rules, and pointers
[ ] Add ignores for generated files, build output, coverage, and dependencies
[ ] Switch the test command to silent/summary output
[ ] Use git diff --stat before a full diff
[ ] Tell the agent to read only the named function or file
[ ] Use /compact when a session gets long
[ ] Use /clear when you switch tasks
[ ] Make the cheaper model the default for easy work
[ ] Enable prompt caching if you call the model API yourself
[ ] Log token usage or cost per task

Conclusion

Token costs for CLI coding agents can be cut without changing the quality of the work. The places to focus are context that is resent, unnecessary output, and using an expensive model for tasks that do not need heavy reasoning.

Start with the easiest things: scope tasks narrowly, quiet command output, trim the memory file, and clear the session between tasks. Then add prompt caching, model routing, and cost tracking so the savings are measurable.

Google I/O 2026 Dev Keynote: Recap

Khairunnisaas — Wed, 20 May 2026 08:43:17 +0000

In case you missed it, here's a recap of what Google announced at I/O today. Some of these are updates, some are brand new. Let's get into it.

The Antigravity Ecosystem

Antigravity 2.0

Google officially introduced Antigravity 2.0, the next evolution of its AI development platform.

Think of it as a mission control center for AI agents. Instead of just chatting with an AI assistant, Antigravity lets developers orchestrate multiple agents, tools, workflows, and cloud resources in one place. It’s designed for teams and enterprises that want to build agent-powered applications at scale.

For enterprise users, Antigravity connects directly to your Google Cloud project and automatically follows the same security rules, permissions, and policies your organization already uses.

Google is basically trying to turn AI agents into actual coworkers.

Antigravity IDE
Because Antigravity is now an agentic platform, Google needed a separate place for the code editor experience. That's the Antigravity IDE. It's basically what the original Antigravity app was.

Honestly, I'm not sure why Google didn't just make two modes in one app. But okay.

Antigravity CLI
From what Google showed during the keynote, this looks like a revamped replacement for Gemini CLI.

Google didn’t fully explain the differences on stage, but the biggest clue is this:

Google already published a migration guide for moving from Gemini CLI to Antigravity CLI, which strongly suggests Gemini CLI will eventually be deprecated.

If you want to read more about the transition, Google posted the official update here: Google’s migration update for Antigravity CLI

I also don't know why they made a new product for this. I mean, why not just update Gemini CLI?

Android Development Is Getting More Agentic Too

Google also announced several updates focused on Android development.

Android CLI + Android Knowledge Base
Google officially brought Android support into Antigravity through Android CLI — a tool that prepares agents for Android development workflows.

One of its biggest features is the Android Knowledge Base, which acts as a constantly updated source of official Android developer guidance. This allows AI agents to pull the latest best practices, APIs, and documentation while working on your project.

Google also introduced something called Android Skills — open-source skills that help LLMs better understand Android codebases and execute more complex workflows correctly.

Google's internal team claims this uses 70% fewer tokens and completes tasks 3x faster. Of course, those are Google’s own benchmarks, so we’ll have to wait for real-world testing to see how accurate those numbers are.

Migration Agent — Possibly One of the Coolest Demos
Google gave a preview of Migration Agent — a tool that can migrate your existing app into a native Kotlin Android app. React Native, web framework, or iOS — it doesn't matter. Still in preview, but the concept is interesting.

Web Updates

Modern Web Guidance
Moving over to web development, Google introduced something called Modern Web Guidance.

This is basically a collection of AI-ready web development best practices and tooling guidance designed specifically for AI agents.

The goal is simple:

Google wants AI coding agents to stop generating outdated web code.

Modern Web Guidance integrates directly with Baseline so agents can understand which web platform features are actually safe and modern to use across browsers.

This is a pretty big deal because one of the biggest problems with AI-generated frontend code right now is that it often recommends outdated APIs or old patterns.

WebMCP
During the Modern Web Guidance demo, Google also introduced WebMCP.

So… what exactly is it?

According to Google’s developer docs:

“WebMCP is a proposed web standard to help you build and expose structured tools for AI agents.”

In simpler terms:

WebMCP lets websites explain to AI agents how they should interact with them.

Imagine visiting a complicated dashboard or settings page and instead of manually configuring everything yourself, you could simply tell Gemini:

“Set this up for me.”

And the AI would understand how to navigate and interact with the site properly.

That’s basically the future Google is aiming for.

WebMCP is still experimental for now, and Google announced that the experimental WebMCP APIs will begin trials in Chrome 149.

HTML-in-Canvas API
Last but definitely not least, Google announced the new HTML-in-Canvas API.

This API allows developers to place real DOM elements directly inside a canvas.

That might sound small, but it actually solves one of the biggest limitations of canvas-based apps: accessibility and interactivity.

Because the elements are real DOM nodes, they are:

Searchable
Accessible
Selectable
Compatible with browser features
Better for SEO
Easier to interact with

This could become huge for browser games, design tools, editors, and highly interactive web apps.

That's everything from the developer keynote. The general direction is clear — Google is going all-in on agents. Antigravity is the center of that story.

Quantitative Content Methodology: 5-Layer Content Framework

Gülşah Arslan — Wed, 20 May 2026 08:43:09 +0000

Quantitative Content Methodology (QCM) treats content not as mere text, but as a mathematical dataset optimized for search engines and LLMs. In this guide, we explain the 5-layer content framework applicable to any topic, step-by-step.

Key Takeaways
• QCM builds pages based on semantic vectors, information density, and probabilistic word distribution.
• An entity pool is extracted prior to production; content is fed from this pool rather than through random word selection.
• An information density budget is defined for each section—targeting at least 2.5 verifiable data points per 100 words.
• The first sentence under every H2 heading serves as an "atomic answer"; it remains meaningful even when extracted from context by an LLM.
• JSON-LD schemas (FAQPage, HowTo, Dataset) present content to search engines as variable-value pairs.

Ranking on the first page is no longer enough. Generative search engines like Google’s AI Overviews, ChatGPT, and Gemini exclusively cite structured, high-information-density pages as sources when generating direct answers to user queries. QCM is a content production framework designed for this new reality.

The 5 layers below represent the methodological steps to be applied at every stage of the production process. We will use "Core Web Vitals Optimization for E-commerce Sites" as our example topic, though the skeleton is adaptable to any industry.

The 5 Layers of QCM

Each layer builds upon the previous one. Skipping steps diminishes the effectiveness of subsequent stages.

1. Semantic Vector Map

Before writing, the main entity (core concept) and sub-entities with vectorial proximity are identified. Embedding models (BERT, Sentence-BERT) measure word proximity using cosine similarity. If the content is written while maintaining this cluster distribution, the page signals that it "covers the entire topic."

Layer	Entity	Proximity	Target Frequency
Core	Core Web Vitals	1.00	8–12
Primary	LCP, INP, CLS	0.85–0.92	4–6
Secondary	TBT, TTFB, FCP	0.70–0.80	2–3
Contextual	e-commerce, conversion, cart	0.55–0.65	1–2
Authority	PageSpeed, Lighthouse, web.dev	0.50–0.60	1–2

Recommendation: Define at least 15 entities for a topic. Fewer leads to superficial content; more leads to topic dilution.
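The proximity column is a cosine similarity between the embedding of the core entity and the embedding of each candidate entity. A minimal sketch of that calculation follows; the three-dimensional vectors are made-up toy values standing in for real model embeddings, and `cosine_similarity` is written out by hand rather than taken from any library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as unrelated
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, for illustration only).
core = [0.9, 0.4, 0.1]
candidates = {
    "LCP": [0.8, 0.5, 0.2],
    "TTFB": [0.6, 0.2, 0.7],
    "cart": [0.1, 0.9, 0.4],
}

# Rank candidate entities by proximity to the core entity, closest first.
ranked = sorted(candidates, key=lambda name: cosine_similarity(core, candidates[name]), reverse=True)
for name in ranked:
    print(f"{name}: {cosine_similarity(core, candidates[name]):.2f}")
```

With real embeddings, the ranked list is what gets bucketed into the Primary, Secondary, and Contextual layers of the table.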

2. Information Density Budget

The minimum concrete information unit—a figure, threshold, procedure, or definition—required per 100 words is pre-calculated per section. This approach prevents the "empty paragraph" syndrome and increases the Information Gain ratio.
• Target Information Gain: At least 1.3x the average of the top 10 competing pages, that is, 30% more verifiable data per 100 words than the competition.
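As a worked example of the budget, the sketch below computes density as verifiable data points per 100 words and Information Gain as the ratio of a page's density to the competitor average. That ratio is one reading of the 1.3x target, and the word counts, data-point counts, and function names are invented for illustration.

```python
def information_density(data_points: int, word_count: int) -> float:
    """Verifiable data points per 100 words."""
    return 100 * data_points / word_count


def information_gain(own_density: float, competitor_densities: list[float]) -> float:
    """Ratio of this page's density to the average density of competing pages."""
    average = sum(competitor_densities) / len(competitor_densities)
    return own_density / average


# Hypothetical page: 1,200 words containing 36 figures, thresholds, or definitions.
own = information_density(data_points=36, word_count=1_200)

# Hypothetical densities measured on competing pages.
competitors = [1.8, 2.2, 2.0, 1.6, 2.4]

gain = information_gain(own, competitors)
print(f"density: {own:.2f} per 100 words, gain: {gain:.2f}x, "
      f"meets 1.3x target: {gain >= 1.3}")
```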

3. Probabilistic Word Distribution

The frequency and placement of key terms are pre-determined. A mathematical balance is established between over-repetition (keyword stuffing) and under-repetition (semantic weakness) based on TF-IDF and BM25 targets.

Important Positioning Rules:
• The core term must appear in the H1, H2, and both the first and last 100 words.
• Primary terms must be positioned in at least one H2 heading.
• Contextual terms should appear 1–2 times within the natural flow without feeling forced.
• Natural readability always takes precedence over frequency targets. These targets are ceilings, not mandates.

4. Structural Skeleton (LLM-friendly layout)

To enable LLMs and AI Overviews to cite content directly, each section is structured as a question-answer atom. The answer is completed in the first sentence; justifications follow in subsequent sentences.
• Atomic Answer Rule: The first sentence under each H2 contains the independently readable answer to the query. Even if an LLM extracts that sentence alone, the information remains accurate and complete.

5. JSON-LD Schema Layer

This structure explicitly notifies Google and LLMs of the page’s mathematical clarity. JSON-LD schemas present information as variable-value pairs. Google bots no longer ask "what is this about?"; they reach the clarity of "The answer to Question A is B."

Key Schemas used in QCM:
• Article: Author, date, publisher info (Mandatory for E-E-A-T signals).
• FAQPage: Each question-answer atom in the FAQ section (Direct candidate for AI Overviews).
• HowTo: For sequential procedures (e.g., LCP reduction steps).
• Dataset: Structured markup for numerical thresholds and tables.
• BreadcrumbList: Page position in site architecture (Critical for topic clusters).
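To show what "variable-value pairs" means in practice, here is a small sketch that builds a schema.org FAQPage block from question-answer atoms and serialises it as JSON-LD. The `build_faq_jsonld` helper and the sample question are illustrative; the property names (`mainEntity`, `acceptedAnswer`, `text`) are the ones schema.org defines for FAQPage.

```python
import json


def build_faq_jsonld(qa_pairs: list[tuple[str, str]]) -> str:
    """Serialise (question, answer) pairs as a schema.org FAQPage JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


faq = build_faq_jsonld([
    ("What is a good LCP value?",
     "A good Largest Contentful Paint is 2.5 seconds or less."),
])

# The result goes inside a <script type="application/ld+json"> tag in the page head.
print(faq)
```

Each answer string should be the same atomic first sentence used under the matching H2, so the markup and the visible text agree.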

Pre- and Post-Production Audit

Before writing, the following must be answered numerically:
• Has the entity pool been extracted? (Min. 15 entities)
• Is the information density goal set for each section?
• Has the average data point count of the top 10 competitors been measured?
• Is the target Information Gain ratio defined? (1.3x recommended)

Post-publication verification metrics:
• Semantic Coverage: ≥ 85% (via InLinks / Surfer SEO)
• Information Density: ≥ 2.5 (verifiable data / 100w)
• Schema Accuracy: 0 errors (via Rich Results Test)
• LLM Source Test: Top 3 source verification (via ChatGPT / Gemini)

To apply this methodology to your own site and produce content that is shaped by data and speaks to AI, feel free to get in touch.