What we learned letting an AI agent review and write our code

Storyie Engineering Team
10 min read

After running Claude Code Action in production for a month — 13+ merged PRs, nearly 100% merge rate — we break down what works, what fails, and how the human role shifts when an AI agent is writing most of your code.

Storyie uses a GitHub Actions workflow where labeling an issue triggers Claude Code Action to propose a plan, implement it, and open a PR. We have been running this in production for about a month. In that time Claude has opened over 13 PRs — UI changes, bug fixes, refactors, feature additions — with a merge rate close to 100%.

This post is about the operational reality of that setup: what works, where it breaks down, and how the day-to-day feel of solo/small-team development changes when the agent is doing most of the typing.

TL;DR

  • Issue quality is the single largest lever on output quality. Concrete scope + explicit constraints + a clear done condition reliably produces a near-mergeable first draft.
  • The plan → do two-step is worth the overhead. Catching a bad direction at the plan stage is free; catching it after the PR is open is a wasted CI run.
  • CLAUDE.md coding guidelines are not optional. Without them the agent drifts on conventions (type safety patterns, commit format, tool choice) in ways that generate real review friction.
  • Investigation issues ("figure out why X is slow") consistently underperform. Use interactive Claude Code locally for those instead.
  • The human role shifts from writing code to writing issues and reviewing logic.

| Phase | Who does it | Time per PR |
| ------------- | ---------------- | ----------- |
| Issue writing | Human | 5–10 min |
| Plan | Claude (CI) | 2–8 min |
| Plan review | Human | 1–3 min |
| Implementation | Claude (CI) | 5–15 min |
| PR review | Human | 5–10 min |
| Total | | ~20–45 min |

What Claude has been shipping

To make this concrete, the PRs from the past month fall into a few buckets:

  • UI changes: Expo Router tab reorganization, floating action button icon swap
  • Bug fixes: note detail screen layout, auth request handling
  • Refactors: removing unused components, reformatting table displays
  • Features: filtering public diaries, preventing duplicate usernames

Most PRs touch 1–5 files. We have not used this workflow for large architecture changes — and based on what we have seen, we would not.

What makes it work

Issue quality determines output quality

This is the most important thing we learned, and it is also the most obvious in retrospect. Claude Code Action is not creative — it does what the issue says. If the issue is precise, the PR is precise. If the issue is vague, the PR interprets it in some direction that is plausible but probably not quite right.

Issues that produce clean PRs share a pattern:

  • Specific scope: "Change the icon on the float button from the current chevron to a plus sign" rather than "improve the bottom action area."
  • Stated constraints: "Do not modify existing tests." "Preserve type safety on the return value." "Do not change the API surface."
  • A verifiable done condition: "pnpm type-check passes." "The button appears in the screenshot."

The time spent sharpening an issue before filing it pays back at review time. When the scope is clear, the diff is clean and the review is fast.
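
For illustration, an issue that follows this pattern might read like this (the wording and details here are hypothetical, not copied from a real issue):

Title: Change float button icon from chevron to plus

Scope: the floating action button on the home tab. Swap the current
chevron icon for a plus icon. No other UI changes.
Constraints: do not modify existing tests. Do not change the button's
onPress behavior.
Done when: pnpm type-check passes and the button shows a plus icon in
the screenshot.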

The plan → do two-step catches mistakes early

The workflow runs two separate jobs. The "plan" job reads the codebase and posts a proposed implementation approach as a comment. We look at it before proceeding. If the plan scopes too broadly, we split the issue. If the approach is wrong, we say so in a comment and the next run corrects course. If it looks right, we flip the label to "do" and the implementation runs.

In practice, this catches roughly one redirection per five issues. That might seem low, but each redirect that happens at the plan stage instead of after the PR is opened saves a full re-run plus the cognitive cost of reviewing a diff you are about to throw away.
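
In workflow terms, the two-step is a pair of label-gated jobs. A condensed sketch, assuming the anthropics/claude-code-action entry point; the label names and prompt wording are illustrative, while the --allowedTools strings are the ones we actually use:

on:
  issues:
    types: [labeled]

jobs:
  plan:
    if: github.event.label.name == 'claude:plan'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: "Read issue #${{ github.event.issue.number }} and post an implementation plan as a comment."
          claude_args: |
            --allowedTools "Read,Glob,Grep,LS,Bash(gh issue comment:*),Bash(gh issue view:*)"

  do:
    if: github.event.label.name == 'claude:do'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: "Implement the plan posted on issue #${{ github.event.issue.number }} and open a PR."
          claude_args: |
            --allowedTools "Read,Edit,Write,MultiEdit,Glob,Grep,LS,Bash(pnpm:*),Bash(git:*),Bash(gh pr create:*)"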

Automatic issue splitting for oversized work

The workflow includes logic that triggers when an issue is too broad — specifically when it contains multiple independent features, spans more than two packages, or would require more than ten implementation steps. In those cases Claude creates smaller child issues and closes the original.

We did not expect this to work as well as it does. It turns out the agent is reasonably good at estimating its own scope. Issues that would have been messy multi-PR monoliths have consistently been split into clean, independently mergeable pieces.
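
The splitting lives in the plan job's prompt rather than in workflow logic. A paraphrased sketch: the thresholds are ours, the wording is illustrative, and note that the plan job's tool allowlist also needs gh issue create and close entries beyond the excerpt shown later in this post:

prompt: |
  Before planning, decide whether this issue is too broad. Split it if
  it contains multiple independent features, spans more than two
  packages, or would take more than ten implementation steps. To split:
  create one child issue per piece, comment on the original linking the
  children, close the original, and stop.
  Otherwise, post a step-by-step implementation plan as a comment.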

Where it breaks down

No CLAUDE.md means convention drift

We started without a CLAUDE.md coding guidelines file and saw the same problems repeat:

  • Using as any in test mocks, which Biome flags immediately
  • Running mkdir to create directories when the Write tool creates parent directories automatically
  • Inconsistent commit message formats

Once we added CLAUDE.md with explicit patterns, these stopped. The most important addition was spelling out the mock type pattern:

// Biome rejects this (noExplicitAny):
mockQueries.getData.mockResolvedValue(mockData as any);

// What CLAUDE.md specifies instead:
mockQueries.getData.mockResolvedValue(mockData as unknown as never);

It is a small detail, but it was generating linting failures on nearly every test-touching PR before we documented it. The broader point is that anything the agent cannot infer confidently from the codebase alone should be written down explicitly.
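
For reference, the relevant CLAUDE.md entries are short. A condensed sketch: the commit format shown is illustrative, the other rules are the ones above:

## Testing
- Never cast mocks with as any; Biome rejects it.
  Use: mockResolvedValue(mockData as unknown as never)

## Tools
- Do not run mkdir; the Write tool creates parent directories automatically.

## Commits
- One-line conventional format, e.g. "fix: prevent duplicate usernames"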

Hosted CI runners are too slow

The first few weeks we ran on ubuntu-latest. Installing dependencies on a cold runner took long enough that runs occasionally timed out, and the feedback loop felt sluggish.

Moving to a self-hosted runner (a Raspberry Pi) changed this significantly. The node_modules cache persists between runs, so the agent gets to actual work almost immediately. For a workflow that runs multiple times per day, the cumulative difference is substantial.
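
The workflow change itself is small. A sketch, assuming the Pi is registered under the default self-hosted label and checkout is told not to wipe ignored files:

jobs:
  do:
    runs-on: [self-hosted]
    steps:
      - uses: actions/checkout@v4
        with:
          clean: false   # keep node_modules (gitignored) from the previous run
      - run: pnpm install --frozen-lockfile   # near-instant against the warm cache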

Investigation issues do not work well

"Figure out why the auth request is sometimes failing" — issues like this reliably produce low-quality output. Claude can read the code and speculate about causes, but it cannot observe a running system, measure actual latency, or reproduce device-specific behavior.

The better approach: use Claude Code interactively in a local terminal for investigation. The interactive loop (read → run → observe → read more) is what investigation requires. The CI workflow is optimized for implementation, not diagnosis.

How the human role changes

What you stop reviewing

With CLAUDE.md and Biome, formatting, import ordering, naming conventions, and indentation are handled. Reviewing those things in PRs is no longer a good use of time.

What you review more carefully

Three things get sharper attention:

  1. Business logic intent — Claude reads code accurately but can misread intent. "Update the record's status" might mean updating only the current record or all records in a set, depending on context that is in your head and not in the issue. This is where human review earns its keep.
  2. Scope creep — the agent has a mild tendency to refactor things it notices while working on nearby code. Changes outside the stated scope go back.
  3. Fit with existing patterns — the agent picks up project conventions from CLAUDE.md and the surrounding code, but occasionally lands on a pattern that is technically fine but not how this codebase does things.

Issue writing becomes the primary skill

The time that used to go to implementation now goes to issue writing. And this turns out to be a net positive for the project. Writing a precise issue forces you to understand the problem fully before any code is touched. It creates a record of intent that is useful for future reviewers. And it naturally pushes toward smaller, more coherent changes — because a vague issue is painful to write precisely, which surfaces the smell early.

Practical details that matter

Locking down tool permissions

The --allowedTools argument controls what Claude can do in each job phase:

# plan job: read-only, issue operations only
claude_args: |
  --allowedTools "Read,Glob,Grep,LS,Bash(gh issue comment:*),Bash(gh issue view:*)"

# do job: write access, scoped to project operations
claude_args: |
  --allowedTools "Read,Edit,Write,MultiEdit,Glob,Grep,LS,Bash(pnpm:*),Bash(git:*),Bash(gh pr create:*)"

The plan job has no write permissions at all. This prevents an entire class of incident where the agent starts modifying files during the planning pass — and it also produces better plans, because the agent cannot hedge by "just trying a change to see if it works."

@claude mentions in PR review

Reviewer comments prefixed with @claude trigger the agent to apply the requested change. "Rename this variable," "add a test for the empty case," "extract this into a helper" — these resolve in minutes without leaving the PR. It keeps review conversations tight and avoids the round-trip of filing a new issue for small follow-ups.
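
Wiring this up is a comment-event trigger plus a body check. A sketch, with the same hedges as the earlier snippets:

on:
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]

jobs:
  respond:
    if: contains(github.event.comment.body, '@claude')
    runs-on: [self-hosted]
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}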

Build packages before the agent runs

In a monorepo, shared packages need to be built before the application that depends on them. Without a pnpm build:packages step before the agent's CI job, type resolution fails and the agent spends turns diagnosing build errors that have nothing to do with the task. One line in the workflow configuration eliminates this entirely.
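
That one line, in context, sits just ahead of the agent step in the do job:

steps:
  - uses: actions/checkout@v4
  - run: pnpm install --frozen-lockfile
  - run: pnpm build:packages   # pre-build shared packages so cross-package types resolve
  # ...claude-code-action step follows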

The numbers

After one month:

| Metric | Value |
| --- | --- |
| PRs opened by Claude | 13+ |
| Merge rate | ~100% (after review-requested revisions) |
| Time to write issue | 5–10 min |
| Time Claude runs (plan + do) | 5–15 min |
| Time to review and merge | 5–10 min |
| Best PR | Expo Router tab reorganization — 3 files, exactly right |
| Hardest PR | Auth investigation — environment-dependent, not reproducible in CI |

Tasks that previously took 30–60 minutes of focused implementation time now take the time it takes to write an issue.

What we would do differently

Start with CLAUDE.md. The convention drift from running without it generated more review friction than everything else combined. Document the non-obvious patterns first — type casting in tests, commit format, tool preferences — before filing the first issue.

Do not expect the agent to investigate. Any issue that starts with "figure out" or "diagnose" or "find out why" should be a local interactive session, not a CI job.

Keep issues small. The automatic splitting helps, but it is better to file small issues in the first place. A PR that touches one concern is easier to review than a PR that touches three, and the agent produces cleaner diffs when the scope is narrow.

Takeaway

The framing that fits best: this is not about having an AI write code so you do not have to. It is about shifting from implementation-mode to design-and-review-mode, and getting a collaborator that handles the mechanical translation from intent to code with reasonable fidelity.

That shift requires investment up front — in CLAUDE.md, in issue-writing discipline, in workflow configuration. Once those pieces are in place, the throughput per hour of human attention increases substantially, and the nature of that attention changes toward the parts of engineering that benefit most from human judgment.

Try Storyie

Storyie is the diary app this workflow is building. If you are curious what falls out of AI-assisted solo development on the user side, the app is at storyie.com and on iOS.