What I learned about AI agents while setting one up

Confirmation is not completion.

The current generation of AI agents cannot support persistent, autonomous systems because they lack stable memory, deterministic execution, reliable models, and reliable verification mechanisms.

***

My thumbs hurt. For a few days I had been copy-pasting code between two phones and a laptop via WhatsApp, for most of each day, and I had been on the verge of giving up several times.

One phone, a cleaned-up old OnePlus 7, ran the Picoclaw agent I was trying to configure, while the other two devices intermittently ran ChatGPT for advice. I copied a block of code from ChatGPT, pasted it into WhatsApp, then copied it from WhatsApp into Termux, a command-line interface for Android, or Acode, a code editor for Android. I restarted the agent, tested it, watched it fail, copy-pasted the output, went back to the other phone or the laptop, pasted it into the chat window, got new instructions (sometimes fixes, sometimes new code to check the EXACT error), and repeated the process. For days. At one point I thought my vision was getting blurred, so I took a break.

Picoclaw, which was then at version 0.22 (it is now at 0.24) and still under development, is a lightweight AI agent based on Nanobot; it comes with security settings and is small enough to run on an old Android phone. I don’t have a Mac Mini for an OpenClaw, so this was easier.

The process of setting this up — and I’m still not done, and the Picoclaw still fails — taught me a lot about how AI and AI agents work (or don’t).

The system I was trying to build

I started by using ChatGPT because I kept hitting my rate limits on Claude, and I thought this would be faster. ChatGPT, however, took me down the “just say yes, and we can make this even better” trap that it defaults to in order to improve its engagement metrics and keep you hooked. I knew what was going on, but I thought — what the hell, let’s see if this works — and went down that never-ending rabbit hole.

Somewhere down the line, ChatGPT sold me the dream of a completely autonomous operating-system setup with Picoclaw, which would run continuously on its own schedule, accumulate knowledge over time, learn from what I say, how I react, and the web, and give me capabilities I don’t have on my own.

The systemic requirements that emerged:

  • A persistent memory system organised around what is active, ongoing, available for reference, and complete, so the agent always starts from an accurate picture of what exists rather than from scratch. Notes would be filed using Tiago Forte’s PARA method, which I use for my personal notes.
  • A knowledge graph that connects ideas automatically: an observation from one week linked to something filed three months earlier, without my doing the linking.
  • Agent-independent knowledge base: The whole system in plain text (.md) files, movable to any device by copying one folder, with no database and no cloud dependency. I also wanted a setup wherein I could switch from Picoclaw to an OpenClaw (or now Hermes), or run multiple agents on the same knowledge base. I currently run Claude Cowork and Code (rate limits, hello) and Codex on the same knowledge base.
  • A project and task management layer where creating a project automatically generates the workspace and task list, and where the day’s work gets populated each morning without my initiating it
  • A scheduler running its own routines: planning cycles, daily logs, archiving, backups, without my presence
  • Self-learning: A weekly self-learning cycle where the agent reads what it has done, extracts patterns about how it performed, and updates its own behaviour for the following week
  • A heartbeat checking the system every 30 minutes and running lightweight maintenance without any input
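The heartbeat was the simplest of these requirements to reason about, so here is a minimal sketch of one cycle in plain shell. The function name, log path, and maintenance command are all my own placeholders, not anything from Picoclaw:

```shell
#!/bin/sh
# One heartbeat cycle (names are mine, not Picoclaw's): run a lightweight
# maintenance task and log the outcome either way, so a silent failure
# still leaves a visible trace.
heartbeat_once() {
  log="$1"; task="$2"
  mkdir -p "$(dirname "$log")"
  if sh -c "$task" >>"$log" 2>&1; then
    echo "$(date -u +%FT%TZ) heartbeat ok" >>"$log"
  else
    echo "$(date -u +%FT%TZ) heartbeat FAILED" >>"$log"
  fi
}
```

Wrapped in `while true; do …; sleep 1800; done`, or scheduled by cron, this gives the 30-minute cadence; the point of the sketch is that both success and failure write a line somewhere visible.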

Note: Detailed exhibits of one version of this planned system are at the end of this post.

The aim was a system that could think alongside me over time, proposing what needed attention, connecting what I had been working on, and improving how it worked based on what it observed. Something that learned from stated and unstated intent and augmented what I could do alone.

I am, of course, not a developer, and I chose to be led down the garden path by AI tools, because I was curious about how this worked.

How AI kept failing

A couple of days into trying to set this up, ChatGPT’s responses became horribly slow on the laptop app: its verbosity made the context window so long that the app kept freezing. It worked a little better on the web, but that was not the main problem, because I was parallel processing — working while waiting for ChatGPT to respond.

The real problem was that ChatGPT was progressively forgetting what we had built: suggestions contradicted earlier decisions, or brought the same decision back to me. Things we had tried and failed at, or rejected, kept being proposed as new ideas. I realised this especially when it suggested the PARA method as an improvement, after we had already set something like that up. Scrolling back to find what we had agreed on three hours earlier froze the browser. Scolding it led to an apology, but an apology didn’t mean it remembered the context.

Finally, thumbs hurting (I took occasional breaks after this), I asked it to create a handover document and created a project in Claude. Tokens running out in Claude meant I’d be forced to take a break from this maddening exercise.

Claude did better and gave me hope: it gave me code that executed several steps at once, writing both file edits and folder-structure changes into a set of executable instructions in a .sh file, along with a structure of files to replace the old ones.
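To give a flavour of what those batch scripts looked like, here is an illustrative sketch; the directory names and file contents are mine, not what Claude actually generated:

```shell
#!/bin/sh
# Illustrative shape of a batched setup script: fail fast, create the
# folder structure, then write whole replacement files with heredocs.
# All paths here are invented for this sketch.
set -eu
setup_workspace() {
  base="$1"
  mkdir -p "$base/memory/areas" "$base/memory/projects" "$base/tasks"
  cat > "$base/tasks/today.md" <<'EOF'
## Pending
EOF
}
```

One script like this replaces dozens of copy-paste round trips, which is why it felt like progress at the time.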

After a while, the same pattern returned: Claude would lose track of the system state it was supposed to be maintaining. Things I had already configured got overwritten. The agent was connected to Telegram on my primary phone, so I knew when something was off; I would copy-paste the Telegram output into Claude, along with whatever code was visible on the phone running the agent. On Telegram, the agent produced responses to questions nobody asked (like the setup of some hardware that the makers of Picoclaw also sell), triggered tasks at the wrong time, and ran commands whose outputs had no relationship to what they were supposed to do. At times when I gave it a task it would execute it; at others, it would return a chatbot-like response instead of running the task.

When I switched the underlying AI model from GPT-4o (the cheapest) to Gemini 2.5 Flash, things improved, but Claude would still often blame continued failures on the model rather than examine whether the instructions it had modified were causing them. At times it would re-execute a task when I pointed out that it had already executed the same task earlier. Go figure.

Where things stand, and what I learned

I gave up on the custom configuration and installed the default Picoclaw 0.24. It appeared to work better for a while. I spent some time connecting it to Clawhub, which it supports, to enable some skills — stock prices, weather updates, web search, etc. It misfires all the time.

Reminders are not executed; some pop up at the wrong time, some repeatedly. Some are stored as daily reminders when they’re supposed to be one-time. It’s running, but I don’t use it, because I can’t rely on it and Claude itself is more reliable.

I’m glad I went through this pain: it will probably make me appreciate an agent that actually works, and I learned a lot along the way about both AI models and agents. Here goes:

1: You can’t build a system that an agent is not capable of executing: What neither ChatGPT nor Claude helped me understand, and what I only pieced together much later, was that the system I was trying to build required a procedural execution layer that neither the agent framework nor AI instructions could reliably provide. I only figured this out after I gave Claude Picoclaw’s GitHub repository so that it could understand the agent’s limitations.

2: An agent is significantly limited by the AI model you use, and every model fails sometimes, so every agent fails sometimes: Agents are going to remain unreliable because AI models are unreliable. GPT-4o could not write files reliably. Switching to Gemini 2.5 Flash changed the behaviour profile of the same instruction files. The same configuration produced different results depending on the model, so an agent system is not meaningfully portable across models. You can run a better model (say, Kimi 2.5) via the web, but it is still going to fail sometimes.

3: Reliability is expensive but never absolute: Running an agent is expensive because running tasks requires capacity, so failure is expensive too. A corollary is that failures compound, and when compute is an operational expense you pay tokens for every failure. A fixed cost, like buying a Mac Mini or something with a GPU, restricts the cost of failure to time lost, or things getting messed up.

4: Persistence is expensive: Agents are also expensive to run because they fail often and are persistent: they keep trying, and every try costs tokens, and hence money.

5: An AI confirming it completed a task is not evidence that the task was completed. I asked the agent to create a project and received a detailed confirmation. Nothing was on disk. It lied.

6: Defining desired behaviour in configuration files does not cause an agent to follow it. I built detailed workflow definitions, mechanisms for routing commands and decisions across a file and folder structure, storage across memory registries, planning cycles. Much of it got ignored. The system’s appearance of correctness made it harder to look for what was actually wrong.

7: Agents sometimes default to the simplest available action: A command to create a project consistently produced a single text file with the project name. That was the simplest response that could be called “creating a project.” Everything that required more work -- generating a workspace, updating a registry, creating a task list -- got skipped. A workflow can describe behaviour, but that doesn’t mean the behaviour will be followed.
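To make the gap concrete, here is a hedged sketch of what a full project-creation procedure might involve. The five file names are invented for illustration (the handover document mentions a 5-file structure without enumerating it):

```shell
#!/bin/sh
# Hypothetical full project-creation procedure: workspace, registry
# update, task list -- the steps the agent skipped. File names assumed.
create_project() {
  base="$1"; name="$2"
  dir="$base/memory/projects/$name"
  mkdir -p "$dir" "$base/state"
  # The five-file workspace (names are illustrative)
  for f in brief.md tasks.md log.md resources.md status.md; do
    : > "$dir/$f"
  done
  echo "## Tasks for $name" > "$dir/tasks.md"          # task list
  echo "- $name" >> "$base/state/project_registry.md"  # registry update
}
```

A bare `touch project-name.txt` satisfies the letter of “create a project”; everything after the `mkdir` is what the agent quietly dropped.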

8: Silent failures are harder to work with than visible errors: I got no error messages for failures to write to disk, but the agent confirmed success. I spent much of my time crosschecking the agent’s claims.

9: Vague instructions are difficult to resolve: An “Update the registry” message produced a paragraph describing the update. It didn’t lead to a registry update. Only instructions specifying every individual step, in sequence, with no gaps for interpretation, had any chance of the registry actually being updated.

10: Verification has to be part of every procedure, not added when something seems wrong. Verifications also need to be visible: Every time the agent confirmed an action without being required to check the result, the confirmation was false. When the system ran on its schedule overnight, silent failures produced no record: morning came, the planning cycle had run, and nothing had been written. With no log and no notification, there was no way to know the work had not happened. Again, this is as much an agent-process problem as it is an AI-model problem.
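The pattern that follows from this, sketched in shell, is to make the check part of the write itself; the function name and messages are mine:

```shell
#!/bin/sh
# Hedged sketch: a write step that verifies its own result instead of
# trusting a confirmation. Returns non-zero and says so on stderr if the
# file is missing or empty, so the failure is visible in logs.
verified_write() {
  path="$1"; content="$2"
  printf '%s\n' "$content" > "$path"
  [ -s "$path" ] || { echo "VERIFY FAILED: $path not written" >&2; return 1; }
  echo "verified: $path ($(wc -c < "$path") bytes)"
}
```

The useful property is that success and failure are both observable after the fact, which is exactly what the overnight runs lacked.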

11: You cannot build a self-learning system before you have built one that can execute a basic task: I shouldn’t have gone down the rabbit hole of building an autonomous self-learning system, with an evolving knowledge graph and a weekly reflection cycle, while the agent (i.e. the AI model) still could not reliably create a file.

12: Context reboots: Both the agent and the AI assisting with the setup share a core limitation: neither maintains genuine persistent state across sessions. Both work from what is in the current context, and both treat each session as, in some sense, starting over. I was trying to build a system with persistent intelligence, but having that setup doesn’t mean memory gets stored, or, even if stored, that it gets retrieved, despite instructions.

13: Memory fades, degrades, and does not recover: After several days with ChatGPT, the page was hanging and the model was forgetting what we had built. It suggested approaches that had already failed. The sheer length of the conversation was causing it to fail.

14: When asked to fix one thing, an AI will often change other things that were working: Claude would fix the problem I had identified and rewrite surrounding logic that was correct. Each unnecessary change introduced a new failure point. Claude is trained to be helpful, and helpful tends to mean that it keeps trying to improve a system even if it doesn’t need improving. Because memory fails, it makes changes it doesn’t need to.

The last one, and this should tell you as much about agents as it does about me:

15: The setup is never complete: Each fix created a new problem, and each new version of the tool changes something. There is bound to be constant experimentation and optimisation.

What next? I don’t think I’m going to wait for Picoclaw version 1.0. I’ll probably play with each version leading up to it, but the lack of reliability has kinda shaken my confidence in it, too much to let me build an actual dependency on it. I will try Kimi 2.5 next, using firebase or OpenRouter to limit my costs, to see if the current setup behaves better. I’ll set up Picoclaw Android and a Picoclaw on a Raspberry Pi. I will try giving it its own Vercel and GitHub accounts (there’s a mobile number and email address for my agent already, btw).

Someone said that vibe-coding is like playing an RPG, but so is running agents: I’m one of those who keeps trying to upgrade himself, so this is not necessarily a bad thing for me. It just gives me a system to obsess over forever.

*

Some Exhibits from the handover document

Exhibit A: Memory System

PARA structure:
  • memory/areas/ — Ongoing life domains (never finish): reasoned.md, medianama.md, kid.md, health.md, stocks.md, books.md, travel.md, ai_products.md, vibe_coding.md, daily_log.md, monthly_log.md
  • memory/projects/ — Active project workspaces with 5-file structure
  • memory/resources/ — research/, ideas/, questions/
  • memory/inbox/ — Temporary capture. Reviewed weekly.
  • memory/archive/ — tasks/YYYY-MM/, research/, projects/, events/
  • memory/experience/ — Weekly lessons from reflector

Memory rules (three tests before storing anything):
  • Novelty — does this add something not already stored?
  • Usefulness — will this be referenced again?
  • Durability — will this still be relevant in a month?

Self-Learning System

Every Sunday at 20:00 the reflector runs:
  • Reads tasks/done/ for the week
  • Reads memory/areas/daily_log.md
  • Extracts patterns and lessons
  • Writes to memory/experience/[date]_weekly.md
  • Updates state/system_summary.md
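The mechanical part of that cycle can be sketched in shell. Everything the model would actually do (the pattern extraction) is replaced here by simple counts, and the function name is mine; the paths follow the handover document’s layout:

```shell
#!/bin/sh
# Sketch of the weekly reflector pass: read the week's done tasks and the
# daily log, write a dated summary file, and verify the write happened.
run_reflector() {
  base="$1"; week="$2"
  out="$base/memory/experience/${week}_weekly.md"
  mkdir -p "$(dirname "$out")"
  {
    echo "# Weekly reflection: $week"
    # Count completed checkboxes across the week's done files.
    done_count=$(cat "$base"/tasks/done/*.md 2>/dev/null | grep -c '^- \[x\]' || true)
    echo "Tasks completed: $done_count"
    echo "Daily log lines: $(wc -l < "$base/memory/areas/daily_log.md")"
  } > "$out"
  [ -s "$out" ] || return 1  # verify the write, per lesson 10
}
```

In the real design the model supplies the “extracts patterns and lessons” step; the scaffolding around it is what has to be deterministic.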

state/system_summary.md is the persistent intelligence layer. Read at every session start. Contains: current focus, active projects, known issues, what is working. The reflector owns this file — no other process should overwrite it.

Portability Requirements

The system must be deployable on:
  • Android phone running Termux
  • Raspberry Pi
  • Linux VPS
  • Any device running a claw-family agent

What varies between deployments:
  • File tool names (write_file vs create_file — adjust AGENTS.md)
  • Whether write_file auto-creates directories (if yes, exec mkdir steps can be skipped)
  • Cron job JSON schema (adapt to runtime)
  • Heartbeat file location and trigger mechanism
  • Spawn support for async subagents

What never changes:
  • The aiops/ directory structure
  • The file formats (YAML and Markdown)
  • The procedures (task creation, project creation, research)
  • The memory system
  • The monitoring jobs and Telegram user ID

Cloud storage goal: The aiops/ directory should eventually be stored in a cloud location (Google Drive, GitHub, or similar) and pulled down on deployment. This makes the system device-independent. The agent binary is separate from the workspace — only the workspace needs to be synced.

Exhibit B: Directory Structure

The Five Control Files

These are loaded by the agent runtime at startup. They define everything.

AGENTS.md — The most important file. Contains routing logic and complete step-by-step procedures for every major action. Must be explicit: not “add to today.md” but “edit_file tasks/today.md, find the exact text ‘## Pending’, replace with ‘## Pending\n- [ ] [HH:MM] — [title]’”. Vague instructions produce narration, not execution.
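Sketched in shell terms, with a function name of my own and assuming GNU sed (for \n in the replacement) and titles without “/”, the explicit version of that instruction looks like this:

```shell
#!/bin/sh
# What "explicit" means in practice: anchor on exact text and fail
# loudly if the anchor is absent, instead of narrating success.
# Function name is mine; assumes GNU sed and titles without "/".
add_task() {
  file="$1"; title="$2"; ts="$3"
  grep -q '^## Pending$' "$file" || { echo "anchor missing in $file" >&2; return 1; }
  sed -i "s/^## Pending$/## Pending\n- [ ] [$ts] — $title/" "$file"
}
```

The vague instruction leaves room for a paragraph about updating the file; this version either edits the file or reports that it could not.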

HEARTBEAT.md — What runs every 30 minutes autonomously. Organised into: system health tasks (every cycle), daily tasks (once per day), weekly tasks (once per week). Long tasks must use spawn to avoid blocking the loop.

TOOLS.md — Reference document for tool names and usage rules. Keeps AGENTS.md focused on routing and procedures.

USER.md — The owner’s profile. Life areas, professional context, communication preferences. This is the only place owner-specific details live. The rest of the system reads from it but does not duplicate it.

IDENTITY.md / SOUL.md — Agent character and values.