Deep Dive

How I Run a 24/7 AI Agent for Under $5/Month

Published February 18, 2026 · 22 min read · By Jesse

TL;DR: I run 9 brands, 15 cron jobs, 40+ daily automated actions, and a 24/7 AI chief of staff for under $5/month in API costs. This post is the full architecture breakdown so you can replicate it.

The Problem: AI Agents Are Expensive

I burned $450 in 3 days on Anthropic's Claude Opus.

Let me say that again. Four hundred and fifty dollars. In three days. Running an autonomous agent that was managing social media posts, scanning prediction markets, and generating content across multiple brands.

The math was brutal. Opus costs $15 per million input tokens and $75 per million output tokens. An autonomous agent that runs 24/7 — checking cron jobs, responding to messages, spinning up subagents, running heartbeat health checks — burns through tokens like a V8 engine burns through premium gas. Every 60-minute heartbeat check. Every social media posting run. Every memory sync. It all adds up.

I looked at the alternatives — cheaper models, tighter budgets, fewer automations — and none of them fit the workload.

I needed something different. Not a cheaper model. A cheaper architecture.

The Solution: Multi-Provider Fallback Architecture

The core insight is simple: not every AI task requires the same model.

When I'm chatting with my agent interactively — debugging a deployment, brainstorming a strategy, reviewing code — I want the best model available. That's Anthropic's Claude Sonnet. It's smart, fast, and follows complex instructions well. This is worth paying for.

But when a cron job fires at 8 AM to post tweets across 7 accounts? When a heartbeat check pings every 60 minutes to confirm the agent is alive? When a subagent spins up to generate TikTok hooks or verify Amazon affiliate links? None of that needs Sonnet. It needs a model that can follow instructions, use tools, and not hallucinate. Google's Gemini Flash does that for free.

The architecture looks like this:

+-------------------+
|  Mac Mini M4 Pro  |
|    (always-on)    |
+--------+----------+
         |
+--------v----------+
|     OpenClaw      |
|   Agent Gateway   |
+--------+----------+
         |
+--------v------------------------------+
|             Model Router              |
+---------------------------------------+
| Interactive Chat:                     |
|   Anthropic Sonnet             [$$$]  |
|                                       |
| Cron Jobs / Subagents:                |
|   Google Gemini 3 Flash        [FREE] |
|                                       |
| Heartbeat Health Checks:              |
|   Google Gemini 2.5 Flash Lite [FREE] |
|                                       |
| Fallback Chain:                       |
|   Groq Llama 4 Scout           [FREE] |
|   Mistral Small                [FREE] |
|   Synthetic Kimi K2.5          [FREE] |
|   Synthetic GLM-4.7            [FREE] |
|   Ollama Qwen 32B             [LOCAL] |
+---------------------------------------+

The key word is routing. Every request gets routed to the right model for its job. Expensive model for expensive tasks. Free model for everything else. And if a free provider is down or rate-limited, the request cascades through a fallback chain until something works.

The Stack

Hardware: Mac Mini M4 Pro

I chose the Mac Mini M4 Pro for a few reasons:

You don't need a Mac Mini. A cheap Linux box, an old laptop, even a Raspberry Pi 5 could work if you skip the local model fallback. The agent gateway itself is lightweight — it's the LLM inference that's resource-hungry, and we're offloading that to cloud APIs anyway.

Software: OpenClaw Agent Gateway

OpenClaw is the orchestration layer. It handles model routing, cron scheduling, subagent spawning, session management, heartbeat health checks, and the messaging integrations (WhatsApp and the web gateway).

The entire config lives in a single JSON file. Here's the core structure (keys redacted):

{
  "auth": {
    "profiles": {
      "anthropic:default": {
        "provider": "anthropic",
        "mode": "api_key"
      },
      "google:default": {
        "provider": "google",
        "mode": "api_key"
      },
      "groq:default": {
        "provider": "groq",
        "mode": "api_key"
      },
      "mistral:default": {
        "provider": "mistral",
        "mode": "api_key"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-5",
        "fallbacks": [
          "google/gemini-3-flash-preview",
          "groq/meta-llama/llama-4-scout-17b-16e-instruct",
          "mistral/mistral-small-latest",
          "synthetic/hf:moonshotai/Kimi-K2.5",
          "synthetic/hf:zai-org/GLM-4.7"
        ]
      }
    }
  }
}

That fallbacks array is doing most of the heavy lifting. If Anthropic's API is down, the agent transparently falls through to Gemini, then Groq, then Mistral, then Synthetic providers, then local Ollama. The user (me, chatting over WhatsApp) never notices. The agent just keeps working.
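To make the cascade concrete, here's a minimal sketch of what a fallback chain does. This is not OpenClaw's actual code — `route_with_fallbacks`, `ProviderError`, and `fake_call` are hypothetical names standing in for the gateway's internals and the real provider APIs:

```python
# Minimal sketch of a fallback cascade (illustrative, not OpenClaw's code).
# Each provider is tried in order; any failure moves on to the next one.

class ProviderError(Exception):
    """Raised when a provider is down or rate-limited."""

def route_with_fallbacks(prompt, providers, call_fn):
    """Try each provider in order; return the first successful response.

    `call_fn(provider, prompt)` is a stand-in for the real API call.
    """
    errors = []
    for provider in providers:
        try:
            return provider, call_fn(provider, prompt)
        except ProviderError as exc:
            errors.append((provider, str(exc)))  # record failure, cascade
    raise RuntimeError(f"all providers failed: {errors}")

# Example: Anthropic is "down", so the request falls through to Gemini.
def fake_call(provider, prompt):
    if provider == "anthropic/claude-sonnet-4-5":
        raise ProviderError("503 upstream error")
    return f"[{provider}] ok"

chain = [
    "anthropic/claude-sonnet-4-5",
    "google/gemini-3-flash-preview",
    "groq/meta-llama/llama-4-scout-17b-16e-instruct",
]
used, reply = route_with_fallbacks("post the morning tweets", chain, fake_call)
```

The point of the pattern: the caller never sees the failure, only the first provider that answered.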

Model Routing: The Key Insight

This is the part that eliminates over 95% of the cost. Let me walk through each routing tier.

Tier 1: Interactive Chat — Anthropic Claude Sonnet ($3/M in, $15/M out)

When I message my agent over WhatsApp or through the web gateway, that's a real-time conversation where I need the model to understand nuance, follow complex multi-step instructions, and give me high-quality responses. Sonnet handles this.

But here's the thing — I'm not chatting with my agent 24 hours a day. Realistically, I send maybe 20-50 messages on a busy day. Some days zero. At Sonnet's pricing, that's maybe $3-5/month in interactive chat costs. That's the entire paid portion of my stack.
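The back-of-envelope math checks out. Per-message token counts here are assumptions for illustration (roughly 1,000 input tokens including pruned context and 300 output tokens per exchange), not measured values:

```python
# Rough monthly cost estimate for interactive chat on Sonnet.
# Per-message token counts are assumed, not measured.

IN_PRICE = 3 / 1_000_000    # $3 per million input tokens
OUT_PRICE = 15 / 1_000_000  # $15 per million output tokens

def monthly_chat_cost(msgs_per_day, in_tok_per_msg, out_tok_per_msg, days=30):
    tokens_in = msgs_per_day * in_tok_per_msg * days
    tokens_out = msgs_per_day * out_tok_per_msg * days
    return tokens_in * IN_PRICE + tokens_out * OUT_PRICE

# 20 messages/day, ~1,000 input and ~300 output tokens each:
cost = monthly_chat_cost(20, 1_000, 300)  # $4.50/month
```

Even doubling those assumptions keeps interactive chat in single digits per month.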

Tier 2: Cron Jobs — Google Gemini 3 Flash (FREE)

Every automated task — social media posting, prediction market scanning, content generation, memory sync, cross-engagement — runs on Gemini 3 Flash Preview. Google's free tier is absurdly generous: up to 1,500 requests per day on Flash models, which is far more headroom than 40+ daily actions will ever use.

For automated cron jobs, Gemini Flash is more than smart enough. It can read a content queue, search the web for breaking news, draft tweets, call posting APIs, and log results. It follows multi-step instructions reliably. Is it as good as Sonnet? No. But for "read this file, post this tweet, log the result" — it doesn't need to be.

Here's what the model override looks like in a cron job config:

{
  "name": "X Morning Posts -- All Brands",
  "schedule": {
    "kind": "cron",
    "expr": "0 13 * * *",
    "tz": "UTC"
  },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Morning X posting run. Post 1 tweet per account
               with 3-minute delays. Read content queues, check
               post logs for duplicates, post via x-poster.js.",
    "model": "google/gemini-3-flash-preview",
    "timeoutSeconds": 600
  }
}

Notice "model": "google/gemini-3-flash-preview". That overrides the agent's default Sonnet model for this specific job. The cron fires, spins up an isolated session, uses the free model, does the work, and shuts down. Zero API cost.

Tier 3: Subagents — Google Gemini 3 Flash (FREE)

When the main agent needs to parallelize work — like generating content for 9 brands simultaneously — it spawns subagents. Each subagent gets its own context and can run concurrently. The config:

"subagents": {
  "maxConcurrent": 8,
  "model": "google/gemini-3-flash-preview",
  "thinking": "low"
}

Eight concurrent subagents, all on the free tier. The "thinking": "low" setting reduces reasoning token overhead — subagents don't need deep chain-of-thought for most tasks.
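The fan-out pattern itself is simple bounded concurrency. A sketch under stated assumptions — `generate_content` is a placeholder for a real subagent run, and the pool cap mirrors the `maxConcurrent: 8` setting:

```python
# Sketch of bounded subagent fan-out: 9 brand tasks, at most 8 running
# at once. The worker function is a stand-in for a free-tier Flash call.
from concurrent.futures import ThreadPoolExecutor

BRANDS = [
    "GovConsultPro", "DevToolCloud", "HomeOfficeAI", "ProPlannerStudio",
    "OpsDeskAI", "IronCladPress", "WealthUnder30", "WineWear", "TheOpsDesk",
]

def generate_content(brand):
    # Placeholder for a subagent run on google/gemini-3-flash-preview.
    return f"{brand}: drafted"

with ThreadPoolExecutor(max_workers=8) as pool:  # cap matches maxConcurrent
    results = list(pool.map(generate_content, BRANDS))
```

With 9 brands and a cap of 8, the ninth task simply queues until a slot frees up.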

Tier 4: Heartbeats — Google Gemini 2.5 Flash Lite (FREE)

Every 60 minutes, the agent runs a health check. Is the gateway responsive? Are cron jobs firing? Is the filesystem accessible? This is the simplest possible task, so it gets the cheapest possible model:

"heartbeat": {
  "every": "60m",
  "model": "google/gemini-2.5-flash-lite"
}

Gemini Flash Lite is Google's smallest model. It's fast, it's free, and it can answer "are you alive?" 24 times a day without anyone caring about response quality.

Tier 5: Fallback Chain — Groq, Mistral, Synthetic, Ollama (ALL FREE)

If Google's API is down (rare, but it happens), requests cascade through:

  1. Groq — Running Llama 4 Scout on their free tier. Fast inference, decent quality.
  2. Mistral — Mistral Small on the free tier. Good for structured tasks.
  3. Synthetic — A proxy service running open-source models (Kimi K2.5, GLM-4.7). Zero cost, large context windows (up to 256K tokens).
  4. Ollama (local) — Qwen 2.5 32B running on the Mac Mini's hardware. Last resort. Slow but works offline.

I've never actually hit the Ollama fallback in production. The cloud free tiers are reliable enough that local inference is purely insurance.

Cost Breakdown: Real Numbers

Here's what actually costs money and what doesn't.

| Component                                  | Provider                   | Monthly Cost   |
|--------------------------------------------|----------------------------|----------------|
| Interactive chat (20-50 msgs/day)          | Anthropic Sonnet           | $3 - $5        |
| Cron jobs (15 recurring, 40+ actions/day)  | Google Gemini Flash        | $0 (free tier) |
| Subagents (up to 8 concurrent)             | Google Gemini Flash        | $0 (free tier) |
| Heartbeats (every 60 min)                  | Google Gemini Flash Lite   | $0 (free tier) |
| Fallback inference                         | Groq / Mistral / Synthetic | $0 (free tiers)|
| Local fallback                             | Ollama (Qwen 32B)          | $0 (local)     |
| Hosting / compute                          | Mac Mini (electricity)     | $3 - $5        |
| Website hosting (9 sites)                  | Cloudflare Pages           | $0 (free tier) |
| Total monthly API cost                     |                            | $3 - $5        |

(Electricity for the Mac Mini is the only other recurring cost, so even the all-in total stays under $10.)

Compare that to my first three days on Opus at $150/day. Same workload. Same outputs. Over 99% cheaper.

The math works because the vast majority of token usage comes from automated background tasks — cron jobs, subagents, heartbeats — not from me chatting. By routing all of that to free models, the only paid usage is my relatively light interactive conversation.
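To see why the split works, here's the token-share arithmetic with illustrative monthly volumes (these specific numbers are assumptions, not measurements — the point is the ratio, not the magnitudes):

```python
# Illustrative monthly token split between free background work and
# paid interactive chat. Volumes are assumed for the sake of the ratio.
background_tokens = 50_000_000   # cron jobs + subagents + heartbeats (free tier)
interactive_tokens = 1_500_000   # interactive chat (paid Sonnet)

free_share = background_tokens / (background_tokens + interactive_tokens)
# ~97% of all tokens never touch a paid API
```

When ~97% of your tokens are free, the paid 3% is the whole bill.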

The Cron System: 40+ Daily Actions on Free Models

The cron system is where this architecture really pays off. Here's a snapshot of my daily automated schedule.

Recurring Daily Crons

The recurring schedule covers morning and evening posting runs across all brands, twice-daily memory syncs (8 AM and 8 PM), prediction market scans and settlement checks, content generation, and cross-brand engagement. That's 15 recurring cron jobs producing 40+ individual actions per day. Every single one runs on google/gemini-3-flash-preview. Zero API cost.

One-Shot Crons

Beyond recurring jobs, I use one-shot crons for batch operations. These fire once at a scheduled time and self-delete after completion:

{
  "name": "Gumroad: Upload 20 Products (One-Time)",
  "enabled": true,
  "deleteAfterRun": true,
  "schedule": {
    "kind": "at",
    "at": "2026-02-18T19:07:10.000Z"
  },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "BROWSER TASK: Upload all Gumroad products.
               Use browser profile already logged in.
               Read product listings for titles, descriptions,
               prices. Upload each PDF. Publish. Log results.",
    "model": "google/gemini-3-flash-preview",
    "timeoutSeconds": 1200
  }
}

I queue up a batch of one-shot crons — upload 20 Gumroad products, publish 10 Medium articles, verify 60 Amazon ASIN links, write 15 YouTube scripts — set them to fire 10 minutes apart, and walk away. The agent handles everything autonomously, all on free models.

The "deleteAfterRun": true flag keeps the cron list clean. Fire and forget.

On a typical big batch day, I'll queue 10-15 one-shot crons, plus a failure-checker that runs an hour later to retry anything that broke, plus a final status report cron that sends me a summary over WhatsApp. The entire batch runs on free models. I just read the report when I'm done with my day job.
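Generating a staggered batch is mechanical enough to script. A sketch that emits one-shot cron entries 10 minutes apart — the field names follow the config excerpts in this post, but treat the generator itself as illustrative:

```python
# Sketch: build a batch of one-shot cron entries staggered 10 minutes
# apart, matching the fire-and-forget pattern above.
from datetime import datetime, timedelta, timezone

def one_shot_batch(tasks, start, gap_minutes=10):
    jobs = []
    for i, (name, message) in enumerate(tasks):
        fire_at = start + timedelta(minutes=gap_minutes * i)
        jobs.append({
            "name": name,
            "enabled": True,
            "deleteAfterRun": True,  # self-delete keeps the cron list clean
            "schedule": {
                "kind": "at",
                "at": fire_at.isoformat().replace("+00:00", "Z"),
            },
            "sessionTarget": "isolated",
            "payload": {
                "kind": "agentTurn",
                "message": message,
                "model": "google/gemini-3-flash-preview",  # free tier
                "timeoutSeconds": 1200,
            },
        })
    return jobs

start = datetime(2026, 2, 18, 19, 0, tzinfo=timezone.utc)
batch = one_shot_batch(
    [("Gumroad uploads", "Upload products."),
     ("Medium publish", "Publish articles.")],
    start,
)
```

Queue the output, walk away, and read the status report later.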

One-Shot Browser Tasks: The Cost Arbitrage

This is probably the biggest cost optimization I've found, and it's counterintuitive.

The problem: If I chat with my agent and say "hey, go upload these 20 products to Gumroad," that entire conversation happens on Sonnet (my interactive chat model). I'm paying $3/$15 per million tokens for a task that could take 20 minutes and thousands of tokens of back-and-forth as the agent navigates browser pages, handles upload dialogs, and fills in forms.

The solution: Instead of chatting, I create a one-shot cron job with the full instructions baked into the payload message. The cron fires on Gemini Flash (free), runs the browser automation in an isolated session, logs the results, and self-destructs. Same outcome, zero cost.

I use this pattern for everything that involves browser automation: Gumroad product uploads, Medium publishing, Amazon affiliate link verification — any task where the steps can be written down in advance.

The key mental shift: don't chat about tasks that can be scheduled. Write the full instructions once, queue it as a cron, and let the free model handle execution.

The instructions need to be thorough — you can't have a back-and-forth conversation with a cron job. But that's actually a feature, not a bug. It forces you to think through the task completely upfront, which means the agent executes more reliably than if you were drip-feeding instructions in a chat.

Cost Controls That Actually Matter

Context Pruning (6-Hour TTL)

LLM costs scale with context length. Every message in the conversation history gets sent with every new request. An agent that's been running for 12 hours accumulates a massive context window, and you're paying to re-send all of it on every turn.

"contextPruning": {
  "mode": "cache-ttl",
  "ttl": "6h",
  "keepLastAssistants": 3,
  "softTrimRatio": 0.7,
  "hardClearRatio": 0.85
}

This config ages out messages older than 6 hours, always keeps the last 3 assistant responses (for continuity), and starts aggressive trimming when context hits 70% of the model's window. At 85%, it hard-clears old context.

The result: context stays lean. The agent doesn't carry around a morning conversation about tweet scheduling when it's doing an evening settlement check.
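The threshold logic from that config reduces to a simple decision function. A sketch of how the soft-trim and hard-clear ratios behave (the function name is mine, not OpenClaw's):

```python
# Sketch of the soft-trim / hard-clear thresholds from the config above.
# Ratios are fractions of the model's context window.
SOFT_TRIM = 0.70
HARD_CLEAR = 0.85

def pruning_action(context_tokens, window_tokens):
    ratio = context_tokens / window_tokens
    if ratio >= HARD_CLEAR:
        return "hard-clear"   # drop old context entirely
    if ratio >= SOFT_TRIM:
        return "soft-trim"    # aggressively age out old messages
    return "keep"

# With a 200K-token window: 150K tokens (75%) triggers a soft trim,
# 180K tokens (90%) triggers a hard clear.
```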

Memory Flush on Compaction

When context gets compacted (summarized to save tokens), the agent first extracts important information to persistent files:

"compaction": {
  "mode": "default",
  "memoryFlush": {
    "enabled": true,
    "softThresholdTokens": 40000,
    "prompt": "Extract key decisions, state changes, lessons,
               blockers to memory/YYYY-MM-DD.md.
               Format: ## [HH:MM] Topic.
               Skip routine work. NO_FLUSH if nothing important.",
    "systemPrompt": "Compacting session context.
                     Extract only what's worth remembering. No fluff."
  }
}

When the session hits 40K tokens, the agent writes anything worth remembering to a daily markdown file, then the context gets compressed. This means the agent can "forget" the conversation details but still has a persistent memory trail it can read back later.

I also run a memory sync cron twice daily (8 AM and 8 PM) that reads all recent memory files, updates a master MEMORY.md document, and git commits the changes. This creates a continuity mechanism — if the agent loses context entirely, it can reconstruct state from the memory files.

Heartbeat on Cheapest Model

Heartbeats fire every 60 minutes. That's 24 API calls per day, 720 per month. Even on a cheap model, that adds up if you're paying per token. By routing heartbeats to Gemini Flash Lite — the absolute smallest free model available — these 720 monthly health checks cost exactly $0.

Isolated Sessions for Crons

Every cron job runs in "sessionTarget": "isolated" mode. This means it gets a fresh, empty context — no conversation history from previous runs or my interactive chat. This keeps token usage minimal per job and prevents context bleed between unrelated tasks.

What I Learned the Hard Way

1. Ollama Is Too Slow for Agentic Work

I started with Ollama as my primary model. Qwen 2.5 32B on an M4 Pro with 24GB RAM. It works. Kind of. Token generation is ~20 tok/sec. A cron job that takes Gemini Flash 30 seconds takes Ollama 8-10 minutes. And the quality drops — the model skips steps, hallucinates tool names, and can't reliably handle multi-step browser automation.

Local models are a fine backup. They're a terrible primary for anything agentic. Use cloud APIs and save local for offline emergencies.
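The gap is just throughput arithmetic. Using the ~20 tok/sec figure quoted above, and assuming (my estimate, not a benchmark) a cloud Flash model streams around 300 tok/sec:

```python
# Rough arithmetic behind the "30 seconds vs 8-10 minutes" gap.
# 20 tok/s is the local figure quoted above; 300 tok/s for the cloud
# model is an assumed ballpark, not a measured benchmark.
def job_seconds(output_tokens, tokens_per_second):
    return output_tokens / tokens_per_second

local_s = job_seconds(10_000, 20)    # Ollama Qwen 32B: ~8.3 minutes
cloud_s = job_seconds(10_000, 300)   # cloud Flash: ~33 seconds
```

A 10K-token cron job that finishes in half a minute on a cloud API ties up the local machine for most of ten minutes.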

2. Model IDs Are Annoyingly Specific

I wasted an afternoon debugging why my agent was returning errors from Anthropic. The problem? I had claude-sonnet-4-6 as my model ID. That model doesn't exist. The correct ID is claude-sonnet-4-5. No helpful error message. Just a cryptic 400 response.

Always check the provider's model ID docs. Don't guess. Don't assume the naming pattern. Each provider has its own convention, and getting it wrong fails with an unhelpful 400.

3. Per-Agent Auth Stores vs. Global Config

Early on, I put all my API keys in environment variables. This works until you need different credentials for different tasks. My Kalshi trading bot needs its own API key. My X posting system needs per-account OAuth tokens. My Cloudflare deployments need a separate API token.

The solution was auth profiles — named credential sets that specific agents or cron jobs can reference:

"auth": {
  "profiles": {
    "anthropic:default": {
      "provider": "anthropic",
      "mode": "api_key"
    },
    "google:default": {
      "provider": "google",
      "mode": "api_key"
    }
  }
}

The global config defines the profiles, but each task can specify which profile to use. This keeps credentials organized and makes it easy to rotate keys without touching every job config.
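One way to picture the lookup: profile names map to providers, and providers map to credentials. This sketch assumes keys live in environment variables named after the provider — the env-var naming and `resolve_key` helper are hypothetical, not OpenClaw's actual mechanism:

```python
# Sketch of resolving a named auth profile to a credential. The profile
# layout mirrors the JSON above; the env-var convention is assumed.
import os

PROFILES = {
    "anthropic:default": {"provider": "anthropic", "mode": "api_key"},
    "google:default": {"provider": "google", "mode": "api_key"},
}

def resolve_key(profile_name, env=os.environ):
    profile = PROFILES[profile_name]
    var = f"{profile['provider'].upper()}_API_KEY"  # e.g. GOOGLE_API_KEY
    key = env.get(var)
    if key is None:
        raise KeyError(f"missing {var} for profile {profile_name}")
    return key

key = resolve_key("google:default", env={"GOOGLE_API_KEY": "redacted"})
```

Rotating a key then means updating one credential, not editing every job config that references the profile.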

4. systemEvent vs agentTurn for Crons

Cron job payloads have two modes: systemEvent and agentTurn. I initially used systemEvent for everything because it seemed more "autonomous." Wrong.

systemEvent is for system-level signals (restart, config change). agentTurn is for "do this task." Using the wrong one causes crons to silently skip with a confusing error message: "main job requires payload.kind=systemEvent".

Rule of thumb: if the cron should make the agent do something, use agentTurn. If it's a system infrastructure signal, use systemEvent. 95% of the time you want agentTurn.

5. Session Isolation Is Non-Negotiable

My first cron setup used "sessionTarget": "main" for everything. This means every cron job ran in the same session as my interactive chat. The agent's context became a mess — tweet posting instructions mixed with Kalshi trade analysis mixed with memory sync operations. The agent got confused. Quality dropped. Costs spiked because the context window was enormous.

Now every cron runs in "sessionTarget": "isolated". Clean context, clear instructions, predictable results.

6. Write Complete Instructions, Not Conversations

The biggest quality difference between chat-driven tasks and cron-driven tasks is instruction completeness. When you chat, you can course-correct mid-stream. When a cron fires, the agent gets one shot at the instructions and has to figure it out.

My early cron job messages were too brief. "Post tweets for all brands." The agent would miss steps, post in the wrong order, or forget to log results. Now my cron payloads are comprehensive — step-by-step instructions with explicit file paths, fallback behavior, error handling, and logging requirements. Verbose? Yes. But they work on the first try.

Results: What This Stack Actually Produces

In its first two weeks of operation, the agent has shipped daily posts across 7 X accounts, batch product uploads to Gumroad, published Medium articles, verified Amazon affiliate links, monitored prediction markets, and deployed 9 brand websites to Cloudflare Pages.

All of this for less than the cost of a large coffee per month.

The brands managed: GovConsultPro (government contracting), DevToolCloud (developer tools), HomeOfficeAI (home office gear reviews), ProPlannerStudio (digital planners), OpsDeskAI (AI business automation), IronCladPress (Amazon KDP publishing), WealthUnder30 (Gen Z personal finance), WineWear (lifestyle accessories), and TheOpsDesk (the flagship).

Each brand has its own voice guide, content queue, posting schedule, and cross-engagement strategy. The agent reads these files, executes the daily schedule, logs results, and sends me a status report. I review the report when I have time. Most days, everything just works.

How to Replicate This

Here's the step-by-step setup if you want to build your own version of this stack.

1 Get Your Hardware

Minimum: any machine that can run 24/7. A used Mac Mini, a Linux NUC, an old laptop, a cloud VPS. If you want local model fallback, aim for 16GB+ RAM. If you're fine with cloud-only models, even a Raspberry Pi works.

2 Set Up API Keys (Free Tiers)

Sign up for free API access on each provider: Google (Gemini), Groq, and Mistral for the free tiers, plus an Anthropic key for the paid interactive chat tier.

3 Install OpenClaw

Follow the OpenClaw setup wizard. It will walk you through provider configuration, model selection, and auth profile creation. Point it at your workspace directory where your project files live.

4 Configure Model Routing

Set your primary model to Sonnet (for interactive chat) and add all free providers to the fallback chain. Override the model for subagents and heartbeats to use free tiers:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-5",
        "fallbacks": [
          "google/gemini-3-flash-preview",
          "groq/meta-llama/llama-4-scout-17b-16e-instruct",
          "mistral/mistral-small-latest"
        ]
      },
      "subagents": {
        "maxConcurrent": 8,
        "model": "google/gemini-3-flash-preview"
      },
      "heartbeat": {
        "every": "60m",
        "model": "google/gemini-2.5-flash-lite"
      }
    }
  }
}

5 Set Up Context Pruning

Add the pruning config to keep costs down on your interactive chat model. A 6-hour TTL works well for most use cases:

"contextPruning": {
  "mode": "cache-ttl",
  "ttl": "6h",
  "keepLastAssistants": 3,
  "softTrimRatio": 0.7,
  "hardClearRatio": 0.85
}

6 Create Your First Cron Job

Start simple. A daily news scan, a social media post, a reminder. The key: always set the model to a free provider and use isolated sessions:

{
  "name": "Daily News Scan",
  "schedule": {
    "kind": "cron",
    "expr": "0 13 * * *",
    "tz": "UTC"
  },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Search the web for [your topic] news from
               the last 24 hours. Summarize the top 3 stories.
               Save to workspace/daily-news/YYYY-MM-DD.md.",
    "model": "google/gemini-3-flash-preview",
    "timeoutSeconds": 120
  }
}

7 Add Memory Persistence

Set up the memory flush system so your agent doesn't lose important context when sessions compact. Create a memory/ directory in your workspace and enable memory flush in compaction settings. Add a twice-daily memory sync cron to consolidate daily logs into a master memory file.

8 Scale Up

Once your basic setup is running, start adding more cron jobs. The pattern is always the same: write clear instructions in the payload message, set the model to a free provider, use isolated sessions, and set a reasonable timeout. Each new cron job costs $0 in API fees.

The 80/20 rule of this stack: 80% of the value comes from getting Steps 2-4 right. Model routing is the architecture decision that makes everything else affordable. Everything after that is optimization.

The Bigger Picture

We're in a weird moment for autonomous AI agents. The models are good enough to do real work — post to social media, analyze markets, generate content, automate browser tasks. But the pricing models assume you're using these capabilities in a chatbot context, not running them 24/7 as autonomous systems.

The gap between "what AI can do" and "what most people can afford to run" is mostly an architecture problem. If you use one model for everything, costs explode. If you route strategically — expensive models for expensive tasks, free models for grunt work — you can run a genuinely useful autonomous agent for less than your Netflix subscription.

I'm not saying this replaces hiring. But as a solo operator running 9 brands, having a tireless AI chief of staff that posts content, monitors markets, manages cross-promotion, syncs memory, and handles batch operations — all while I sleep — has been the single biggest leverage point in my operation.

The tools exist. The free tiers exist. The architecture isn't complicated. The only thing stopping most people is the assumption that AI automation has to be expensive.

It doesn't. I proved it for $5/month.

Build Your Own Autonomous AI Agent Stack

I'm building a step-by-step course that walks you through this entire setup. From zero to a working autonomous agent in a weekend. Get early access and founding member pricing.

Sign up for updates → Follow @OpsDeskAI →

Frequently Asked Questions

How much does it cost to run an autonomous AI agent 24/7?

With multi-provider model routing and aggressive use of free tiers (Google Gemini, Groq, Mistral), you can run a fully autonomous AI agent for $3-5/month. The key is reserving paid models like Anthropic Sonnet for interactive chat only, and routing all automated cron jobs and subagent tasks to free-tier providers.

What hardware do you need for a 24/7 AI agent?

A Mac Mini M4 Pro is ideal for low power consumption and local model fallback, but any machine that can stay online 24/7 works. A Linux NUC, old laptop, or even a cloud VPS can run the agent gateway. The AI inference itself happens via cloud APIs, so hardware requirements are minimal.

Can you use free AI APIs for production autonomous agents?

Yes. Google Gemini offers up to 1,500 requests/day on Flash models for free. Groq provides free Llama inference. Mistral offers free access to Mistral Small. For autonomous cron jobs that follow structured instructions, these free models are more than capable of production work.