26 May 2026 14 min read Python

The 3am Pager - A Scrappy LLM Cost Monitor with Python and ntfy.sh

You shipped an LLM feature last quarter. The demo worked, the stakeholders were happy, and the model output was good enough to put in front of users. Three months in, your bill is climbing, and your board is asking what your gross margin looks like by feature.

You open the provider dashboard. It shows you a monthly total. That's it.

But here is the uncomfortable question: do you actually know which of your features is profitable? Not "are we under budget this month", but which features, which users, and which prompt shapes are eating your margin - and would you know if one of them was looping at 3am, burning through your runway one token at a time?

If you can't answer that with a query, you don't have a monitoring system. You have a credit card statement and a hope.

This article is for anyone building an LLM product who wants to know what it actually costs - per feature, per user, per request - without paying for Datadog or wiring up OpenTelemetry. We'll walk through a Python wrapper (Anthropic in the example, but the same pattern works for any provider), a SQLite cost store, and a small watcher script that pages you via ntfy.sh before a bad night becomes a bad month.

All code in this article is available in the companion repository:

The 3am Story

Friday afternoon. You merge a small prompt change for your busiest feature - one extra sentence in the system prompt, asking the model to show its reasoning before answering. CI is green. Your eval suite passes. You ship it and close your laptop for the weekend.

The change does exactly what you asked. Every response now includes a reasoning section. Output tokens jump by 10×. Your bill, which usually accumulates at a few hundred dollars a night, starts accumulating at a few thousand. Saturday at 3am, your cost-per-request has tripled. No pager fires. No alert exists.

Monday morning. You open the provider dashboard to check usage and see the invoice line: $4,200 - roughly 12% of your monthly runway, spent while everyone slept. The dashboard shows you that single number and nothing else. It does not show you which feature spent it, which users triggered it, or whether one bad request shape was responsible for half the total. You post in #engineering: "did anything change this weekend?". Three replies, all variations of "I don't think so". You grep your logs. There are 380,000 log lines. None of them include token counts. Three coffees in, you have a number and no story.

This did not have to be a Monday-morning discovery. The change went out Friday afternoon. By Saturday at 3am, three log lines and one SQL query would have known. Not the bill - the system. A pager at 3am is bad. A pager at 3am means someone is looking. The actual horror is the silence: no pager, no email, no signal, just a bill that arrived after the damage was already done.

A bill is not an alert. It is a receipt.

Why Total Spend Lies to You

Your provider dashboard gives you one number: total spend. That number answers one question - "did we spend money?" - and no others. It cannot tell you which features are profitable, which users are burning through your compute, or how much your average request cost has grown since the last prompt change. You need all three to run a margin-positive LLM product.

The dashboard gives you none of them:

Per-feature cost. Two features at $3k/month each look identical on the bill - but one serves 50,000 paying users at $0.06 per use, and the other serves 200 free-tier users at $15 per use. One is your business. The other is a leak. Total spend hides which is which.
Per-user cost. In almost every LLM product I've seen, roughly 0.5% of users account for 40% of spend. Without per-user attribution, you cannot decide whether to rate-limit them, upsell them, or remove them. You just watch the number climb.
Per-request cost shape. Your average request has quietly grown from 1,200 tokens to 8,400 tokens since the last prompt change - the new system prompt is three pages long and gets prepended to every call. The bill does not show this until the end of the month, when the damage is already done.

Total spend is not a monitoring signal. It is a lagging confirmation that something already went wrong.

Here is what to monitor instead.

Three Alerts That Cover 95% of Disasters

You don't need ten alerts. You need three. Three signals cover the failure modes that will actually cost you money before anyone notices - not the long tail of edge cases, but the ones that show up in your bill at the end of the month and in your Slack on Monday morning.

Get these three working before you build anything else:

Per-user spend rate. A user finds a way to loop your agent - a recursive prompt, an autoplay feature nobody rate-limited, a script someone wrote to use your app as a cheap API. Without this alert, you find out at the end of the month. With it, you find out at the end of the hour. Fires when any single user spends more than $5 in 60 minutes. Does not catch slow spend distributed across many users - that is what alert 2 is for.
Per-feature daily spend. You shipped a prompt change on Friday. By Sunday it had tripled the token count on your busiest feature - a longer reasoning block, an extra sentence prepended to every call 50,000 times, a context window that quietly doubled. This is the alert that would have fired before your Monday morning Slack message. Fires when a feature's daily spend exceeds its rolling 7-day p95 by 50%. The p95 threshold needs at least 30 days of baseline data to be meaningful. Fall back to a $50/day hard ceiling for the first month. Does not catch gradual creep over weeks - that is what the SQL query in the next point is for.
Single-request cost ceiling. One request. $4. You found out from the bill. The cause is one of three things: a prompt-injection attempt that asked your model to write a 30,000-token essay, a tool-calling loop that did not terminate, or a context that grew unbounded across turns. Any of them trips this. Fires when any single request costs more than $0.50. Does not catch many cheap requests adding up - see alert 2.

Start with these thresholds and move them when you have a reason to:
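
As a minimal sketch, the three starting thresholds described above could be pinned down as constants with one predicate per alert. The constant and function names here are hypothetical (they are not from the companion code), and the p95 baseline is taken as an input rather than computed, since how it is derived depends on how much history you have.

```python
from typing import Optional

# Starting thresholds from the three alerts above (names are illustrative).
USER_HOURLY_LIMIT_USD = 5.00          # alert 1: one user, rolling 60 minutes
FEATURE_P95_MULTIPLIER = 1.5          # alert 2: 50% above the p95 baseline
FEATURE_FALLBACK_CEILING_USD = 50.00  # alert 2: hard ceiling until a baseline exists
REQUEST_CEILING_USD = 0.50            # alert 3: any single request


def user_alert(spend_last_hour_usd: float) -> bool:
    """Alert 1: a single user spent more than the hourly limit."""
    return spend_last_hour_usd > USER_HOURLY_LIMIT_USD


def feature_alert(spend_today_usd: float, baseline_p95_usd: Optional[float]) -> bool:
    """Alert 2: today's spend is 50% over the p95 baseline.

    Pass None while there is not enough history; the hard ceiling applies instead.
    """
    if baseline_p95_usd is None:
        return spend_today_usd > FEATURE_FALLBACK_CEILING_USD
    return spend_today_usd > baseline_p95_usd * FEATURE_P95_MULTIPLIER


def request_alert(cost_usd: float) -> bool:
    """Alert 3: one request cost more than the per-request ceiling."""
    return cost_usd > REQUEST_CEILING_USD
```

Keeping the numbers in one place makes it cheap to move them when you have a reason to, which is the whole point of starting with defaults.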

The point of an alert is not to be quiet. It is to be the first thing that knows.

The Wrapper

The wrapper is a single function that sits between your code and the Anthropic API. Your existing call to client.messages.create() becomes tracked_create(client, feature="...", user_id="...", ...). Every other argument passes through unchanged. The response object is identical. The only difference is a SQLite row written before the function returns.

The only third-party dependencies are the Anthropic SDK and requests (the wrapper itself only needs the SDK; requests is used by the watcher):

pip install anthropic requests

And the script:

# costs.py
import sqlite3, time, uuid
import anthropic

DB_PATH = "costs.db"

# Prices in USD per 1,000 tokens — see "The Pricing Table" for the full dict
# Verify current prices at: https://www.anthropic.com/pricing
PRICING = {
    ("anthropic", "claude-opus-4-7"):   {"input": 0.005,   "output": 0.025},
    ("anthropic", "claude-sonnet-4-6"): {"input": 0.003,   "output": 0.015},
    ("anthropic", "claude-haiku-4-5"):  {"input": 0.001, "output": 0.005},
}

def _init_db() -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS cost_events (
                id TEXT PRIMARY KEY, ts TEXT NOT NULL,
                provider TEXT NOT NULL, model TEXT NOT NULL,
                feature TEXT NOT NULL, user_id TEXT NOT NULL,
                prompt_tokens INTEGER NOT NULL, completion_tokens INTEGER NOT NULL,
                cost_usd REAL NOT NULL, latency_ms REAL NOT NULL
            )
        """)

_init_db()  # runs once on import — safe for SQLite, no migration script needed

def _cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = PRICING.get(("anthropic", model))
    if rates is None:
        raise ValueError(
            f"No pricing entry for anthropic/{model}. Add it to PRICING before deploying."
        )
    return (prompt_tokens / 1000) * rates["input"] + (completion_tokens / 1000) * rates["output"]

def tracked_create(client: anthropic.Anthropic, *, feature: str, user_id: str, **kwargs):
    """Drop-in for client.messages.create. Logs cost and latency to SQLite."""
    t0 = time.perf_counter()
    response = client.messages.create(**kwargs)
    latency_ms = round((time.perf_counter() - t0) * 1000, 1)

    model = kwargs["model"]
    cost = _cost(model, response.usage.input_tokens, response.usage.output_tokens)

    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO cost_events VALUES (?,?,?,?,?,?,?,?,?,?)",
            (str(uuid.uuid4()), time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
             "anthropic", model, feature, user_id,
             response.usage.input_tokens, response.usage.output_tokens, cost, latency_ms),
        )
    return response

This is the entire instrumentation layer: three functions in one file, depending only on the Anthropic SDK and sqlite3 from the standard library. The watcher adds requests for ntfy. The rows the wrapper writes are what the watcher, the SQL queries, and the alerts all read from.
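
To make "answer it with a query" concrete, here is a small sketch of per-feature and per-user attribution over the cost_events schema from costs.py. It runs against an in-memory database with a few made-up sample rows so it is self-contained; the spend_by helper is a hypothetical name, not part of the wrapper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cost_events (
        id TEXT PRIMARY KEY, ts TEXT NOT NULL,
        provider TEXT NOT NULL, model TEXT NOT NULL,
        feature TEXT NOT NULL, user_id TEXT NOT NULL,
        prompt_tokens INTEGER NOT NULL, completion_tokens INTEGER NOT NULL,
        cost_usd REAL NOT NULL, latency_ms REAL NOT NULL
    )
""")

# Made-up sample rows, in the same column order the wrapper inserts.
rows = [
    ("1", "2026-05-22T23:10:00Z", "anthropic", "claude-sonnet-4-6", "summarize", "u2", 1200, 300, 0.0081, 850.0),
    ("2", "2026-05-23T02:58:00Z", "anthropic", "claude-sonnet-4-6", "summarize", "u1", 1200, 300, 0.0081, 900.0),
    ("3", "2026-05-23T02:59:00Z", "anthropic", "claude-sonnet-4-6", "summarize", "u2", 1200, 300, 0.0081, 870.0),
    ("4", "2026-05-23T03:00:00Z", "anthropic", "claude-sonnet-4-6", "chat", "u1", 8400, 15000, 0.2502, 9100.0),
]
conn.executemany("INSERT INTO cost_events VALUES (?,?,?,?,?,?,?,?,?,?)", rows)


def spend_by(column: str, since: str) -> list:
    """Return (group, request_count, total_usd) rows, most expensive first."""
    if column not in {"feature", "user_id"}:  # whitelist: column names cannot be bound as parameters
        raise ValueError(f"Cannot group by {column!r}")
    sql = f"""
        SELECT {column}, COUNT(*), ROUND(SUM(cost_usd), 4)
        FROM cost_events
        WHERE ts >= ?
        GROUP BY {column}
        ORDER BY SUM(cost_usd) DESC
    """
    return conn.execute(sql, (since,)).fetchall()


since = "2026-05-23T00:00:00Z"  # ISO-8601 UTC strings sort chronologically
by_feature = spend_by("feature", since)
by_user = spend_by("user_id", since)
```

Because the wrapper stores timestamps as ISO-8601 UTC strings, a plain string comparison on ts is enough to window the query; the row from the previous evening is excluded.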

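Switching providers means changing a handful of lines: the import and client type, the client.messages.create() call, the response.usage.input_tokens and response.usage.output_tokens fields (use whatever your provider's SDK calls them), the hard-coded "anthropic" provider string in _cost() and the INSERT, and the PRICING dict keys. The pattern is identical.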
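
One way to keep the provider-specific part small is to isolate the token-count lookup in a single function. The sketch below is an assumption about how you might structure it, not part of costs.py: extract_usage is a hypothetical helper, and the OpenAI branch assumes the Chat Completions response shape, where usage exposes prompt_tokens and completion_tokens.

```python
from types import SimpleNamespace


def extract_usage(provider: str, response) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for a provider's response object."""
    usage = response.usage
    if provider == "anthropic":
        return usage.input_tokens, usage.output_tokens
    if provider == "openai":
        # Assumes a Chat Completions response; other endpoints may name these differently.
        return usage.prompt_tokens, usage.completion_tokens
    raise ValueError(f"No usage mapping for provider {provider!r}")


# Stand-in response objects so the sketch runs without network access.
fake_anthropic = SimpleNamespace(usage=SimpleNamespace(input_tokens=1200, output_tokens=300))
fake_openai = SimpleNamespace(usage=SimpleNamespace(prompt_tokens=900, completion_tokens=150))

anthropic_counts = extract_usage("anthropic", fake_anthropic)
openai_counts = extract_usage("openai", fake_openai)
```

Like the pricing lookup, it fails loudly on an unknown provider instead of logging zeros.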

The Pricing Table

Every LLM provider publishes a pricing page. None of them expose a pricing API. The source of truth is HTML, and it changes without notice when a model is updated, deprecated, or re-priced for a new tier. There is no authoritative feed to subscribe to, no webhook to catch the change. The moment you deploy, your table starts drifting from reality.

The pragmatic answer is not to fight this. It is to own it explicitly: one dict, one file, one dated comment, and a function that raises an error the moment you call a model that isn't in it.

# costs.py — PRICING dict
# Prices in USD per 1,000 tokens. Last verified: 2026-05-19
# Source: https://www.anthropic.com/pricing — check before each deploy
PRICING = {
    ("anthropic", "claude-opus-4-7"):   {"input": 0.005,   "output": 0.025},
    ("anthropic", "claude-sonnet-4-6"): {"input": 0.003,   "output": 0.015},
    ("anthropic", "claude-haiku-4-5"):  {"input": 0.001, "output": 0.005},
}

Adding a different provider is four lines - same shape, different key prefix:

# Extend the PRICING dict in place (source: https://openai.com/api/pricing):
PRICING.update({
    ("openai", "gpt-5.5"):      {"input": 0.005,   "output": 0.030},
    ("openai", "gpt-5.4-mini"): {"input": 0.00075, "output": 0.0045},
})

The fail-loud check - already in _cost() in costs.py - raises a ValueError the moment your code tries to log a cost for a model not in the table. You find out in CI or on first run, not at the end of the month when you notice the cost column is suspiciously round. A silent zero is worse than a missing alert.
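
A worked example may make both the arithmetic and the failure mode concrete. This is a standalone sketch that mirrors the logic of _cost() with a one-entry table; the cost function name and the unknown model string are illustrative only.

```python
# One-entry copy of the table, in USD per 1,000 tokens.
PRICING = {
    ("anthropic", "claude-sonnet-4-6"): {"input": 0.003, "output": 0.015},
}


def cost(provider: str, model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Price one request, or raise if the model has no pricing entry."""
    rates = PRICING.get((provider, model))
    if rates is None:
        raise ValueError(f"No pricing entry for {provider}/{model}. Add it to PRICING before deploying.")
    return (prompt_tokens / 1000) * rates["input"] + (completion_tokens / 1000) * rates["output"]


# 1,200 input tokens and 300 output tokens:
#   1.2 * 0.003 + 0.3 * 0.015 = 0.0036 + 0.0045 = 0.0081 USD
example_cost = cost("anthropic", "claude-sonnet-4-6", 1200, 300)

# An unpriced model fails immediately instead of logging a silent zero.
try:
    cost("anthropic", "some-unpriced-model", 1200, 300)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Note that in tracked_create the pricing lookup runs after the API call returns, so an unpriced model still costs you one request before the error surfaces. If that matters, the same lookup can be done once before the call.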

One caveat: this table covers the standard pay-per-token case. It does not cover prompt caching (Anthropic's cache read tokens cost roughly 10% of normal input price), fine-tuned model rates, or batch API discounts. Those all have separate rates and separate line items on your bill. Add them when they apply to your stack. Build the table you can maintain, not the one that is complete.
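
If prompt caching does apply to your stack, the extension can stay small. The sketch below is a rough illustration under stated assumptions: cost_with_cache is a hypothetical function, the 0.10 multiplier is the "roughly 10%" figure from the caveat above rather than a verified rate, cached and uncached input tokens are assumed to be reported as separate counts, and cache writes (which are billed at their own rate) are ignored.

```python
def cost_with_cache(
    rates: dict,
    input_tokens: int,
    cache_read_tokens: int,
    output_tokens: int,
    cache_read_multiplier: float = 0.10,  # assumption: cache reads at ~10% of input price
) -> float:
    """Price a request where part of the prompt was served from the cache."""
    uncached = (input_tokens / 1000) * rates["input"]
    cached = (cache_read_tokens / 1000) * rates["input"] * cache_read_multiplier
    output = (output_tokens / 1000) * rates["output"]
    return uncached + cached + output


sonnet = {"input": 0.003, "output": 0.015}  # USD per 1,000 tokens, from the table above

# 200 uncached input tokens, 8,000 tokens read from cache, 300 output tokens:
#   0.0006 + 0.0024 + 0.0045 = 0.0075 USD
with_cache = cost_with_cache(sonnet, 200, 8000, 300)

# The same 8,200 input tokens with no caching:
#   0.0246 + 0.0045 = 0.0291 USD
without_cache = cost_with_cache(sonnet, 8200, 0, 300)
```

If you use this, the multiplier belongs next to the dated pricing comment so it gets re-verified with the rest of the table.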

A pricing table you maintain is more useful than a pricing API you trust.

The Watcher

The watcher needs somewhere to send alerts. ntfy.sh is an open-source pub/sub notification service. You publish a message to a URL. Anyone subscribed to that same URL gets a push notification on their phone. No accounts. No API keys. No dashboard to configure. One HTTP POST per alert, and the free hosted version at ntfy.sh handles the rest.

To set it up, pick any string as your topic name (it becomes the URL path, as in https://ntfy.sh/your-topic), install the ntfy app on your phone, and subscribe to that same name. That takes three minutes. A PagerDuty setup does not.

Your topic name is effectively a bearer token. Anyone who knows it can subscribe to your alerts or publish to the topic. Keep it out of your repo; NTFY_TOPIC in watcher.py should come from an environment variable in production.
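
As a minimal sketch of the "one HTTP POST per alert" idea with the topic read from the environment: the function names below are hypothetical, and the example uses urllib from the standard library in place of requests so it runs without extra installs. Title and Priority are ntfy's standard publish headers.

```python
import os
import urllib.request

NTFY_BASE_URL = "https://ntfy.sh"


def build_alert(title: str, message: str, priority: str = "high") -> urllib.request.Request:
    """Build the POST request for one alert. The topic comes from NTFY_TOPIC."""
    topic = os.environ["NTFY_TOPIC"]  # raises KeyError if unset: fail loudly, never hard-code it
    return urllib.request.Request(
        f"{NTFY_BASE_URL}/{topic}",
        data=message.encode("utf-8"),  # the request body is the notification text
        headers={"Title": title, "Priority": priority},
        method="POST",
    )


def send_alert(title: str, message: str, priority: str = "high") -> int:
    """Publish the alert and return the HTTP status code."""
    with urllib.request.urlopen(build_alert(title, message, priority), timeout=10) as resp:
        return resp.status
```

Splitting the request construction from the network call keeps the part that touches the secret testable without sending anything.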