The 3am Pager - A Scrappy LLM Cost Monitor with Python and ntfy.sh

The 3am Pager - A Scrappy LLM Cost Monitor with Python and ntfy.sh
Image by Kevin Schneider from Pixabay

You shipped an LLM feature last quarter. The demo worked, the stakeholders were happy, and the model output was good enough to put in front of users. Three months in, your bill is climbing, and your board is asking what your gross margin looks like by feature.

You open the provider dashboard. It shows you a monthly total. That's it.

But here is the uncomfortable question: do you actually know which of your features is profitable? Not "are we under budget this month", but which features, which users, and which prompt shapes are eating your margin - and would you know if one of them was looping at 3am, burning through your runway one token at a time?

If you can't answer that with a query, you don't have a monitoring system. You have a credit card statement and a hope.

This article is for the ones building an LLM product and want to know what it actually costs you - per feature, per user, per request - without paying for Datadog or wiring up OpenTelemetry. We'll walk through a Python wrapper (Anthropic in the example, the same pattern for any provider), a SQLite cost store, and a small watcher script that pages you via ntfy.sh before a bad night becomes a bad month.

All code in this article is available in the companion repository:

GitHub - nunombispo/llm-costs
Contribute to nunombispo/llm-costs development by creating an account on GitHub.

The 3am Story

Friday afternoon. You merge a small prompt change for your busiest feature - one extra sentence in the system prompt, asking the model to show its reasoning before answering. CI is green. Your eval suite passes. You ship it and close your laptop for the weekend.

The change does exactly what you asked. Every response now includes a reasoning section. Output tokens jump by 10×. Your bill, which usually accumulates at a few hundred dollars a night, starts accumulating at a few thousand. Saturday at 3am, your cost-per-request has tripled. No pager fires. No alert exists.

Monday morning. You open the provider Dashboard to check usage and notice the provider invoice line: $4,200 - roughly 12% of your monthly runway, spent while everyone slept. It shows you a single number: $4,200. It does not show you which feature spent it, which users triggered it, or whether one bad request shape was responsible for half the total. You post in #engineering: "did anything change this weekend?". Three replies, all variations of "I don't think so". You grep your logs. There are 380,000 log lines. None of them include token counts. Three coffees in, you have a number and no story.

This did not have to be a Monday-morning discovery. The change went out Friday afternoon. By Saturday at 3am, three log lines and one SQL query would have known. Not the bill - the system. A pager at 3am is bad. A pager at 3am means someone is looking. The actual horror is the silence: no pager, no email, no signal, just a bill that arrived after the damage was already done.

A bill is not an alert. It is a receipt.


Why Total Spend Lies to You

Your provider dashboard gives you one number: total spend. That number answers one question - "did we spend money?" - and no others. It cannot tell you which features are profitable, which users are burning through your compute, or how much your average request cost has grown since the last prompt change. You need all three to run a margin-positive LLM product.

The dashboard gives you none of them:

  • Per-feature cost. Two features at $3k/month each look identical on the bill - but one serves 50,000 paying users at $0.06 per use, and the other serves 200 free-tier users at $15 per use. One is your business. The other is a leak. Total spend hides which is which.
  • Per-user cost. In almost every LLM product I've seen, roughly 0.5% of users account for 40% of spend. Without per-user attribution, you cannot decide whether to rate-limit them, upsell them, or remove them. You just watch the number climb.
  • Per-request cost shape. Your average request has quietly grown from 1,200 tokens to 8,400 tokens since the last prompt change - the new system prompt is three pages long and gets prepended to every call. The bill does not show this until the end of the month, when the damage is already done.

Total spend is not a monitoring signal. It is a lagging confirmation that something already went wrong.

Here is what to monitor instead.


Three Alerts That Cover 95% of Disasters

You don't need ten alerts. You need three. Three signals cover the failure modes that will actually cost you money before anyone notices - not the long tail of edge cases, but the ones that show up in your bill at the end of the month and in your Slack on Monday morning.

Get these three working before you build anything else:

  1. Per-user spend rate. A user finds a way to loop your agent - a recursive prompt, an autoplay feature nobody rate-limited, a script someone wrote to use your app as a cheap API. Without this alert, you find out at the end of the month. With it, you find out at the end of the hour. Fires when any single user spends more than $5 in 60 minutes. Does not catch slow spend distributed across many users - that is what alert 2 is for.
  2. Per-feature daily spend. You shipped a prompt change on Friday. By Sunday it had tripled the token count on your busiest feature - a longer reasoning block, an extra sentence prepended to every call 50,000 times, a context window that quietly doubled. This is the alert that would have fired before your Monday morning Slack message. Fires when a feature's daily spend exceeds its rolling 7-day p95 by 50%. The p95 threshold needs at least 30 days of baseline data to be meaningful. Fall back to a $50/day hard ceiling for the first month. Does not catch gradual creep over weeks - that is what the SQL query in the next point is for.
  3. Single-request cost ceiling. One request. $4. You found out from the bill. The cause is one of three things: a prompt-injection attempt that asked your model to write a 30,000-token essay, a tool-calling loop that did not terminate, or a context that grew unbounded across turns. Any of them trips this. Fires when any single request costs more than $0.50. Does not catch many cheap requests adding up - see alert 2.

Start with these thresholds and move them when you have a reason to:

The point of an alert is not to be quiet. It is to be the first thing that knows.


The Wrapper

The wrapper is a single function that sits between your code and the Anthropic API. Your existing call to client.messages.create() becomes tracked_create(client, feature="...", user_id="...", ...). Every other argument passes through unchanged. The response object is identical. The only difference is a SQLite row written before the function returns.

The only dependencies are the Anthropic SDK and requests:

pip install anthropic requests

And the script:

# costs.py
import sqlite3, time, uuid
import anthropic

DB_PATH = "costs.db"

# Prices in USD per 1,000 tokens — see "The Pricing Table" for the full dict
# Verify current prices at: https://www.anthropic.com/pricing
PRICING = {
    ("anthropic", "claude-opus-4-7"):   {"input": 0.005,   "output": 0.025},
    ("anthropic", "claude-sonnet-4-6"): {"input": 0.003,   "output": 0.015},
    ("anthropic", "claude-haiku-4-5"):  {"input": 0.001, "output": 0.005},
}

def _init_db() -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS cost_events (
                id TEXT PRIMARY KEY, ts TEXT NOT NULL,
                provider TEXT NOT NULL, model TEXT NOT NULL,
                feature TEXT NOT NULL, user_id TEXT NOT NULL,
                prompt_tokens INTEGER NOT NULL, completion_tokens INTEGER NOT NULL,
                cost_usd REAL NOT NULL, latency_ms REAL NOT NULL
            )
        """)

_init_db()  # runs once on import — safe for SQLite, no migration script needed

def _cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = PRICING.get(("anthropic", model))
    if rates is None:
        raise ValueError(
            f"No pricing entry for anthropic/{model}. Add it to PRICING before deploying."
        )
    return (prompt_tokens / 1000) * rates["input"] + (completion_tokens / 1000) * rates["output"]

def tracked_create(client: anthropic.Anthropic, *, feature: str, user_id: str, **kwargs):
    """Drop-in for client.messages.create. Logs cost and latency to SQLite."""
    t0 = time.perf_counter()
    response = client.messages.create(**kwargs)
    latency_ms = round((time.perf_counter() - t0) * 1000, 1)

    model = kwargs["model"]
    cost = _cost(model, response.usage.input_tokens, response.usage.output_tokens)

    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO cost_events VALUES (?,?,?,?,?,?,?,?,?,?)",
            (str(uuid.uuid4()), time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
             "anthropic", model, feature, user_id,
             response.usage.input_tokens, response.usage.output_tokens, cost, latency_ms),
        )
    return response

This is the entire instrumentation layer. Three functions, one file - Anthropic SDK and SQLite3 (stdlib). The watcher adds requests for ntfy. The rows it writes are what the watcher, the SQL queries, and the alerts all read from.

Switching providers means changing three lines: the import, response.usage.input_tokens (and output tokens) to whatever your provider's SDK calls input tokens, and the PRICING dict keys. The pattern is identical.


The Pricing Table

Every LLM provider publishes a pricing page. None of them expose a pricing API. The source of truth is HTML, and it changes without notice when a model is updated, deprecated, or re-priced for a new tier. There is no authoritative feed to subscribe to, no webhook to catch the change. The moment you deploy, your table starts drifting from reality.

The pragmatic answer is not to fight this. It is to own it explicitly: one dict, one file, one dated comment, and a function that throws an error the moment you call a model that isn't in it.

# costs.py — PRICING dict
# Prices in USD per 1,000 tokens. Last verified: 2026-05-19
# Source: https://www.anthropic.com/pricing — check before each deploy
PRICING = {
    ("anthropic", "claude-opus-4-7"):   {"input": 0.005,   "output": 0.025},
    ("anthropic", "claude-sonnet-4-6"): {"input": 0.003,   "output": 0.015},
    ("anthropic", "claude-haiku-4-5"):  {"input": 0.001, "output": 0.005},
}

Adding a different provider is four lines - same shape, different key prefix:

# Add inside your PRICING dict (source: https://openai.com/api/pricing):
PRICING.update({
    ("openai", "gpt-5.5"):     {"input": 0.005,  "output": 0.030},
    ("openai", "gpt-5.4-mini"):{"input": 0.00075, "output": 0.0045},
})

The fail-loud check - already in _cost() in costs.py - raises a ValueError the moment your code tries to log a cost for a model not in the table. You find out in CI or on first run, not at the end of the month when you notice the cost column is suspiciously round. A silent zero is worse than a missing alert.

One caveat: this table covers the standard pay-per-token case. It does not cover prompt caching (Anthropic's cache read tokens cost roughly 10% of normal input price), fine-tuned model rates, or batch API discounts. Those all have separate rates and separate line items on your bill. Add them when they apply to your stack. Build the table you can maintain, not the one that is complete.

A pricing table you maintain is more useful than a pricing API you trust.


The Watcher

The watcher needs somewhere to send alerts. ntfy.sh is an open-source pub/sub notification service. You publish a message to a URL. Anyone subscribed to that same URL gets a push notification on their phone. No accounts. No API keys. No dashboard to configure. One HTTP POST per alert, and the free hosted version at ntfy.sh handles the rest.

To set it up: pick any string as your topic name - it becomes the URL path (https://ntfy.sh/your-topic), install the ntfy app on your phone, and subscribe to that same name. Three minutes. PagerDuty does not.

Your topic name is effectively a bearer token. Anyone who knows it can subscribe. Keep it out of your repo; NTFY_TOPIC in watcher.py should come from an environment variable in production.

# watcher.py
import sqlite3, time, os, requests
from costs import DB_PATH

NTFY_TOPIC = os.getenv("NTFY_TOPIC", "https://ntfy.sh/your-secret-topic-name")  # set NTFY_TOPIC env var in production
POLL_INTERVAL = 60  # seconds

# Adding a new alert: add one dict to this list. Nothing else changes.
ALERTS = [
    {
        "name":    "per_user_hourly",
        "query":   """
            SELECT user_id AS entity, ROUND(SUM(cost_usd), 4) AS value
            FROM cost_events WHERE ts >= datetime('now', '-1 hour')
            GROUP BY user_id HAVING value > 5.0
        """,
        "message": lambda r: f"User {r['entity']} burned ${r['value']:.2f} in the last hour",
    },
    {
        "name":    "per_feature_daily",
        "query":   """
            SELECT feature AS entity, ROUND(SUM(cost_usd), 4) AS value
            FROM cost_events WHERE ts >= datetime('now', '-1 day')
            GROUP BY feature HAVING value > 50.0
        """,
        "message": lambda r: f"Feature {r['entity']} spent ${r['value']:.2f} today - check for prompt regressions",
    },
    {
        "name":    "per_request_ceiling",
        "query":   """
            SELECT id AS entity, ROUND(cost_usd, 4) AS value, feature
            FROM cost_events WHERE cost_usd > 0.50 AND ts >= datetime('now', '-1 hour')
        """,
        "message": lambda r: f"Single request ${r['value']:.4f} on {r['feature']}",
    },
]

def _init_db() -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS alerts_sent (
                alert_type TEXT NOT NULL, entity_id TEXT NOT NULL, sent_at TEXT NOT NULL
            )
        """)

_init_db()

def _already_sent(conn, alert_type: str) -> bool:
    return conn.execute(
        "SELECT 1 FROM alerts_sent WHERE alert_type=? AND entity_id=? "
        "AND sent_at >= datetime('now', '-1 hour')",
        (alert_type),
    ).fetchone() is not None

def _notify(title: str, message: str) -> None:
    requests.post(NTFY_TOPIC, data=message.encode(), headers={"Title": title}, timeout=5)

def run_alerts() -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row
        for alert in ALERTS:
            for row in conn.execute(alert["query"]).fetchall():
                entity_id = str(row["entity"])
                if _already_sent(conn, alert["name"]):
                    continue
                _notify(alert["name"], alert["message"](row))
                conn.execute(
                    "INSERT INTO alerts_sent VALUES (?,datetime('now'))",
                    (alert["name"]),
                )

if __name__ == "__main__":
    print(f"Watching {DB_PATH}. Polling every {POLL_INTERVAL}s.")
    while True:
        run_alerts()
        time.sleep(POLL_INTERVAL)

The ALERTS list is the key design decision. Each alert is a name, a SQL query, and a message template. Adding a fourth alert is five lines. The _already_sent check deduplicates on a one-hour window per alert.

The per-feature query uses the $50/day hard-ceiling fallback from the Three Alerts section, not the full p95 version. Once you have the history, the companion repo has the full version.

If the watcher dies, systemd with Restart=always or a Docker container with restart: always brings it back. That is enough.

Example of the alerts:

Alerts on ntfy.sh

Two scripts. One database. Your phone rings before your bill does.


The One SQL Query You Actually Need

Your cost_events table already knows which feature is bleeding margin. You don't need a dashboard to ask it.

SELECT
  feature,
  COUNT(*)                                                           AS calls,
  ROUND(SUM(cost_usd), 2)                                           AS total_usd,
  ROUND(AVG(cost_usd), 4)                                           AS avg_call_usd,
  ROUND(SUM(prompt_tokens + completion_tokens) * 1.0 / COUNT(*), 0) AS avg_tokens
FROM cost_events
WHERE ts >= datetime('now', '-7 days')
GROUP BY feature
ORDER BY total_usd DESC;

Run this in your terminal with sqlite3 costs.db. It tells you which feature is spending the most, how many calls it made, what each call costs on average, and how many tokens each call consumes on average. Five columns. One screen. The thing your board is actually asking about.

Example output:

============================================================
Per-feature (7 days)
============================================================
feature | calls | total_usd | avg_call_usd | avg_tokens
------------------------------------------------------------
agent_loop | 1 | 0.82 | 0.82 | 128000.0

What it does not tell you: per-user spend, how cost trended day-by-day, or cost-per-revenue ratio. Per-user spend and day-by-day trending are one-line variants of the same query. The third requires a join to a revenue table you don't have yet - and when you do, the structure is the same.

Per-user spend this week:

SELECT user_id, COUNT(*) AS calls, ROUND(SUM(cost_usd), 2) AS total_usd
FROM cost_events
WHERE ts >= datetime('now', '-7 days')
GROUP BY user_id ORDER BY total_usd DESC LIMIT 20;

Example output:

============================================================
Top users (7 days)
============================================================
user_id | calls | total_usd
------------------------------------------------------------
user_alice | 1 | 0.82

Daily cost by feature - the trend that shows you when a regression happened:

SELECT date(ts) AS day, feature, ROUND(SUM(cost_usd), 2) AS daily_usd
FROM cost_events
WHERE ts >= datetime('now', '-30 days')
GROUP BY date(ts), feature
ORDER BY day DESC, daily_usd DESC;

Example output:

============================================================
Daily by feature (30 days)
============================================================
day | feature | daily_usd
------------------------------------------------------------
2026-05-20 | agent_loop | 0.82

A dashboard is a SQL query someone else got paid to render.


When This Stops Being Enough

This article is not an argument against Datadog. It is an argument against Datadog on day one, before you have a problem Datadog solves. The watcher in this article is what you run until one of the four conditions below becomes true. Most teams never cross any of them.

This setup has four limits. When you hit one, you will know it - because the problem will be specific, not vague.

  1. You sustain more than ~100 LLM calls/second. This is the easiest limit to hit and the easiest to fix. SQLite concurrent writes become the bottleneck first. Move cost_events to Postgres. The wrapper, the watcher, and the SQL queries are unchanged - only the connection string swaps.
  2. You need cross-region or cross-cloud aggregation. A single SQLite file on a single box does not survive multi-region deployments. At this point you want OpenTelemetry semantic conventions and a real metrics backend - Prometheus + Grafana, Datadog, or whatever your platform team already runs.
  3. You are a multi-tenant SaaS at compliance scale. Per-tenant cost attribution for billing is now a finance system, not a monitoring system. It has different reliability requirements, different audit trails, and a different team owning it. This article's setup is not the right foundation for that problem.
  4. You need to alert on token-distribution shifts, not just dollar thresholds. When you start caring about p99 latency by model, prompt-injection signals in token shapes, or model degradation patterns, you are past what a 60-second polling watcher can detect reliably.

Graduate to the heavier stack when the heavier stack is solving a problem you actually have. Not before.


Conclusion: Before the 3am Page

Your LLM bill tells you one thing: how much you spent. It does not tell you which feature spent it, which users drove it, or whether a prompt change you shipped on Friday quietly tripled your cost-per-request over the weekend. The signal you actually need - per-feature, per-user, per-request - is already in your calls. You just have not been writing it down.

Two scripts and a SQLite file change that. tracked_create() writes a row every time your code calls the Anthropic API. The watcher reads those rows every 60 seconds and sends a push notification when any of the three thresholds break. The SQL query tells you which feature is bleeding margin. The whole thing runs on the same box as your application and costs nothing to operate.

Without it, you have a credit card statement and a hope.

Wire it up before the 3am page.

All code in this article is available in the companion GitHub repository:

GitHub - nunombispo/llm-costs
Contribute to nunombispo/llm-costs development by creating an account on GitHub.

The repository includes the wrapper, the watcher, the pricing table, and the SQL queries discussed above.


Follow me on Twitter: https://twitter.com/DevAsService

Follow me on Instagram: https://www.instagram.com/devasservice/

Follow me on TikTok: https://www.tiktok.com/@devasservice

Follow me on YouTube: https://www.youtube.com/@DevAsService

Nuno Bispo

Nuno Bispo

Solutions Architect · Senior Python & AI Engineer · AI Audits · Helping teams fix what they shipped too fast
Netherlands