Storyie has run its own Real User Monitoring pipeline since the early days of the product. No Datadog, no SpeedCurve — just three things we already had: Next.js's built-in useReportWebVitals, Supabase for storage, and an SST Lambda Cron for daily aggregation. This post covers the design, the reasoning behind each decision, and what running it in production actually taught us.
TL;DR
- useReportWebVitals captures LCP, FCP, CLS, and INP in the browser with zero custom instrumentation.
- Samples are batched and sent via sendBeacon to a single API route, then stored in Supabase.
- A daily Lambda Cron computes p75 per page, flags SLA breaches, and fires a Slack alert if two consecutive days breach threshold.
- Total infrastructure cost: $0/month on current traffic.
| Layer | Technology | Responsibility |
|---|---|---|
| Collection | useReportWebVitals + sendBeacon | Capture and batch-deliver RUM samples from the browser |
| Storage | Supabase PostgreSQL (3 tables) | Raw events (30-day retention), daily aggregates, alert log |
| Aggregation | SST Lambda Cron (daily, UTC midnight) | p75 computation, breach detection, Slack notification |
| Access control | Supabase RLS | Raw events restricted to service role; aggregates readable |
Why we built it ourselves
The cost math doesn't work at small scale
Commercial RUM tools are priced for engineering organizations with dedicated observability budgets. At Storyie's traffic volume, we'd be paying for a product where we'd use about 10% of the feature set. The counter-argument — "just pay it and focus on building" — is reasonable, but it falls apart when the free tier already solves the problem.
Next.js ships the hard part
useReportWebVitals handles everything we'd otherwise have to write: PerformanceObserver wiring, element timing correlation, and long-task attribution. We get LCP, FCP, CLS, and INP out of the box. The remaining work is transport and storage, both commodity problems.
Data ownership matters
When your RUM data lives in a third-party product, the aggregation window and retention period are theirs to define. With Supabase, we can run any query we want: LCP trend by page over the past three months, CLS distribution on mobile vs. desktop, the specific image URL that was the LCP element on a given day. Ad hoc analysis is one SQL query away.
Overall architecture
[Browser]
↓ useReportWebVitals → batched via sendBeacon
[Next.js API Route] /api/performance/vitals
↓ writes via Supabase service role
[Supabase PostgreSQL]
├── performance_events (raw RUM samples, 30-day TTL)
├── performance_daily (p75 aggregates, kept forever)
└── performance_alerts (alert history, kept forever)
↑ aggregated nightly by Lambda
[SST Lambda Cron]
↓ 2 consecutive breach days → Slack notification
[Slack Webhook]

Collection layer
The client component
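// WebVitalsReporter.tsx (simplified)
"use client";
import { useCallback, useEffect, useRef } from "react";
import { useReportWebVitals } from "next/web-vitals";

// VitalSample and the getNavigationType / getDeviceClass / getSessionHash
// helpers are defined elsewhere and omitted from this simplified listing.
const BATCH_SIZE = 10;
const FLUSH_INTERVAL_MS = 5000;

export function WebVitalsReporter() {
  const batchRef = useRef<VitalSample[]>([]);

  const flushBatch = useCallback(() => {
    if (batchRef.current.length === 0) return;
    const samples = [...batchRef.current];
    batchRef.current = [];
    const payload = JSON.stringify({ samples });
    // sendBeacon survives page navigation; fetch with keepalive as fallback
    if (navigator.sendBeacon) {
      const blob = new Blob([payload], { type: "application/json" });
      navigator.sendBeacon("/api/performance/vitals", blob);
    } else {
      fetch("/api/performance/vitals", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: payload,
        keepalive: true,
      }).catch(() => {}); // silent failure by design
    }
  }, []);

  // Fallback flush every FLUSH_INTERVAL_MS (the 5-second flush described below).
  // Flushing on visibilitychange is an assumed addition so that the last
  // partial batch is sent before the page unloads.
  useEffect(() => {
    const timer = setInterval(flushBatch, FLUSH_INTERVAL_MS);
    const onHidden = () => {
      if (document.visibilityState === "hidden") flushBatch();
    };
    document.addEventListener("visibilitychange", onHidden);
    return () => {
      clearInterval(timer);
      document.removeEventListener("visibilitychange", onHidden);
    };
  }, [flushBatch]);

  useReportWebVitals((metric) => {
    // Only the four metrics we aggregate; everything else is ignored
    if (!["LCP", "FCP", "CLS", "INP"].includes(metric.name)) return;
    batchRef.current.push({
      path: window.location.pathname,
      metric: metric.name.toLowerCase(),
      value: metric.value,
      navigationType: getNavigationType(),
      deviceClass: getDeviceClass(),
      collectedAt: new Date().toISOString(),
      sessionHash: getSessionHash(),
    });
    if (batchRef.current.length >= BATCH_SIZE) flushBatch();
  });

  return null;
}

Three design choices worth explaining: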
sendBeacon first. LCP is often finalized at the moment the user navigates away. A regular fetch gets cancelled on unload; sendBeacon queues the request in the browser's background transport and completes it regardless. The keepalive: true fallback gives fetch similar unload-survival behavior in browsers where sendBeacon isn't available.
Batching over one-per-metric requests. Four metrics times the number of page transitions adds up quickly. A batch of 10 with a 5-second fallback flush keeps API call count low without meaningfully delaying delivery.
Silent failure throughout. If the API is down or the network drops the request, nothing happens from the user's perspective. A missing sample has no product impact, and monitoring infrastructure should never degrade the thing it's monitoring.
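On the receiving side, the API route has to accept whatever the browser sends without trusting it. The following is a minimal sketch of the validation step only, not our actual route handler: the `parseSamples` name, the reduced `VitalSample` shape, and the `MAX_SAMPLES_PER_REQUEST` cap are assumptions made for illustration. It follows the same silent-failure principle: malformed entries are dropped rather than rejected with an error.

```typescript
// Hypothetical payload shape, mirroring a subset of the fields the client pushes.
type VitalSample = {
  path: string;
  metric: "lcp" | "fcp" | "cls" | "inp";
  value: number;
  collectedAt: string;
  sessionHash: string;
};

const ALLOWED_METRICS = new Set(["lcp", "fcp", "cls", "inp"]);
const MAX_SAMPLES_PER_REQUEST = 50; // assumed cap, not a value from the post

// Returns only the well-formed samples; malformed entries are dropped silently.
function parseSamples(body: unknown): VitalSample[] {
  if (typeof body !== "object" || body === null) return [];
  const samples = (body as { samples?: unknown }).samples;
  if (!Array.isArray(samples)) return [];

  const valid: VitalSample[] = [];
  for (const raw of samples.slice(0, MAX_SAMPLES_PER_REQUEST)) {
    if (typeof raw !== "object" || raw === null) continue;
    const s = raw as Record<string, unknown>;
    if (typeof s.path !== "string" || !s.path.startsWith("/")) continue;
    if (typeof s.metric !== "string" || !ALLOWED_METRICS.has(s.metric)) continue;
    if (typeof s.value !== "number" || !Number.isFinite(s.value) || s.value < 0) continue;
    if (typeof s.collectedAt !== "string" || Number.isNaN(Date.parse(s.collectedAt))) continue;
    if (typeof s.sessionHash !== "string" || s.sessionHash.length === 0) continue;
    valid.push({
      path: s.path,
      metric: s.metric as VitalSample["metric"],
      value: s.value,
      collectedAt: s.collectedAt,
      sessionHash: s.sessionHash,
    });
  }
  return valid;
}
```

The route would then hand the returned array to the Supabase insert and respond with an empty success status either way, since sendBeacon never reads the response.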
Database schema
Page tiers
CREATE TYPE page_performance_tier AS ENUM ('tier_1', 'tier_2');

Not all pages deserve equal attention. High-traffic pages (home, Explore, blog index) are tier 1: a regression there affects a large fraction of users. Low-traffic pages that rarely appear in search are tier 2. Tiering the pages means our alerting focuses signal on what matters and avoids noise from pages with tiny sample sizes.
Three-table layout
| Table | Purpose | Retention |
|---|---|---|
| performance_events | Raw RUM samples | 30 days |
| performance_daily | Per-page p75 aggregates | Indefinite |
| performance_alerts | Alert open/resolved history | Indefinite |
Raw events are ephemeral: once they've been aggregated, the individual samples have little ongoing value. The daily aggregates and alert log are what you reach for in a post-incident review or a trend analysis.
RLS policies
// Raw events: service role write only, no client-side reads
insertPolicy: pgPolicy("events_insert", {
  for: "insert",
  withCheck: sql`(select auth.role()) = 'service_role'`,
}),

// Daily aggregates: authenticated users can read (for the internal dashboard)
selectPolicy: pgPolicy("daily_select", {
  for: "select",
  using: sql`(select auth.role()) = 'authenticated'
    OR (select auth.role()) = 'service_role'`,
}),

Supabase RLS keeps raw event data off-limits to anything other than the service role. Aggregates are readable by authenticated users so we can build an internal dashboard without a separate backend layer.
Aggregation and alerting
What the Lambda Cron does each night
- Fetch yesterday's raw samples from performance_events.
- Compute p75 for LCP, CLS, and INP per page per day.
- Evaluate thresholds: LCP p75 > 1000ms → breach; fewer than 10 samples → insufficient_samples.
- Check for consecutive breaches: two consecutive breach days on a tier-1 page → Slack notification.
- Auto-resolve alerts: pages that return to pass close any open alert automatically.
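The p75 and threshold steps can be sketched as follows. The 1000ms threshold and the 10-sample minimum come from the list above; the nearest-rank percentile method and the function names are assumptions for illustration, and the real Lambda may compute the percentile differently (for example in SQL).

```typescript
type DailyStatus = "pass" | "breach" | "insufficient_samples";

const LCP_P75_THRESHOLD_MS = 1000; // threshold stated in the post
const MIN_SAMPLES = 10;            // minimum sample count stated in the post

// Nearest-rank percentile: the smallest value such that at least a
// fraction p of the samples are at or below it. (Assumed method.)
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty array");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Classify one page's LCP samples for one day.
function evaluateLcp(values: number[]): { p75: number | null; status: DailyStatus } {
  if (values.length < MIN_SAMPLES) {
    return { p75: null, status: "insufficient_samples" };
  }
  const p75 = percentile(values, 0.75);
  return { p75, status: p75 > LCP_P75_THRESHOLD_MS ? "breach" : "pass" };
}
```

Checking the sample count before computing the percentile keeps low-traffic pages from producing a p75 that is really just one or two outliers.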
Why 1000ms for LCP, not Google's 2500ms
Google's "Good" band covers global traffic on all connection types. Our public pages are SSG or ISR served from a CDN. For that delivery model, a p75 LCP above 1000ms means something is actively wrong — an image wasn't optimized, a render-blocking font slipped in, or there's a CDN issue. The threshold is ours to set, and we set it tighter than the global baseline because our delivery model justifies it.
Why two consecutive breach days
A single bad day is usually noise: a CDN hiccup, an unusual geographic cohort, a brief spike in low-bandwidth traffic. Two consecutive days indicates structural degradation — the kind caused by a bad deploy or a newly introduced unoptimized asset. Three days was too slow; we want an alert by day two. One day produced too many false positives.
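A minimal sketch of the two-day rule and the auto-resolve behavior, assuming the per-day statuses for a page are available in chronological order. The `decideAlertAction` name and the treatment of an insufficient_samples day (it breaks the streak and neither opens nor resolves an alert) are assumptions, not confirmed details of the Lambda.

```typescript
type DailyStatus = "pass" | "breach" | "insufficient_samples";
type AlertAction = "open" | "resolve" | "none";

// statuses: one entry per day for a single tier-1 page, most recent day last.
// hasOpenAlert: whether an alert is currently open for that page.
function decideAlertAction(statuses: DailyStatus[], hasOpenAlert: boolean): AlertAction {
  const today = statuses[statuses.length - 1];
  const yesterday = statuses[statuses.length - 2];

  // A passing day closes any open alert automatically.
  if (today === "pass") return hasOpenAlert ? "resolve" : "none";

  // Two consecutive breach days open an alert, unless one is already open.
  if (today === "breach" && yesterday === "breach" && !hasOpenAlert) return "open";

  return "none";
}
```

Skipping the notification when an alert is already open keeps a multi-day regression from sending one Slack message per day.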
What production taught us
Attribution data is the most useful column in the table
// Capture which element drove the LCP measurement
const attribution: { element?: string; url?: string } = {};
if (metric.name === "LCP" && metric.attribution && "element" in metric.attribution) {
  const lcpAttr = metric.attribution as { element?: string; url?: string };
  attribution.element = lcpAttr.element;
  attribution.url = lcpAttr.url; // usually an image URL
}

When an LCP alert fires, the first question is always "which element?" Without attribution, you're back to guessing. With it, you look at the URL column and immediately know whether you're dealing with an unoptimized hero image, a slow web font, or something else entirely. This single field cuts investigation time significantly.
Session deduplication matters more than expected
Without it, a user who reloads a page five times contributes five samples to the p75 calculation. Heavy users (often developers and power users) reload more than casual ones, and repeat loads tend to hit warm caches, which pulls p75 downward and masks real regressions affecting normal traffic. We store a hash of UA + session ID (no raw identifiers) and deduplicate by it during aggregation.
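As a rough sketch of that deduplication step: the version below keeps one sample per session, page, and metric. Which sample survives is an assumption (here, the first one seen), and the `RawSample` shape and `dedupeBySession` name are illustrative.

```typescript
type RawSample = { path: string; metric: string; value: number; sessionHash: string };

// Keep one sample per (sessionHash, path, metric). Keeping the first sample
// seen is an assumption; it corresponds to the session's initial load when
// the input is ordered by collection time.
function dedupeBySession(samples: RawSample[]): RawSample[] {
  const seen = new Set<string>();
  const result: RawSample[] = [];
  for (const s of samples) {
    const key = `${s.sessionHash}|${s.path}|${s.metric}`;
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(s);
  }
  return result;
}
```

Running this before the percentile calculation means five reloads from one session count once, the same as a single visit from a casual reader.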
A dry-run flag is not optional
The aggregation Lambda runs against production data. Before we wired it to Supabase in prod, we added a --dry-run flag that runs the full calculation pipeline and prints what it would write without touching any tables. We use it every time we change the aggregation logic. It made the iteration cycle from local to prod deploy fast and safe.
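One way to structure a dry-run switch is to inject the write function, so the calculation pipeline is identical in both modes and only the final step changes. This is a sketch under that assumption; `makeWriter`, `DailyRow`, and the log format are hypothetical names, and how the flag itself is read (CLI argument, environment variable) is left out.

```typescript
type DailyRow = { path: string; metric: string; p75: number; status: string };
type RowWriter = (rows: DailyRow[]) => Promise<void>;

// In dry-run mode the returned writer logs each row instead of calling
// the real write function; otherwise the real writer is returned unchanged.
function makeWriter(
  dryRun: boolean,
  realWrite: RowWriter,
  log: (line: string) => void = console.log,
): RowWriter {
  if (!dryRun) return realWrite;
  return async (rows) => {
    for (const row of rows) {
      log(`[dry-run] would upsert ${JSON.stringify(row)}`);
    }
  };
}
```

Because the aggregation code only ever sees a `RowWriter`, there is no second code path that could drift from what production actually runs.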
RUM and Lighthouse CI answer different questions
We also push Lighthouse CI results into the same performance_daily table. The combination is more useful than either alone: RUM shows what real users experienced, Lighthouse shows what an ideal simulated user would experience. When both degrade together, the cause is in the code or assets. When RUM is stable but Lighthouse degrades, it's probably an artifact of the simulated environment. When RUM degrades but Lighthouse is flat, look at CDN or network conditions.
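The triage logic in that paragraph is small enough to write down as a lookup. This is only an illustration of the reasoning, with hypothetical names; we do not claim the pipeline automates this decision.

```typescript
type Triage =
  | "code_or_assets"
  | "simulated_environment"
  | "cdn_or_network"
  | "no_regression";

// Map the two signals to the most likely place to start investigating.
function triage(rumDegraded: boolean, lighthouseDegraded: boolean): Triage {
  if (rumDegraded && lighthouseDegraded) return "code_or_assets";
  if (lighthouseDegraded) return "simulated_environment";
  if (rumDegraded) return "cdn_or_network";
  return "no_regression";
}
```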
Cost
| Component | Monthly cost |
|---|---|
| Supabase (DB storage) | Free tier |
| Lambda Cron (once/day) | Free tier |
| Slack Webhook | Free |
| Total | $0 |
If traffic grows to the point where raw event volume pushes us past the database size limit of Supabase's free tier, the first adjustment is shortening the raw data retention window (30 days → 14 days). We'd need significantly more traffic before paying for storage is the right trade-off.
Takeaways
Self-hosted Web Vitals monitoring is a reasonable choice when three conditions hold:
- Your framework ships the measurement infrastructure (useReportWebVitals in Next.js's case, so you don't write a PerformanceObserver).
- You have a database that can handle ingest without per-row pricing (Supabase's free tier here).
- A daily cron is granular enough for your alerting needs; real-time streaming is overkill for catching production regressions in a small product.
The result is a system where we own the data, control the aggregation logic, and set thresholds calibrated to our own delivery model rather than a global baseline. That last point — custom thresholds — turns out to be more valuable than it sounds. A 1000ms LCP target is meaningful for an ISR/SSG site in a way that 2500ms simply isn't.
Try Storyie
The monitoring pipeline described here runs on storyie.com. If you write a diary there and it opens instantly, this is (partly) why. The iOS app is also available.