Why build your own RUM pipeline instead of using an existing tool like Datadog or SpeedCurve?

The honest answer is cost-to-value ratio. At Storyie's traffic volume, commercial RUM products are priced for teams with dedicated infra budgets — and we'd be paying for a large feature surface we'd never use. The flip side is that Next.js ships useReportWebVitals built in, Supabase gives us a real Postgres database on the free tier, and AWS Lambda's free tier handles a daily cron with room to spare. Stitching those together takes an afternoon, and the result is a system we own completely: no data-retention limits, no vendor-defined aggregation, no monthly invoice.

Why use sendBeacon instead of fetch for sending metric samples?

LCP in particular is often finalized just as the user navigates away from the page. A regular fetch fires and then gets cancelled by the browser when the page unloads. sendBeacon hands the payload to the browser's background transport queue, so the request completes even if the page is already gone. We fall back to fetch with keepalive: true for environments where sendBeacon isn't available — the keepalive flag gives it similar unload-survival properties. Silent failure on both paths is intentional: a missing sample has no user-visible effect, and monitoring should never degrade the product it's watching.

What does the p75 threshold of 1000ms for LCP actually mean, and how did you arrive at it?

Google's Core Web Vitals standard calls LCP 'good' at 2500ms or below, but that's a global baseline covering everything from 2G mobile to fast fiber. Storyie's public pages — marketing pages and blog posts — are served as SSG or ISR from a CDN. For that delivery model, an LCP above 1000ms at the 75th percentile means something is genuinely wrong: an unoptimized image slipped through, a font is blocking render, or there's a CDN misconfiguration. Setting the threshold tighter than Google's recommendation is one of the concrete advantages of owning your own monitoring stack.

Why does the alert fire on two consecutive breach days rather than one?

A single breach day is often noise — a CDN hiccup, an unusual geographic cohort, a short-lived outage. Two consecutive days raises the confidence that the degradation is structural rather than transient. We looked at three days too, but that felt too slow: if a bad deploy ships on Monday, we want a Slack message by Wednesday at the latest, not Friday. Two days hits the right balance between sensitivity and signal quality for a small, fast-moving team.

How do you handle duplicate samples from users who reload a page multiple times?

Each sample carries a sessionHash derived from a combination of user-agent and session identifier, stored as a hash only — no raw UA string, no identifiable token. During daily aggregation the Lambda deduplicates by sessionHash before computing p75, so a user who reloads five times contributes one data point to the percentile calculation rather than five. Without this, heavy users would pull the p75 down and mask real regressions affecting normal traffic.

How does this RUM data work alongside Lighthouse CI?

They answer different questions. RUM tells us what real users on real networks and devices actually experienced. Lighthouse CI tells us what a simulated ideal user on a throttled connection would experience in a fully controlled environment. We push both sets of numbers into the same Supabase tables so we can query them side by side. When RUM p75 degrades but Lighthouse is stable, we look at CDN or network conditions. When Lighthouse degrades but RUM is flat, the regression is probably limited to simulated conditions and not yet visible in the wild — useful for catching things early before they ship to more users.

Web Vitals monitoring without a SaaS: Next.js, Supabase, and a Lambda Cron

Storyie has run its own Real User Monitoring pipeline since the early days of the product. No Datadog, no SpeedCurve — just three things we already had: Next.js's built-in useReportWebVitals, Supabase for storage, and an SST Lambda Cron for daily aggregation. This post covers the design, the reasoning behind each decision, and what running it in production actually taught us.

TL;DR

useReportWebVitals captures LCP, FCP, CLS, and INP in the browser with zero custom instrumentation.
Samples are batched and sent via sendBeacon to a single API route, then stored in Supabase.
A daily Lambda Cron computes p75 per page, flags SLA breaches, and fires a Slack alert if two consecutive days breach threshold.
Total infrastructure cost: $0/month on current traffic.

Layer	Technology	Responsibility
Collection	`useReportWebVitals` + `sendBeacon`	Capture and batch-deliver RUM samples from the browser
Storage	Supabase PostgreSQL (3 tables)	Raw events (30-day retention), daily aggregates, alert log
Aggregation	SST Lambda Cron (daily, UTC midnight)	p75 computation, breach detection, Slack notification
Access control	Supabase RLS	Raw events restricted to service role; aggregates readable by authenticated users

Why we built it ourselves

The cost math doesn't work at small scale

Commercial RUM tools are priced for engineering organizations with dedicated observability budgets. At Storyie's traffic volume, we'd be paying for a product where we'd use about 10% of the feature set. The counter-argument — "just pay it and focus on building" — is reasonable, but it falls apart when the free tier already solves the problem.

Next.js ships the hard part

useReportWebVitals handles everything we'd otherwise have to write: PerformanceObserver wiring, element timing correlation, and long-task attribution. We get LCP, FCP, CLS, and INP out of the box. The remaining work is transport and storage, both commodity problems.

Data ownership matters

When your RUM data lives in a third-party product, the aggregation window and retention period are theirs to define. With Supabase, we can run any query we want: LCP trend by page over the past three months, CLS distribution on mobile vs. desktop, the specific image URL that was the LCP element on a given day. Ad hoc analysis is one SQL query away.

Overall architecture

[Browser]
  ↓ useReportWebVitals → batched via sendBeacon
[Next.js API Route] /api/performance/vitals
  ↓ writes via Supabase service role
[Supabase PostgreSQL]
  ├── performance_events      (raw RUM samples, 30-day TTL)
  ├── performance_daily       (p75 aggregates, kept forever)
  └── performance_alerts      (alert history, kept forever)
  ↑ aggregated nightly by Lambda
[SST Lambda Cron]
  ↓ 2 consecutive breach days → Slack notification
[Slack Webhook]

Collection layer

The client component

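// WebVitalsReporter.tsx (simplified)
"use client";

import { useCallback, useEffect, useRef } from "react";
import { useReportWebVitals } from "next/web-vitals";
// VitalSample, getNavigationType, getDeviceClass, and getSessionHash are
// defined elsewhere in the app and omitted here.

const BATCH_SIZE = 10;
const FLUSH_INTERVAL_MS = 5000;
const TRACKED_METRICS = ["LCP", "FCP", "CLS", "INP"];

export function WebVitalsReporter() {
  // Pending samples live in a ref so that pushing one never triggers a re-render
  const batchRef = useRef<VitalSample[]>([]);

  const flushBatch = useCallback(() => {
    if (batchRef.current.length === 0) return;
    const samples = [...batchRef.current];
    batchRef.current = [];
    const payload = JSON.stringify({ samples });

    // sendBeacon survives page navigation; fetch with keepalive as fallback
    if (navigator.sendBeacon) {
      const blob = new Blob([payload], { type: "application/json" });
      navigator.sendBeacon("/api/performance/vitals", blob);
    } else {
      fetch("/api/performance/vitals", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: payload,
        keepalive: true,
      }).catch(() => {}); // silent failure by design
    }
  }, []);

  // Time-based fallback flush, plus a final flush when the page is hidden
  useEffect(() => {
    const timer = setInterval(flushBatch, FLUSH_INTERVAL_MS);
    const onVisibilityChange = () => {
      if (document.visibilityState === "hidden") flushBatch();
    };
    document.addEventListener("visibilitychange", onVisibilityChange);
    return () => {
      clearInterval(timer);
      document.removeEventListener("visibilitychange", onVisibilityChange);
      flushBatch();
    };
  }, [flushBatch]);

  // Keep the callback reference stable so re-renders do not re-register it
  const handleMetric = useCallback<Parameters<typeof useReportWebVitals>[0]>(
    (metric) => {
      if (!TRACKED_METRICS.includes(metric.name)) return;

      batchRef.current.push({
        path: window.location.pathname,
        metric: metric.name.toLowerCase(),
        value: metric.value,
        navigationType: getNavigationType(),
        deviceClass: getDeviceClass(),
        collectedAt: new Date().toISOString(),
        sessionHash: getSessionHash(),
      });

      // Size-based flush: send as soon as the batch is full
      if (batchRef.current.length >= BATCH_SIZE) flushBatch();
    },
    [flushBatch],
  );

  useReportWebVitals(handleMetric);

  return null;
}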

Three design choices worth explaining:

sendBeacon first. LCP is often finalized at the moment the user navigates away. A regular fetch gets cancelled on unload; sendBeacon queues the request in the browser's background transport and completes it regardless. The keepalive: true fallback gives fetch similar unload-survival behavior in browsers where sendBeacon isn't available.

Batching over one-per-metric requests. Four metrics times the number of page transitions adds up quickly. A batch of 10 with a 5-second fallback flush keeps API call count low without meaningfully delaying delivery.

Silent failure throughout. If the API is down or the network drops the request, nothing happens from the user's perspective. A missing sample has no product impact, and monitoring infrastructure should never degrade the thing it's monitoring.
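On the receiving side, the API route has to treat each batch as untrusted input before writing it with the service role. The post does not show the handler, so the following is only a sketch under assumptions: the field names mirror the client component, while `parseVitalsPayload`, `isVitalSample`, `MAX_SAMPLES`, and the specific checks are hypothetical.

```typescript
// Hypothetical validation for the /api/performance/vitals request body.
// Field names follow the client component; the limits are assumptions.
type MetricName = "lcp" | "fcp" | "cls" | "inp";

interface VitalSample {
  path: string;
  metric: MetricName;
  value: number;
  navigationType: string;
  deviceClass: string;
  collectedAt: string; // ISO 8601 timestamp
  sessionHash: string;
}

const ALLOWED_METRICS: readonly string[] = ["lcp", "fcp", "cls", "inp"];
const MAX_SAMPLES = 50; // assumed cap, comfortably above the client batch size

function isVitalSample(input: unknown): input is VitalSample {
  if (typeof input !== "object" || input === null) return false;
  const s = input as Record<string, unknown>;
  return (
    typeof s.path === "string" &&
    s.path.startsWith("/") &&
    typeof s.metric === "string" &&
    ALLOWED_METRICS.includes(s.metric) &&
    typeof s.value === "number" &&
    Number.isFinite(s.value) &&
    s.value >= 0 &&
    typeof s.navigationType === "string" &&
    typeof s.deviceClass === "string" &&
    typeof s.collectedAt === "string" &&
    !Number.isNaN(Date.parse(s.collectedAt)) &&
    typeof s.sessionHash === "string" &&
    s.sessionHash.length > 0
  );
}

// Returns only well-formed samples. Malformed input yields an empty array,
// so the route can respond without throwing.
function parseVitalsPayload(rawBody: string): VitalSample[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return [];
  }
  if (typeof parsed !== "object" || parsed === null) return [];
  const samples = (parsed as { samples?: unknown }).samples;
  if (!Array.isArray(samples)) return [];
  return samples.slice(0, MAX_SAMPLES).filter(isVitalSample);
}
```

Dropping bad samples quietly, rather than rejecting the whole batch, matches the silent-failure stance on the client: one malformed entry should not cost the other nine.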

Database schema

Page tiers

CREATE TYPE page_performance_tier AS ENUM ('tier_1', 'tier_2');

Not all pages deserve equal attention. High-traffic pages (home, Explore, blog index) are tier 1 — a regression there affects a large fraction of users. Low-traffic pages that rarely appear in search are tier 2. Tiering the pages means our alerting focuses signal on what matters and avoids noise from pages with tiny sample sizes.

Three-table layout

Table	Purpose	Retention
`performance_events`	Raw RUM samples	30 days
`performance_daily`	Per-page p75 aggregates	Indefinite
`performance_alerts`	Alert open/resolved history	Indefinite

Raw events are ephemeral: once they've been aggregated, the individual samples have little ongoing value. The daily aggregates and alert log are what you reach for in a post-incident review or a trend analysis.

RLS policies

// Raw events: service role write only — no client-side reads
insertPolicy: pgPolicy("events_insert", {
  for: "insert",
  withCheck: sql`(select auth.role()) = 'service_role'`,
}),

// Daily aggregates: authenticated users can read (for the internal dashboard)
selectPolicy: pgPolicy("daily_select", {
  for: "select",
  using: sql`(select auth.role()) = 'authenticated'
    OR (select auth.role()) = 'service_role'`,
}),

Supabase RLS keeps raw event data off-limits to anything other than the service role. Aggregates are readable by authenticated users so we can build an internal dashboard without a separate backend layer.

Aggregation and alerting

What the Lambda Cron does each night

1. Fetch yesterday's raw samples from performance_events.
2. Compute p75 for LCP, CLS, and INP per page per day.
3. Evaluate thresholds: LCP p75 > 1000ms → breach; fewer than 10 samples → insufficient_samples.
4. Check for consecutive breaches: two consecutive breach days on a tier-1 page → Slack notification.
5. Auto-resolve alerts: pages that return to pass close any open alert automatically.
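The p75 and threshold steps above can be sketched as follows. The 1000ms threshold and the 10-sample minimum come from the post. The nearest-rank percentile definition and the names `percentile` and `evaluateLcp` are assumptions, since the post does not say how the Lambda computes p75.

```typescript
type DailyStatus = "pass" | "breach" | "insufficient_samples";

const LCP_P75_THRESHOLD_MS = 1000; // from the post
const MIN_SAMPLES = 10; // from the post

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. (Assumed definition.)
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  const picked = sorted[Math.min(rank, sorted.length) - 1];
  if (picked === undefined) throw new Error("percentile of an empty sample set");
  return picked;
}

// Classifies one page's LCP samples for one day.
function evaluateLcp(values: number[]): { status: DailyStatus; p75: number | null } {
  if (values.length < MIN_SAMPLES) {
    return { status: "insufficient_samples", p75: null };
  }
  const p75 = percentile(values, 75);
  return { status: p75 > LCP_P75_THRESHOLD_MS ? "breach" : "pass", p75 };
}
```

Checking the sample count before computing the percentile keeps low-traffic pages from producing a p75 that is really just one or two slow visits.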

Why 1000ms for LCP, not Google's 2500ms

Google's "Good" band covers global traffic on all connection types. Our public pages are SSG or ISR served from a CDN. For that delivery model, a p75 LCP above 1000ms means something is actively wrong — an image wasn't optimized, a render-blocking font slipped in, or there's a CDN issue. The threshold is ours to set, and we set it tighter than the global baseline because our delivery model justifies it.

Why two consecutive breach days

A single bad day is usually noise: a CDN hiccup, an unusual geographic cohort, a brief spike in low-bandwidth traffic. Two consecutive days indicates structural degradation — the kind caused by a bad deploy or a newly introduced unoptimized asset. Three days was too slow; we want an alert by day two. One day produced too many false positives.
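As a rough illustration of the two-day rule, here is a minimal sketch. It assumes one status row per page per UTC day and that a day with insufficient samples breaks the streak. Neither detail is stated in the post, and `shouldAlert` and `DailyResult` are hypothetical names.

```typescript
type DayStatus = "pass" | "breach" | "insufficient_samples";

interface DailyResult {
  date: string; // YYYY-MM-DD, UTC
  status: DayStatus;
}

const REQUIRED_CONSECUTIVE_BREACHES = 2; // from the post
const DAY_MS = 24 * 60 * 60 * 1000;

// True when the most recent days are all breaches on adjacent calendar dates.
function shouldAlert(history: DailyResult[]): boolean {
  const recent = [...history]
    .sort((a, b) => a.date.localeCompare(b.date))
    .slice(-REQUIRED_CONSECUTIVE_BREACHES);
  if (recent.length < REQUIRED_CONSECUTIVE_BREACHES) return false;

  return recent.every((day, i) => {
    if (day.status !== "breach") return false;
    if (i === 0) return true;
    const prev = recent[i - 1];
    if (prev === undefined) return false;
    // A missing day between two breaches does not count as consecutive
    const gap = Date.parse(`${day.date}T00:00:00Z`) - Date.parse(`${prev.date}T00:00:00Z`);
    return gap === DAY_MS;
  });
}
```

Moving to a three-day window would only require changing `REQUIRED_CONSECUTIVE_BREACHES`, which makes the sensitivity trade-off discussed above cheap to revisit.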

What production taught us

Attribution data is the most useful column in the table

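// Capture which element drove the LCP measurement
// (metric.attribution is only present when attribution reporting is enabled)
const attribution: { element?: string; url?: string } = {};
if (metric.name === "LCP" && metric.attribution && "element" in metric.attribution) {
  const lcpAttr = metric.attribution as { element?: string; url?: string };
  attribution.element = lcpAttr.element;
  attribution.url = lcpAttr.url; // usually an image URL
}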

When an LCP alert fires, the first question is always "which element?" Without attribution, you're back to guessing. With it, you look at the URL column and immediately know whether you're dealing with an unoptimized hero image, a slow web font, or something else entirely. This single field cuts investigation time significantly.
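One caveat worth sketching: in Next.js, `metric.attribution` is only populated when attribution is switched on in the config. The fragment below is an assumption about how that might look, not a copy of Storyie's config. It presumes a Next.js version that accepts `next.config.ts` and still exposes the option under `experimental`, so check the documentation for your version.

```typescript
// next.config.ts (sketch; not taken from the Storyie codebase)
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    // Request attribution data for LCP so metric.attribution is populated
    webVitalsAttribution: ["LCP"],
  },
};

export default nextConfig;
```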

Session deduplication matters more than expected

Without it, a user who reloads a page five times contributes five samples to the p75 calculation. Heavy users — often developers and power users — reload more than casual ones, which pulls p75 downward and masks real regressions affecting normal traffic. We store a hash of UA + session ID (no raw identifiers) and deduplicate by it during aggregation.
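A minimal sketch of that deduplication step, applied to the samples of one page and one metric. Keeping the earliest sample per session is an assumption, because the post only says that duplicates are collapsed by `sessionHash`. `dedupeBySession` and `SessionSample` are hypothetical names.

```typescript
interface SessionSample {
  sessionHash: string;
  value: number;
  collectedAt: string; // ISO 8601 in UTC, as produced by toISOString()
}

// Keeps one sample per session: the earliest by collectedAt (assumed policy).
// ISO 8601 UTC strings in the same format sort chronologically as plain strings.
function dedupeBySession(samples: SessionSample[]): SessionSample[] {
  const firstBySession = new Map<string, SessionSample>();
  for (const sample of samples) {
    const kept = firstBySession.get(sample.sessionHash);
    if (kept === undefined || sample.collectedAt < kept.collectedAt) {
      firstBySession.set(sample.sessionHash, sample);
    }
  }
  return Array.from(firstBySession.values());
}
```

The result feeds straight into the p75 calculation, so each session carries equal weight regardless of how often the page was reloaded.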

A dry-run flag is not optional

The aggregation Lambda runs against production data. Before we wired it to Supabase in prod, we added a --dry-run flag that runs the full calculation pipeline and prints what it would write without touching any tables. We use it every time we change the aggregation logic. It made the iteration cycle from local to prod deploy fast and safe.
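One way to structure such a flag is to keep the calculation identical in both modes and branch only at the write step. The sketch below assumes that shape. `persistAggregates`, `AggregateSink`, and the row fields are hypothetical and not taken from the Storyie codebase.

```typescript
interface AggregateRow {
  path: string;
  day: string; // YYYY-MM-DD
  metric: string;
  p75: number;
}

// Anything that can store aggregate rows, e.g. a thin wrapper around a DB client.
interface AggregateSink {
  upsert(rows: AggregateRow[]): Promise<void>;
}

interface RunOptions {
  dryRun: boolean;
  log?: (line: string) => void; // injectable for tests; defaults to console.log
}

// Writes the rows, or in dry-run mode only prints what would be written.
// Returns the number of rows actually persisted.
async function persistAggregates(
  rows: AggregateRow[],
  sink: AggregateSink,
  options: RunOptions,
): Promise<number> {
  const log = options.log ?? ((line: string) => console.log(line));
  if (options.dryRun) {
    for (const row of rows) {
      log(`[dry-run] would upsert ${JSON.stringify(row)}`);
    }
    return 0;
  }
  await sink.upsert(rows);
  return rows.length;
}
```

Because only the final step differs, a dry run exercises the same fetch and p75 code that production does, which is what makes its output trustworthy.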

RUM and Lighthouse CI answer different questions

We also push Lighthouse CI results into the same performance_daily table. The combination is more useful than either alone: RUM shows what real users experienced, Lighthouse shows what an ideal simulated user would experience. When both degrade together, the cause is in the code or assets. When RUM is stable but Lighthouse degrades, it's probably an artifact of the simulated environment. When RUM degrades but Lighthouse is flat, look at CDN or network conditions.
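The triage rules in the paragraph above amount to a small decision table. Here is a sketch of it as a function, where the `Trend` and `Triage` labels and the `triage` name are made up for illustration:

```typescript
type Trend = "degraded" | "stable";

type Triage =
  | "code_or_assets" // both signals degrade together
  | "cdn_or_network" // RUM degrades, Lighthouse is flat
  | "simulated_environment" // Lighthouse degrades, RUM is stable
  | "no_action";

// Maps the two signals to the first place worth looking.
function triage(rum: Trend, lighthouse: Trend): Triage {
  if (rum === "degraded" && lighthouse === "degraded") return "code_or_assets";
  if (rum === "degraded") return "cdn_or_network";
  if (lighthouse === "degraded") return "simulated_environment";
  return "no_action";
}
```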

Cost

Component	Monthly cost
Supabase (DB storage)	Free tier
Lambda Cron (once/day)	Free tier
Slack Webhook	Free
Total	$0

If traffic grows to the point where raw event volume pushes us past the database size limit on Supabase's free tier, the first adjustment is shortening the raw data retention window (30 days → 14 days). We'd need significantly more traffic before paying for storage is the right trade-off.

Takeaways

Self-hosted Web Vitals monitoring is a reasonable choice when three conditions hold:

Your framework ships the measurement infrastructure (useReportWebVitals in Next.js's case — you don't write a PerformanceObserver).
You have a database that can handle ingest without per-row pricing (Supabase's free tier here).
A daily cron is granular enough for your alerting needs — real-time streaming is overkill for catching production regressions in a small product.

The result is a system where we own the data, control the aggregation logic, and set thresholds calibrated to our own delivery model rather than a global baseline. That last point — custom thresholds — turns out to be more valuable than it sounds. A 1000ms LCP target is meaningful for an ISR/SSG site in a way that 2500ms simply isn't.

Related Posts

Building a Monorepo with pnpm and TypeScript: workspace conventions and dependency rules that keep our infra packages organized
Deploying Next.js to AWS with SST v3: the SST setup that powers the Lambda Cron used here

Try Storyie

The monitoring pipeline described here runs on storyie.com. If you write a diary there and it opens instantly, this is (partly) why. The iOS app is also available.
