Why does Storyie use both a character count and a word count threshold instead of one or the other?

The two metrics cover different failure modes that come up in a multilingual app. Japanese and other CJK languages use very few spaces, so a pure word count would discard a substantive 800-character Japanese entry — the tokenizer just doesn't produce many tokens. Latin-script languages go the other way: a word count of 150 is easy to hit by repeating short words, so character count alone isn't a reliable proxy for actual substance. Running both thresholds as an AND condition lets a CJK entry pass on character count while an English entry still has to clear the word count bar. Neither metric is perfect in isolation; together they catch the degenerate cases on both ends.

When a diary entry is noindex, does Storyie also stop following links on that page?

No — we keep follow: true on every page, including noindex ones. The reasoning is that a short diary entry might still link to another public profile or a longer entry that does meet the quality threshold. Dropping follow would cut off that link equity unnecessarily. We only control whether the page itself gets indexed, not whether crawlers can discover downstream URLs through it. This is a deliberate split: robots index controls the page's own presence in search results; follow controls graph traversal. Keeping them separate gives us finer-grained control without cutting off valid discovery paths.

Why is Lexical's full runtime excluded from the OGP image generation path?

OGP images are generated on the server, potentially for every public diary at render time. Pulling in the full Lexical runtime for that job would meaningfully inflate the cold-start cost and add a DOM dependency to a path that runs under Node with no browser context. All we need is the plain text — the formatted output doesn't matter for a 150-character OGP preview. So we wrote a small recursive utility (lexicalJsonToPlainTextServerSafe) that walks the Lexical JSON tree and concatenates TextNode values without importing any Lexical package. It's about 30 lines and handles the nested paragraph/heading/list structure we actually use.

Why does the sitemap only include the 100 most recent public entries instead of all of them?

Two reasons: query cost and generation time. Fetching every public diary in a single sitemap generation pass would produce an expensive Supabase query on each ISR cycle, and for an app at our current scale the marginal SEO gain from entries 101 through N is negligible. When we have enough public content to make it worthwhile, we'll move to a sitemap index that splits entries across multiple child sitemaps — that's the standard pattern for large UGC sites. For now, the 100-entry limit keeps the generation path fast and the Supabase query bill predictable.

What happens to the sitemap if the database is unreachable?

We catch the error in the sitemap generation function and return a minimal valid sitemap containing just the homepage. An empty sitemap would signal to search engines that the site has no content, which could hurt crawl frequency and eventually rankings — even after the DB recovers. Returning the homepage entry is the cheapest way to keep the sitemap valid and non-empty through transient DB failures without requiring a fallback cache or a secondary data store.

How does Storyie handle the duplicate-content risk from subdomain and path-based user URLs?

Public user content is accessible at both storyie.com/u/[slug]/... and [slug].storyie.com/.... We canonicalize to the /u/[slug] path by setting that as the canonical URL in generateMetadata. This tells Google which version to index and accumulate link equity on. Without an explicit canonical, both URLs would compete with each other, fragmenting authority and increasing the chance of unpredictable indexing decisions — a common pitfall with multi-tenant apps that support custom subdomains.

SEO for a diary app: making public content discoverable without compromising privacy

A diary app has an awkward relationship with SEO. Diaries are private by default — that's the whole point. But when a user explicitly flips a diary entry to public, we want search engines to find it, index it well, and send genuinely interested readers their way.

Storyie has been incrementally building out this SEO layer since launch. This post walks through the four-layer design we landed on: how we gate which pages get indexed at all, how we filter out thin content before it reaches search engines, how we produce structured metadata from freeform Lexical JSON, and how the sitemap stays alive even when the database has a bad day.

TL;DR

Diary app SEO is not "index everything" or "block everything" — it's a staged filter where only public, quality-clearing content reaches search engines.
Block non-production crawlers at the environment level first (robots.ts), before anything else.
Apply a bilingual quality threshold — character count for CJK, word count for Latin script — as an AND condition to weed out thin entries.
Build a lightweight lexicalJsonToPlainTextServerSafe utility early; you'll need plain-text extraction in more places than you expect.
Keep follow: true even on noindex pages so link equity can still flow downstream.
A sitemap that returns a partial result on DB failure beats one that returns nothing.

| Layer | Mechanism | Purpose |
| ----- | --------- | ------- |
| 1. Index control | robots.ts + per-page noindex | Block non-production environments; exclude auth/API paths |
| 2. Content quality gate | Character count AND word count threshold | Filter thin entries before they reach crawlers |
| 3. Structured data | JSON-LD (Article), OGP, Twitter Card | Improve how indexed entries appear in search results |
| 4. Sitemap | ISR every hour, DB-failure-safe | Keep crawlers up to date on new public content |

The problem is unique to diary apps

Blog and e-commerce SEO operates on content that was produced with discoverability in mind. Diary apps start from the opposite premise.

Over 95% of content is private. Public visibility is an explicit opt-in. The default must always be private, and any SEO work has to respect that unconditionally.

Quality varies wildly. A single-sentence "tired today" entry and a 2,000-word reflective essay both come out of the same editor as "public diaries." Indexing the one-liner helps nobody.

The content structure is unpredictable. Storyie uses the Lexical rich-text editor, so a diary might be a stream of paragraphs, a mix of headings and lists, or a heavily nested document. We can't assume any particular shape when generating OGP previews or structured data.

Layer 1: Environment-level index control

The first and cheapest thing to get right is making sure crawlers can't reach non-production environments.

const IS_PRODUCTION =
  process.env.NODE_ENV === "production" &&
  BASE_URL.includes("storyie.com") &&
  !BASE_URL.includes("staging.storyie.com");

The staging environment runs on a subdomain of storyie.com. Without the explicit !BASE_URL.includes("staging.storyie.com") guard, the NODE_ENV === "production" check alone would let the staging robots.ts allow crawlers — SST deploys each environment with its own domain but the same build configuration. Staging content appearing in the production index is the kind of duplicate-content incident that's invisible until it surfaces in Search Console six months later.

Paths that are always disallowed in production: /api/, /login, /account-name-setup, and anything under the dashboard. We're not trying to hide these from users; we just have no reason to send crawlers there.
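
Put together, the environment check and the disallow list might look like the sketch below. It is an illustration under stated assumptions, not Storyie's actual robots.ts: RobotsConfig is a local stand-in for the object Next.js expects from app/robots.ts, buildRobots and its parameters are hypothetical, and "/dashboard/" is an assumed path, since the post only says "anything under the dashboard".

```typescript
// Local stand-in for the shape Next.js expects from app/robots.ts
// (MetadataRoute.Robots), declared here so the sketch needs no imports.
type RobotsConfig = {
  rules: { userAgent: string; allow?: string; disallow?: string | string[] };
  sitemap?: string;
};

// True only for a production build served from the real domain.
function isProduction(nodeEnv: string | undefined, baseUrl: string): boolean {
  return (
    nodeEnv === "production" &&
    baseUrl.includes("storyie.com") &&
    !baseUrl.includes("staging.storyie.com")
  );
}

// Hypothetical builder: everything outside production is blocked outright.
function buildRobots(nodeEnv: string | undefined, baseUrl: string): RobotsConfig {
  if (!isProduction(nodeEnv, baseUrl)) {
    return { rules: { userAgent: "*", disallow: "/" } };
  }
  return {
    rules: {
      userAgent: "*",
      allow: "/",
      // "/dashboard/" is an assumed path for the dashboard area.
      disallow: ["/api/", "/login", "/account-name-setup", "/dashboard/"],
    },
    sitemap: `${baseUrl}/sitemap.xml`,
  };
}
```

Passing the environment values in as arguments, rather than reading them inside the function, keeps the staging case easy to cover in a unit test.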

Layer 2: Content quality gate

This is where the diary-app specifics really show up. Even among legitimately public entries, we apply a threshold before allowing indexing.

export const SEO_CONTENT_QUALITY_THRESHOLDS = {
  MIN_CHARACTERS: 800,
  MIN_WORDS: 150,
} as const;

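Both thresholds must pass, and only public entries are evaluated at all. shouldIndexDiary checks visibility first, then delegates the threshold check to evaluateContentQuality: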

export function shouldIndexDiary(diary: {
  visibility: string;
  content: DiaryContent | string;
}): boolean {
  if (diary.visibility !== "public") return false;
  return evaluateContentQuality(diary.content);
}

The AND condition is intentional. Japanese has very few space-delimited tokens — a solid 800-character entry in Japanese might only tokenize to 20 "words," so a pure word count would discard it. English goes the other way: word count is a meaningful signal, but character count alone doesn't rule out padding. Running both thresholds means CJK entries clear primarily on character count and Latin-script entries clear primarily on word count, while degenerate cases on either end fail both.
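
To make the divergence between the two metrics concrete, here is a small sketch. The post does not show how evaluateContentQuality counts words or characters, so countWords and countCharacters are hypothetical helpers that assume whitespace tokenization and whitespace-free code-point counting.

```typescript
// Whitespace-delimited token count: reasonable for English, but it
// undercounts languages that rarely use spaces.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Count Unicode code points, ignoring whitespace, so characters outside
// the Basic Multilingual Plane (emoji, rare kanji) count once.
function countCharacters(text: string): number {
  return Array.from(text.replace(/\s+/g, "")).length;
}

const japanese = "今日は良い天気でした。 散歩に行きました。";
const english = "I walked to the park today";

console.log(countWords(japanese), countCharacters(japanese)); // 2 20
console.log(countWords(english), countCharacters(english)); // 6 21
```

Two strings of nearly identical character length differ by a factor of three in word count, which is why neither metric can be read the same way across scripts.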

The result feeds directly into generateMetadata:

robots: { index: shouldIndex, follow: true }

Note that follow is always true. A short diary might still link to a longer public entry or a profile that does clear the quality bar. Dropping follow on noindex pages would cut off that traversal path unnecessarily. We control the page's own search presence independently from the crawl graph.

Layer 3: Structured data from Lexical JSON

Server-safe plain-text extraction

OGP previews, meta descriptions, quality evaluation, and sitemap lastModified dates all need the same thing: a plain-text representation of a Lexical EditorState. We wrote a single utility for this early and used it everywhere:

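// lexicalJsonToPlainTextServerSafe: no Lexical runtime dependency
// Minimal node shape assumed for this excerpt (declared locally, not imported).
type LexicalNode = { type: string; text?: string; children?: LexicalNode[] };

function extractText(node: LexicalNode): string {
  // Text nodes are the leaves that carry the actual characters.
  if (node.type === "text") return node.text ?? "";
  // Leaf nodes without text (line breaks, decorators) contribute nothing.
  if (!node.children) return "";
  // Recurse into paragraphs, headings, lists, and any other container.
  return node.children.map(extractText).join(" ");
}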

The utility recursively walks the Lexical JSON tree and concatenates TextNode values. It imports nothing from Lexical. That matters because the OGP image generation path runs under Node during ISR — there's no DOM, and pulling in the full Lexical runtime for a 150-character preview would inflate cold-start cost for no reason.

We built this utility early in the project. In retrospect, it should have been one of the first things written — it shows up in far more places than we anticipated.
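
As a rough usage sketch, this is how the recursive walk behaves on a small nested document. The toPlainText wrapper is hypothetical: it assumes content arrives either as a serialized JSON string or as an already-parsed object with a root node, and it collapses the extra spaces that joining children with " " introduces between adjacent inline runs.

```typescript
// Minimal node shape assumed for serialized Lexical JSON.
type LexicalNode = { type: string; text?: string; children?: LexicalNode[] };

function extractText(node: LexicalNode): string {
  if (node.type === "text") return node.text ?? "";
  if (!node.children) return "";
  return node.children.map(extractText).join(" ");
}

// Hypothetical wrapper: accepts the stored JSON string or a parsed object.
// Assumes any string input is valid serialized editor state.
function toPlainText(content: string | { root: LexicalNode }): string {
  const state: { root: LexicalNode } =
    typeof content === "string" ? JSON.parse(content) : content;
  // Collapse runs of whitespace left over from join(" ").
  return extractText(state.root).replace(/\s+/g, " ").trim();
}

const sample = {
  root: {
    type: "root",
    children: [
      { type: "heading", children: [{ type: "text", text: "Monday" }] },
      {
        type: "paragraph",
        children: [
          { type: "text", text: "Slept " },
          { type: "text", text: "well" },
        ],
      },
    ],
  },
};

console.log(toPlainText(sample)); // "Monday Slept well"
```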

OGP image generation

Each public diary gets a dynamically generated OGP image via Next.js's opengraph-image.tsx:

apps/web/app/(public)/u/[slug]/diary/[diarySlug]/opengraph-image.tsx

The image pulls the first 150 characters of extracted plain text as a preview. Generation happens at request time and is cached by Next.js's OGP image infrastructure.
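
Cutting the preview is simple, but in a multilingual app it is worth cutting by code point rather than by UTF-16 unit. The helper below is a hypothetical sketch of that step (buildPreview is not a name from the codebase).

```typescript
// Hypothetical helper: keep at most `limit` characters for the OGP preview.
// Array.from iterates a string by code point, so an emoji or a
// supplementary-plane kanji is never split into a lone surrogate.
function buildPreview(plainText: string, limit = 150): string {
  const chars = Array.from(plainText.trim());
  return chars.length <= limit ? chars.join("") : chars.slice(0, limit).join("");
}
```

A plain plainText.slice(0, 150) would work for most entries but can leave half of a surrogate pair at the cut point, which renders as a replacement glyph in the generated image.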

JSON-LD structured data

Public diary pages include Article structured data:

const structuredData = {
  "@context": "https://schema.org",
  "@type": "Article",
  headline: title,
  description: description,
  datePublished: datePublished,
  dateModified: dateModified,
  author: { "@type": "Person", name: author.name, url: author.url },
  publisher: { "@type": "Organization", name: "Storyie", url: "https://storyie.com" },
};

We considered BlogPosting and CreativeWork. Article won on pragmatic grounds: it has the widest Rich Results support in Google's documentation. For a small team, the right heuristic is to pick the schema type Google actually tests against.

One security note worth being explicit about: because diary content is user-generated, we sanitize the JSON-LD before embedding it via dangerouslySetInnerHTML. Specifically, < in any user-provided field is replaced with \u003c to prevent a </script> sequence in a diary title from closing the script block early. UGC in <script> tags is a real injection surface; this is not optional.
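
A minimal sketch of that sanitization step follows. serializeJsonLd is a hypothetical name; the technique is the common one of escaping the serialized output rather than each field individually.

```typescript
// Serialize JSON-LD for embedding inside <script type="application/ld+json">.
// JSON.stringify does not escape "<", so a title containing "</script>"
// would end the script element early. "\u003c" is the JSON escape for "<":
// the HTML parser never sees a tag, and JSON parsers decode it back to "<".
function serializeJsonLd(data: unknown): string {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}

const html = serializeJsonLd({
  "@type": "Article",
  headline: "</script><script>alert(1)</script>",
});

console.log(html.includes("<")); // false
console.log(JSON.parse(html).headline); // the original title, unchanged
```

The escaped string is what would then be passed as the __html value of dangerouslySetInnerHTML. Escaping after serialization covers every user-provided field at once, including ones added later.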

Canonical URLs

User content is reachable at both storyie.com/u/[slug]/... and [slug].storyie.com/.... We set the canonical to the /u/[slug] path in generateMetadata. Without an explicit canonical, both versions would compete for the same index position and fragment link equity — a predictable pitfall for multi-tenant apps that support custom subdomains.
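
The canonical and robots settings both end up in the object returned from generateMetadata. The sketch below shows one way that object could be assembled. DiaryMetadata is a local stand-in for the relevant subset of Next.js's Metadata type, buildDiaryMetadata is hypothetical, and the /diary/[diarySlug] segment is taken from the opengraph-image.tsx route path shown earlier.

```typescript
// Local stand-in for the subset of Next.js's Metadata type used here.
type DiaryMetadata = {
  robots: { index: boolean; follow: boolean };
  alternates: { canonical: string };
};

const CANONICAL_ORIGIN = "https://storyie.com";

// Hypothetical helper: whichever host served the request, the canonical
// always points at the path-based URL on the main domain.
function buildDiaryMetadata(
  slug: string,
  diarySlug: string,
  shouldIndex: boolean
): DiaryMetadata {
  const path = `/u/${encodeURIComponent(slug)}/diary/${encodeURIComponent(diarySlug)}`;
  return {
    // follow stays true even when the page itself is noindex.
    robots: { index: shouldIndex, follow: true },
    alternates: { canonical: `${CANONICAL_ORIGIN}${path}` },
  };
}
```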

Layer 4: ISR sitemap

The sitemap runs on a one-hour ISR cycle:

export const revalidate = 3600;

Static pages, blog posts, public diaries, public notes, and user profiles all land in a single sitemap.xml. For diaries and notes, we fetch the 100 most recent public entries from Supabase:

const [publicDiaries, publicNotes] = await Promise.all([
  diaryQueries.getPublicDiaries(100),
  noteQueries.getPublicNotes(100),
]);

The 100-entry cap is a deliberate tradeoff between crawl coverage and query cost. Fetching every public entry on every ISR cycle would produce increasingly expensive queries as the content library grows. When the volume justifies it, we'll split to a sitemap-index.xml with child sitemaps — that's the standard pattern for large UGC catalogs.
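
For reference, the sitemap-index pattern amounts to chunking the URL list and emitting one index document that points at each chunk. The following is a rough sketch of that future shape, not something Storyie ships today; the /sitemaps/diaries-N.xml naming is hypothetical.

```typescript
// The sitemap protocol caps a single sitemap file at 50,000 URLs.
const MAX_URLS_PER_SITEMAP = 50000;

// Split a flat URL list into child-sitemap-sized chunks.
function chunkUrls(urls: string[], size = MAX_URLS_PER_SITEMAP): string[][] {
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

// Render the index document that points at each child sitemap.
// The child file naming scheme here is hypothetical.
function buildSitemapIndex(baseUrl: string, childCount: number): string {
  const entries = Array.from(
    { length: childCount },
    (_, i) => `  <sitemap><loc>${baseUrl}/sitemaps/diaries-${i}.xml</loc></sitemap>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}
```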

Failure mode handling

Sitemap generation can fail if the DB is temporarily unreachable. Our catch block returns the homepage entry rather than nothing:

catch (error) {
  console.error("Error generating sitemap:", error);
  return [{ url: BASE_URL, lastModified: new Date(), changeFrequency: "daily", priority: 1.0 }];
}

An empty sitemap tells crawlers the site has no content. Even a transient DB failure could reduce crawl frequency if the sitemap consistently returns empty. Returning at least the homepage keeps the sitemap valid through the outage without requiring a fallback store or a cache layer.
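
The whole pattern fits in one small function. In this sketch, loadEntries is a hypothetical stand-in for the Supabase queries and SitemapEntry is a simplified local type, so the fallback behavior can be exercised without a database.

```typescript
type SitemapEntry = {
  url: string;
  lastModified: Date;
  changeFrequency?: "daily" | "weekly";
  priority?: number;
};

const BASE_URL = "https://storyie.com";

// Hypothetical wrapper: any rejection from the data source falls back to a
// homepage-only sitemap instead of an empty one.
async function generateSitemap(
  loadEntries: () => Promise<SitemapEntry[]>
): Promise<SitemapEntry[]> {
  const home: SitemapEntry = {
    url: BASE_URL,
    lastModified: new Date(),
    changeFrequency: "daily",
    priority: 1.0,
  };
  try {
    const entries = await loadEntries();
    return [home, ...entries];
  } catch (error) {
    console.error("Error generating sitemap:", error);
    return [home];
  }
}
```

Because the fallback is still a valid sitemap, the next successful ISR cycle simply replaces it with the full list and no separate recovery path is needed.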

Things that bit us

noindex requires conviction. The instinct is to index everything and let Google sort it out. In practice, a high volume of thin pages degrades the domain's overall crawl quality. After adding the quality gate, crawl frequency stabilized noticeably — something we confirmed in Search Console. Filtering aggressively early is less risky than it feels.

Plan for plain-text extraction before you need it. We built lexicalJsonToPlainTextServerSafe relatively early, but we still had to retrofit it into a couple of places. If you're building on Lexical for a public-content product, this utility should be among your first infrastructure pieces, not an afterthought.

Staging robots.ts is Day 1 infrastructure. It's easy to assume that a staging URL no one knows about won't get indexed. Google finds subdomains through certificate transparency logs, link discovery, and referrer headers. Explicit Disallow: / in a non-production robots.ts is not optional.

Related posts

Building a Monorepo with pnpm and TypeScript — workspace conventions and the cross-platform package rules that let us share utilities like lexicalJsonToPlainTextServerSafe cleanly
Cross-platform Lexical with use dom: monorepo gains and the bridges you still own — how Lexical's JSON serialization format works across Next.js and Expo, which is also what the plain-text extraction utility reads

Try Storyie

When you write a diary and make it public on storyie.com, all four layers fire: the entry clears the quality gate, gets its own OGP image and JSON-LD markup, lands in the next sitemap generation, and becomes findable in search results — while every private entry stays completely invisible to crawlers. That's the goal: full discoverability for the content you choose to share, zero visibility for everything else.