SEO for a diary app: making public content discoverable without compromising privacy
A diary app has an awkward relationship with SEO. Diaries are private by default — that's the whole point. But when a user explicitly flips a diary entry to public, we want search engines to find it, index it well, and send genuinely interested readers their way.
Storyie has been incrementally building out this SEO layer since launch. This post walks through the four-layer design we landed on: how we gate which pages get indexed at all, how we filter out thin content before it reaches search engines, how we produce structured metadata from freeform Lexical JSON, and how the sitemap stays alive even when the database has a bad day.
TL;DR
- Diary app SEO is not "index everything" or "block everything" — it's a staged filter where only public, quality-clearing content reaches search engines.
- Block non-production crawlers at the environment level first (robots.ts), before anything else.
- Apply a bilingual quality threshold (character count for CJK, word count for Latin script) as an AND condition to weed out thin entries.
- Build a lightweight lexicalJsonToPlainTextServerSafe utility early; you'll need plain-text extraction in more places than you expect.
- Keep follow: true even on noindex pages so link equity can still flow downstream.
- A sitemap that returns a partial result on DB failure beats one that returns nothing.
| Layer | Mechanism | Purpose |
| ----- | --------- | ------- |
| 1. Index control | robots.ts + per-page noindex | Block non-production environments; exclude auth/API paths |
| 2. Content quality gate | Character count AND word count threshold | Filter thin entries before they reach crawlers |
| 3. Structured data | JSON-LD (Article), OGP, Twitter Card | Improve how indexed entries appear in search results |
| 4. Sitemap | ISR every hour, DB-failure-safe | Keep crawlers up to date on new public content |
The problem is unique to diary apps
Blog and e-commerce SEO operates on content that was produced with discoverability in mind. Diary apps start from the opposite premise.
Over 95% of content is private. Public visibility is an explicit opt-in. The default must always be private, and any SEO work has to respect that unconditionally.
Quality varies wildly. A single-sentence "tired today" entry and a 2,000-word reflective essay both come out of the same editor as "public diaries." Indexing the one-liner helps nobody.
The content structure is unpredictable. Storyie uses the Lexical rich-text editor, so a diary might be a stream of paragraphs, a mix of headings and lists, or a heavily nested document. We can't assume any particular shape when generating OGP previews or structured data.
Layer 1: Environment-level index control
The first and cheapest thing to get right is making sure crawlers can't reach non-production environments.
const IS_PRODUCTION =
  process.env.NODE_ENV === "production" &&
  BASE_URL.includes("storyie.com") &&
  !BASE_URL.includes("staging.storyie.com");

The staging environment runs on a subdomain of storyie.com, so its BASE_URL also contains "storyie.com". SST deploys each environment with its own domain but the same build configuration, so NODE_ENV is "production" on staging too. Without the explicit !BASE_URL.includes("staging.storyie.com") guard, the first two checks would both pass and the staging robots.ts would allow crawlers. Staging content appearing in the production index is the kind of duplicate-content incident that stays invisible until it surfaces in Search Console six months later.
Paths that are always disallowed in production: /api/, /login, /account-name-setup, and anything under the dashboard. We're not trying to hide these from users; we just have no reason to send crawlers there.
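The environment check and the path list can be folded into one pure function that a robots.ts could call. This is a minimal sketch under assumptions: isProduction, buildRobots, and the RobotsRules type are hypothetical names, the type only loosely mirrors the object a Next.js robots.ts returns, and "/dashboard/" is an assumed prefix, since the post only says "anything under the dashboard."

```typescript
// Hypothetical local type; loosely mirrors what a Next.js robots.ts returns.
type RobotsRules = {
  rules: { userAgent: string; allow?: string; disallow?: string | string[] };
  sitemap?: string;
};

// The same three-part check as IS_PRODUCTION, parameterized for testing.
function isProduction(nodeEnv: string | undefined, baseUrl: string): boolean {
  return (
    nodeEnv === "production" &&
    baseUrl.includes("storyie.com") &&
    !baseUrl.includes("staging.storyie.com")
  );
}

function buildRobots(nodeEnv: string | undefined, baseUrl: string): RobotsRules {
  if (!isProduction(nodeEnv, baseUrl)) {
    // Non-production: block every crawler from every path.
    return { rules: { userAgent: "*", disallow: "/" } };
  }
  return {
    rules: {
      userAgent: "*",
      allow: "/",
      // "/dashboard/" is an assumed prefix for the dashboard routes.
      disallow: ["/api/", "/login", "/account-name-setup", "/dashboard/"],
    },
    sitemap: `${baseUrl}/sitemap.xml`,
  };
}
```

Keeping the decision in a function of (nodeEnv, baseUrl) makes the staging case cheap to cover with a unit test instead of discovering it in Search Console.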
Layer 2: Content quality gate
This is where the diary app specifics really show up. Even among legitimately public entries, we apply a threshold before allowing indexing.
export const SEO_CONTENT_QUALITY_THRESHOLDS = {
  MIN_CHARACTERS: 800,
  MIN_WORDS: 150,
} as const;

Both conditions must pass:
export function shouldIndexDiary(diary: {
  visibility: string;
  content: DiaryContent | string;
}): boolean {
  // Private entries are never indexed, regardless of length.
  if (diary.visibility !== "public") return false;
  return evaluateContentQuality(diary.content);
}

The AND condition is intentional. Japanese has very few space-delimited tokens: a solid 800-character entry in Japanese might only tokenize to 20 "words," so a pure word count would discard it. English goes the other way: word count is a meaningful signal, but character count alone doesn't rule out padding. Running both thresholds means CJK entries clear primarily on character count and Latin-script entries clear primarily on word count, while degenerate cases on either end fail both.
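evaluateContentQuality itself is not shown here. For a strict AND to admit a Japanese entry that splits into only a handful of whitespace tokens, the word counter has to treat CJK text differently. One option, sketched below, is to count each CJK character as a word unit. That counting rule is an assumption of this sketch, not Storyie's confirmed implementation, and the function names are hypothetical.

```typescript
const THRESHOLDS = { MIN_CHARACTERS: 800, MIN_WORDS: 150 } as const;

// Hiragana, Katakana, CJK Extension A, and CJK Unified Ideographs (BMP only).
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]/g;

// Length in UTF-16 code units after removing all whitespace.
function countCharacters(text: string): number {
  return text.replace(/\s+/g, "").length;
}

// ASSUMPTION: each CJK character is one word unit; the remaining text is
// split on whitespace. The real tokenizer is not shown in the post.
function countWords(text: string): number {
  const cjkUnits = (text.match(CJK_CHAR) ?? []).length;
  const rest = text.replace(CJK_CHAR, " ").trim();
  const spaceDelimited = rest === "" ? 0 : rest.split(/\s+/).length;
  return cjkUnits + spaceDelimited;
}

// Both thresholds must pass (the AND condition).
function evaluateContentQualitySketch(text: string): boolean {
  return (
    countCharacters(text) >= THRESHOLDS.MIN_CHARACTERS &&
    countWords(text) >= THRESHOLDS.MIN_WORDS
  );
}
```

Under this rule an 800-character Japanese entry passes both checks, a long run of characters with no spaces fails the word check, and a one-line entry fails both.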
The result feeds directly into generateMetadata:
robots: { index: shouldIndex, follow: true }

Note that follow is always true. A short diary might still link to a longer public entry or a profile that does clear the quality bar. Dropping follow on noindex pages would cut off that traversal path unnecessarily. We control the page's own search presence independently from the crawl graph.
Layer 3: Structured data from Lexical JSON
Server-safe plain-text extraction
OGP previews, meta descriptions, quality evaluation, and sitemap lastModified dates all need the same thing: a plain-text representation of a Lexical EditorState. We wrote a single utility for this early and used it everywhere:
// lexicalJsonToPlainTextServerSafe: no Lexical runtime dependency
// Minimal assumed shape of a serialized Lexical node.
type LexicalNode = { type: string; text?: string; children?: LexicalNode[] };

function extractText(node: LexicalNode): string {
  // Leaf text nodes carry the content; fall back to "" if text is missing.
  if (node.type === "text") return node.text ?? "";
  if (!node.children) return "";
  return node.children.map(extractText).join(" ");
}

The utility recursively walks the Lexical JSON tree and concatenates TextNode values. It imports nothing from Lexical. That matters because the OGP image generation path runs under Node during ISR: there's no DOM, and pulling in the full Lexical runtime for a 150-character preview would inflate cold-start cost for no reason.
We built this utility early in the project. In retrospect, it should have been one of the first things written — it shows up in far more places than we anticipated.
OGP image generation
Each public diary gets a dynamically generated OGP image via Next.js's opengraph-image.tsx:
apps/web/app/(public)/u/[slug]/diary/[diarySlug]/opengraph-image.tsx

The image pulls the first 150 characters of extracted plain text as a preview. Generation happens at request time and is cached by Next.js's OGP image infrastructure.
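Taking "the first 150 characters" is worth doing by code point and not by UTF-16 unit, because diaries often contain emoji and a naive slice can cut a surrogate pair in half. A minimal sketch, assuming a hypothetical buildPreview helper and an appended ellipsis (neither is confirmed by the post):

```typescript
// Hypothetical helper: collapse whitespace, then keep at most `maxChars`
// code points so a surrogate pair (for example an emoji) is never split.
function buildPreview(plainText: string, maxChars = 150): string {
  const normalized = plainText.replace(/\s+/g, " ").trim();
  const codePoints = Array.from(normalized); // iterates by code point
  if (codePoints.length <= maxChars) return normalized;
  // The trailing ellipsis is an assumed presentation choice.
  return codePoints.slice(0, maxChars).join("") + "…";
}
```

The same helper could plausibly feed the meta description, which is one reason a single plain-text path pays off.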
JSON-LD structured data
Public diary pages include Article structured data:
const structuredData = {
  "@context": "https://schema.org",
  "@type": "Article",
  headline: title,
  description: description,
  datePublished: datePublished,
  dateModified: dateModified,
  author: { "@type": "Person", name: author.name, url: author.url },
  publisher: { "@type": "Organization", name: "Storyie", url: "https://storyie.com" },
};

We considered BlogPosting and CreativeWork. Article won on pragmatic grounds: it has the widest Rich Results support in Google's documentation. For a small team, the right heuristic is to pick the schema type Google actually tests against.
One security note worth being explicit about: because diary content is user-generated, we sanitize the JSON-LD before embedding it via dangerouslySetInnerHTML. Specifically, every < in a user-provided field is replaced with its Unicode escape \u003c, so a </script> sequence in a diary title cannot close the script block early. UGC in <script> tags is a real injection surface; this is not optional.
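The escape can be applied once to the serialized string, which covers every field at the same time. A minimal sketch, assuming a hypothetical serializeJsonLd helper:

```typescript
// Hypothetical helper: serialize JSON-LD for embedding inside a <script> tag.
// JSON.stringify leaves "<" untouched, so "</script>" in a title would end the
// script element. "\u003c" decodes to the same character in any JSON parser.
function serializeJsonLd(data: Record<string, unknown>): string {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}
```

The returned string is what would be passed as __html to dangerouslySetInnerHTML on a script element of type application/ld+json. Consumers that parse the JSON still see the original title.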
Canonical URLs
User content is reachable at both storyie.com/u/[slug]/... and [slug].storyie.com/.... We set the canonical to the /u/[slug] path in generateMetadata. Without an explicit canonical, both versions would compete for the same index position and fragment link equity — a predictable pitfall for multi-tenant apps that support custom subdomains.
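The mapping from either URL form to the canonical one can be sketched as a pure function. The assumptions are explicit: canonicalFor is a hypothetical name, the path after the subdomain is assumed to match the path after /u/[slug], and "www" is assumed not to be a user slug.

```typescript
const ROOT_HOST = "storyie.com";

// Hypothetical helper: takes the request host and path, returns the canonical URL.
function canonicalFor(host: string, pathname: string): string {
  const suffix = `.${ROOT_HOST}`;
  const hostname = host.toLowerCase();
  if (hostname.endsWith(suffix)) {
    const slug = hostname.slice(0, hostname.length - suffix.length);
    // ASSUMPTION: "www" is a reserved subdomain, not a user slug.
    if (slug !== "www") {
      const rest = pathname === "/" ? "" : pathname;
      // Subdomain form: [slug].storyie.com/... -> storyie.com/u/[slug]/...
      return `https://${ROOT_HOST}/u/${slug}${rest}`;
    }
  }
  // Already on the root host (or www): keep the path as-is.
  return `https://${ROOT_HOST}${pathname}`;
}
```

In generateMetadata, the result would go into the alternates.canonical field of the returned metadata.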
Layer 4: ISR sitemap
The sitemap runs on a one-hour ISR cycle:
export const revalidate = 3600;

Static pages, blog posts, public diaries, public notes, and user profiles all land in a single sitemap.xml. For diaries and notes, we fetch the 100 most recent public entries from Supabase:
const [publicDiaries, publicNotes] = await Promise.all([
  diaryQueries.getPublicDiaries(100),
  noteQueries.getPublicNotes(100),
]);

The 100-entry cap is a deliberate tradeoff between crawl coverage and query cost. Fetching every public entry on every ISR cycle would produce increasingly expensive queries as the content library grows. When the volume justifies it, we'll split to a sitemap-index.xml with child sitemaps, which is the standard pattern for large UGC catalogs.
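For a sense of what that split involves: the sitemap protocol caps each file at 50,000 URLs, so the index needs one child sitemap per 50,000 entries. A minimal sketch of that arithmetic, assuming a hypothetical childSitemapUrls helper and an assumed /sitemap/{id}.xml URL pattern for the child files:

```typescript
// The sitemap protocol allows at most 50,000 URLs per sitemap file.
const MAX_URLS_PER_SITEMAP = 50000;

// Hypothetical helper: list the child sitemap URLs needed for `totalUrls` entries.
function childSitemapUrls(
  baseUrl: string,
  totalUrls: number,
  perFile: number = MAX_URLS_PER_SITEMAP
): string[] {
  // Always emit at least one child so the index is never empty.
  const count = Math.max(1, Math.ceil(totalUrls / perFile));
  // "/sitemap/{id}.xml" is an assumed URL pattern for the child files.
  return Array.from({ length: count }, (_, id) => `${baseUrl}/sitemap/${id}.xml`);
}
```

Each child would then query only its own slice of entries, which keeps the per-cycle query cost bounded even as the catalog grows.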
Failure mode handling
Sitemap generation can fail if the DB is temporarily unreachable. Our catch block returns the homepage entry rather than nothing:
catch (error) {
  console.error("Error generating sitemap:", error);
  // Fall back to a homepage-only sitemap so the response is never empty.
  return [{ url: BASE_URL, lastModified: new Date(), changeFrequency: "daily", priority: 1.0 }];
}

An empty sitemap tells crawlers the site has no content. Even a transient DB failure could reduce crawl frequency if the sitemap consistently returns empty. Returning at least the homepage keeps the sitemap valid through the outage without requiring a fallback store or a cache layer.
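A possible refinement, which is a variation and not the catch block above, is to let each data source fail independently. With Promise.allSettled in place of Promise.all, a failing notes query would not discard diaries that loaded fine. The helper and type names below are hypothetical.

```typescript
type SitemapEntry = { url: string; lastModified: Date };

// Hypothetical helper: run independent async fetchers and keep whatever succeeds.
// The homepage entry is always included, so the result is never empty.
async function collectEntries(
  homepage: SitemapEntry,
  fetchers: Array<() => Promise<SitemapEntry[]>>
): Promise<SitemapEntry[]> {
  const results = await Promise.allSettled(fetchers.map((run) => run()));
  const entries: SitemapEntry[] = [homepage];
  for (const result of results) {
    if (result.status === "fulfilled") {
      entries.push(...result.value);
    } else {
      // Log and continue; one failed source should not drop the others.
      console.error("Sitemap source failed:", result.reason);
    }
  }
  return entries;
}
```

If every source fails, this degrades to the same homepage-only sitemap as the catch block.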
Things that bit us
noindex requires conviction. The instinct is to index everything and let Google sort it out. In practice, a high volume of thin pages degrades the domain's overall crawl quality. After adding the quality gate, crawl frequency stabilized noticeably — something we confirmed in Search Console. Filtering aggressively early is less risky than it feels.
Plan for plain-text extraction before you need it. We built lexicalJsonToPlainTextServerSafe relatively early, but we still had to retrofit it into a couple of places. If you're building on Lexical for a public-content product, this utility should be among your first infrastructure pieces, not an afterthought.
Staging robots.ts is Day 1 infrastructure. It's easy to assume that a staging URL no one knows about won't get indexed. Google finds subdomains through certificate transparency logs, link discovery, and referrer headers. Explicit Disallow: / in a non-production robots.ts is not optional.
Related Posts
- Building a Monorepo with pnpm and TypeScript (workspace conventions and the cross-platform package rules that let us share utilities like lexicalJsonToPlainTextServerSafe cleanly)
- Cross-platform Lexical with use dom: monorepo gains and the bridges you still own (how Lexical's JSON serialization format works across Next.js and Expo, which is also what the plain-text extraction utility reads)
Try Storyie
When you write a diary and make it public on storyie.com, all four layers fire: the entry clears the quality gate, gets its own OGP image and JSON-LD markup, lands in the next sitemap generation, and becomes findable in search results — while every private entry stays completely invisible to crawlers. That's the goal: full discoverability for the content you choose to share, zero visibility for everything else.