What is llms.txt and why does it matter for discoverability?

llms.txt is a plain-text file placed at the root of a website (e.g. /llms.txt) that describes the site in Markdown — what it is, what content it has, and where the important pages live. It is to AI assistants what robots.txt is to search engine crawlers, except instead of controlling access it provides structured context. When a user asks an AI assistant "recommend a journaling app," the assistant may retrieve context from the site before answering. llms.txt makes that retrieval cheap and accurate: instead of parsing HTML, the AI gets a concise, structured summary. The spec was proposed in 2024 and has already been adopted by Stripe, Cloudflare, and Anthropic, which gives it enough momentum to treat as a de facto standard worth implementing now.

Why use a Route Handler instead of a static public/llms.txt file?

A static file is fine for blogs or marketing sites where content changes slowly. Storyie has user-generated content — public diaries and notes that change whenever a user publishes something new. A static file would go stale immediately. Using a Next.js Route Handler with ISR (revalidate = 3600) means the file is regenerated from the database at most once per hour, automatically. The DB query runs at revalidation time, not on every request, so there is no per-request latency cost. For UGC apps, dynamic generation is the only option that stays accurate.

How does the Markdown route handler avoid duplicate content penalties in search?

Each public diary already has a canonical HTML page at /diary/[slug]. The Markdown endpoint at /diary/[slug].md is intended for AI crawlers, not for search engines. We add X-Robots-Tag: noindex to every Markdown response, which tells Google and other crawlers not to index that URL. The HTML version remains the canonical page for SEO. This means the same content is accessible at two URLs but only one is indexed — the right outcome for serving both search engines and LLM crawlers optimally.

How is privacy handled — can a private diary appear in llms.txt?

No. Privacy is enforced at the database query layer. The queries that populate llms.txt (getPublicDiarySummaries, getPublicNoteSummaries) filter on is_public = true at the SQL level, before results are serialized into the file. The Markdown route handler (getPublicDiaryBySlugWithAuthor) applies the same filter — even if someone constructs a .md URL for a private diary slug, the query returns nothing and the handler returns 404. The application-layer filter pairs with Supabase RLS at the database layer for defense in depth.

What does the Lexical JSON-to-Markdown conversion look like in practice?

Storyie stores diary content as Lexical editor state — a JSON tree with typed nodes (HeadingNode, ListNode, etc.). To serve that content as Markdown we call lexicalJsonToMarkdown() from @storyie/lexical-common, the platform-agnostic package shared between web and Expo. That function walks the node tree and emits the appropriate Markdown syntax. Keeping the conversion logic in the shared package is deliberate: the package already understands the full node schema, so changes to custom node types only need to be handled in one place rather than once per output format.

What caching strategy works best for AI crawler traffic?

AI crawler access patterns are unpredictable: bursts can happen at any time, and the crawlers generally tolerate slightly stale content. We use Cache-Control: public, s-maxage=3600, stale-while-revalidate=86400 on Markdown responses. s-maxage=3600 lets CDN edge nodes treat a response as fresh for one hour; stale-while-revalidate=86400 lets them keep serving the cached copy for up to 24 hours after that while revalidation happens in the background. The result is that nearly every AI request hits the CDN rather than the origin, DB load stays low, and the content served is at most about 25 hours old, which is acceptable for a diary app where the goal is discoverability, not real-time accuracy.

Making Storyie discoverable to AI: llms.txt and Markdown route handlers in Next.js

Websites have robots.txt to tell search engine crawlers what they can access, and sitemap.xml to tell them what exists. Neither helps much when an AI assistant is trying to understand what a site is about.

Since late 2024 an answer to that gap has been gaining traction: llms.txt. The idea is simple: put a plain Markdown file at /llms.txt that describes your site's purpose, structure, and key content. AI assistants retrieving context before answering queries can read it far more efficiently than parsing your HTML.

We added llms.txt and a set of Markdown route handlers to Storyie's Next.js app. This post covers the implementation decisions: why we generate dynamically instead of statically, how Lexical JSON becomes Markdown, how we keep private diaries private, and what caching strategy makes sense for AI crawler traffic.

TL;DR

llms.txt is to AI assistants what robots.txt is to search crawlers — a structured, machine-readable description of your site.
For a UGC app, generate it dynamically via a Next.js Route Handler with ISR (revalidate = 3600) rather than committing a static file.
Serve public diary content as Markdown at .md URLs for LLM readability; add X-Robots-Tag: noindex to prevent search engine duplication.
Filter for public-only content at the query layer — privacy enforcement belongs in SQL, not in the serialization step.
Use stale-while-revalidate caching: AI crawlers tolerate mildly stale content, and unpredictable burst traffic should hit the CDN, not the database.

Concern	Approach
llms.txt freshness	Route Handler + `revalidate = 3600` (ISR)
Content format for LLMs	Markdown with YAML frontmatter at `/diary/[slug].md`
Lexical → Markdown	`lexicalJsonToMarkdown()` from `@storyie/lexical-common`
Privacy	`is_public = true` filter at the SQL query layer
Search engine dedup	`X-Robots-Tag: noindex` on every Markdown response
Cache strategy	`s-maxage=3600, stale-while-revalidate=86400`

What llms.txt actually is

The file lives at the site root (/llms.txt) and uses a defined Markdown structure:

# Site Name

> One-line summary

Longer description.

## Section

- [Page Name](URL): description
- [Page Name](URL): description

## Optional

- [Privacy Policy](URL): legal details

Where robots.txt controls crawler access, llms.txt provides context. ChatGPT, Claude, and Perplexity can read a well-written llms.txt in a fraction of the tokens it would take to parse the site's HTML, and the structure makes it easier to extract the right information. The spec is still informal but adoption by Stripe, Cloudflare, and Anthropic is enough momentum to treat it as worth implementing.
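As a rough illustration of the structure above, the sketch below renders that layout from typed data. The type and function names (LlmsDocument, renderLlmsTxt, and so on) are invented for this example and are not part of any llms.txt tooling or of Storyie's code.

```typescript
// Hypothetical types for this sketch; not part of any llms.txt tooling.
interface LlmsLink {
  title: string;
  url: string;
  description?: string;
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

interface LlmsDocument {
  name: string;
  summary: string;
  description: string;
  sections: LlmsSection[];
}

// Render one link as a Markdown list item, adding ": description" when present.
function renderLink(link: LlmsLink): string {
  const base = `- [${link.title}](${link.url})`;
  return link.description ? `${base}: ${link.description}` : base;
}

// Assemble the H1 title, blockquote summary, description, and H2 sections.
function renderLlmsTxt(doc: LlmsDocument): string {
  const sections = doc.sections.map(
    (section) => `## ${section.heading}\n\n${section.links.map(renderLink).join("\n")}`,
  );
  return [`# ${doc.name}`, `> ${doc.summary}`, doc.description, ...sections].join("\n\n") + "\n";
}

// Example input using placeholder values.
const example = renderLlmsTxt({
  name: "Example Site",
  summary: "One-line summary",
  description: "Longer description.",
  sections: [
    {
      heading: "Docs",
      links: [{ title: "Guide", url: "https://example.com/guide.md", description: "getting started" }],
    },
  ],
});
```

Keeping the rendering in one pure function makes the output easy to check in a unit test before it is wired into a route.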

Why dynamic generation

For a static blog, public/llms.txt is entirely reasonable. Storyie has user-generated content: public diaries and notes that appear and disappear as users change their visibility settings. A static file committed to the repo would go stale as soon as a user published or hid an entry, and would stay stale until the next deployment.

We use a Route Handler with ISR instead:

// app/llms.txt/route.ts
import { diaryQueries } from "@/lib/db/queries/diary";
import { noteQueries } from "@/lib/db/queries/notes";
// baseUrl and formatDate are app-level helpers; their definitions are not shown in this post.

export const revalidate = 3600; // ISR: regenerate at most once per hour

export async function GET() {
  const [publicDiaries, publicNotes] = await Promise.all([
    diaryQueries.getPublicDiarySummaries(100),
    noteQueries.getPublicNoteSummaries(100),
  ]);

  const content = `# Storyie

> A diary and storytelling platform where personal thoughts become shareable stories.

Storyie is a cross-platform journaling app ...

## Public Diaries

${publicDiaries.map(({ diary, author }) =>
  `- [${author?.slug}'s Diary - ${formatDate(diary.diaryDatetime)}](${baseUrl}/diary/${diary.slug}.md)`
).join("\n")}

## Public Notes

${publicNotes.map(({ note }) =>
  `- [${note.title || "Untitled Note"}](${baseUrl}/note/${note.slug}.md)`
).join("\n")}
`;

  return new Response(content, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

With revalidate = 3600, the generated file is cached for an hour. The first request after that hour is still served the cached copy and triggers a regeneration in the background; later requests get the fresh version. The DB query runs once per window rather than once per request, so the cost is negligible.

The llms.txt structure we settled on for Storyie:

Section	Content	Purpose
Header	Site name and one-line pitch	Immediate intent signal for the AI
Features	Links to key pages	Full feature surface at a glance
Blog	Links to engineering posts	Detailed context about how the service works
Public Diaries	Dynamic list of public diaries	UGC content for the AI to reference
Public Notes	Dynamic list of public notes	UGC content for the AI to reference
Optional	Privacy policy, terms	Available if the AI needs legal context

Markdown route handlers

The links in llms.txt point to .md URLs. That means we need endpoints that actually return diary content as Markdown.

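// app/(public)/diary/[slug].md/route.ts
import { diaryQueries } from "@/lib/db/queries/diary";
import { contentToMarkdown } from "@/lib/utils/markdown-utils";
// extractTitle, yamlEscape, formatDate, and baseUrl are app helpers; their import paths are not shown in this post.
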
export async function GET(_req: Request, { params }: { params: Promise<{ slug: string }> }) {
  const { slug } = await params;
  const entry = await diaryQueries.getPublicDiaryBySlugWithAuthor(slug);

  if (!entry) {
    return new Response("Not found", { status: 404 });
  }

  const { diary, author } = entry;
  const title = extractTitle(diary.content) ?? `${author?.slug}'s Diary`;
  const markdown = contentToMarkdown(diary.content);

  const frontmatter = `---
title: ${yamlEscape(title)}
author: ${yamlEscape(author?.slug ?? "")}
published: ${formatDate(diary.diaryDatetime)}
url: ${baseUrl}/diary/${diary.slug}.md
---

`;

  return new Response(frontmatter + markdown, {
    headers: {
      "Content-Type": "text/markdown; charset=utf-8",
      "Cache-Control": "public, s-maxage=3600, stale-while-revalidate=86400",
      "X-Robots-Tag": "noindex",
    },
  });
}

Lexical JSON to Markdown

Storyie's editor is Lexical-based, so diary content is stored as a JSON tree — HeadingNode, ListNode, ParagraphNode, and so on. The shared package @storyie/lexical-common already contains a lexicalJsonToMarkdown() function that walks that tree and emits Markdown. Keeping the conversion there is deliberate: the package owns the node schema, so any new custom node type only needs a Markdown serializer added in one place.

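// lib/utils/markdown-utils.ts
import { lexicalJsonToMarkdown } from "@storyie/lexical-common";

export function contentToMarkdown(content: unknown): string {
  // Guard against null or non-object content before reading properties
  if (typeof content !== "object" || content === null) return "";
  const c = content as Record<string, unknown>;
  if (c.root) {
    // Lexical editor state: pass the serialized JSON string to the shared converter
    return lexicalJsonToMarkdown(JSON.stringify(content));
  }
  if (typeof c.text === "string") {
    // Fallback for content that carries a plain text field
    return c.text;
  }
  return "";
}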
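To make the tree walk concrete, here is a minimal sketch of a converter for three block types. It is not the real lexicalJsonToMarkdown() from @storyie/lexical-common, whose implementation is not shown in this post. The node shape (SketchNode) is a simplified assumption about serialized Lexical state, and all function names are invented for illustration.

```typescript
// Assumed, simplified shape of a serialized Lexical node (illustration only).
interface SketchNode {
  type: string;
  text?: string; // present on text nodes
  tag?: string; // "h1".."h6" on heading nodes
  listType?: string; // "bullet" or "number" on list nodes
  children?: SketchNode[];
}

// Concatenate the text of all descendant text nodes.
function inlineText(node: SketchNode): string {
  if (node.type === "text") return node.text ?? "";
  return (node.children ?? []).map(inlineText).join("");
}

// Convert one top-level block node to Markdown; unknown types fall back to plain text.
function blockToMarkdown(node: SketchNode): string {
  switch (node.type) {
    case "heading": {
      // "h2" -> 2; default to level 1 if the tag is missing or malformed
      const level = Number((node.tag ?? "h1").slice(1)) || 1;
      return `${"#".repeat(level)} ${inlineText(node)}`;
    }
    case "list":
      return (node.children ?? [])
        .map((item, i) => `${node.listType === "number" ? `${i + 1}.` : "-"} ${inlineText(item)}`)
        .join("\n");
    default:
      return inlineText(node);
  }
}

// Walk root.children and join the non-empty blocks with blank lines.
function sketchToMarkdown(state: { root: SketchNode }): string {
  return (state.root.children ?? []).map(blockToMarkdown).filter(Boolean).join("\n\n");
}
```

A production converter also has to handle inline formatting, links, nested lists, and custom node types, which is the argument for keeping it next to the node schema in the shared package.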

YAML frontmatter for structured metadata

The frontmatter at the top of each Markdown response gives AI assistants structured access to metadata (title, author, publication date, and the URL of the Markdown document) without requiring them to parse the prose. User input goes directly into that YAML, so escaping is non-negotiable:

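export function yamlEscape(value: string): string {
  // Quote values a YAML parser could misread: indicator characters, a leading
  // "-", "%", "@" or "`", surrounding whitespace, or an empty string.
  const needsQuotes =
    value === "" ||
    /[:#"'\n\r\t[\]{}|>!&*?,]/.test(value) ||
    /^[-%@`]/.test(value) ||
    value.trim() !== value;
  // JSON string syntax is also valid YAML double-quoted syntax, so
  // JSON.stringify escapes quotes, backslashes, and newlines in one step.
  return needsQuotes ? JSON.stringify(value) : value;
}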
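To show why the escaping matters, the sketch below builds a frontmatter block from a title that tries to close the frontmatter early. It is a simplified illustration, not Storyie's code: quoteYaml and buildFrontmatter are hypothetical helpers, and the sketch quotes every value unconditionally, relying on the assumption that YAML 1.2 parsers accept JSON-style double-quoted strings.

```typescript
// Hypothetical helper for this sketch: always emit a double-quoted YAML scalar.
// JSON.stringify escapes quotes, backslashes, and newlines.
function quoteYaml(value: string): string {
  return JSON.stringify(value);
}

// Build a frontmatter block from ordered key/value pairs,
// followed by one blank line before the document body.
function buildFrontmatter(fields: Array<[string, string]>): string {
  const lines = fields.map(([key, value]) => `${key}: ${quoteYaml(value)}`);
  return ["---", ...lines, "---", "", ""].join("\n");
}

// A title that tries to end the frontmatter and inject its own key.
const hostileTitle = "My day\n---\nadmin: true";
const safeFrontmatter = buildFrontmatter([
  ["title", hostileTitle],
  ["author", "alice"],
]);
```

Because the newlines in the title are emitted as the two-character sequence \n, the whole title stays on a single line and the only --- delimiters in the output are the ones the builder wrote.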

Title extraction from content

Diaries in Storyie have no separate title field — the first heading in the content serves as the title. The extractor walks the Lexical node tree to find it:

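// Minimal node shape assumed for this example; real Lexical nodes carry more fields
type LexicalNode = { type?: string; text?: string; children?: LexicalNode[] };

export function extractTitle(content: unknown): string | null {
  // Lexical editor state keeps its top-level nodes under root.children
  const root = (content as { root?: LexicalNode } | null)?.root;
  // Walk root.children looking for the first non-empty heading node
  for (const node of root?.children ?? []) {
    if (node.type === "heading" && node.children) {
      const text = node.children.map((child) => child.text ?? "").join("");
      if (text.trim()) return text.trim();
    }
  }
  return null;
}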

If there is no heading, extractTitle returns null and the route handler falls back to a title built from the author's slug: "${author?.slug}'s Diary".

X-Robots-Tag: noindex

The Markdown endpoint is for AI assistants. It is not the canonical URL for the diary; the HTML page at /diary/[slug] is. Without noindex, Google could index both URLs, creating a duplicate content problem. X-Robots-Tag: noindex on the Markdown response tells search engines to keep that URL out of their index. It does not block fetching, so AI crawlers and assistants can still read the Markdown.

Privacy

Every query that feeds into llms.txt or the Markdown route handler filters on is_public = true at the SQL level. There is no post-query filtering step that could be bypassed:

// diary.ts
export const diaryQueries = {
  getPublicDiarySummaries: async (limit: number) => {
    // Only rows where is_public = true
  },
  getPublicDiaryBySlugWithAuthor: async (slug: string) => {
    // Only rows where is_public = true AND slug matches
  },
};

A request to /diary/some-private-slug.md returns 404 because the query finds nothing — the filter happens before any content is serialized. This pairs with Supabase RLS at the database layer. Two independent enforcement points for the same rule.
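The behavior described above can be sketched with an in-memory stand-in for the table. This is an illustration only: the DiaryRow shape and the function names are hypothetical, and the real queries run as SQL against the database rather than over an array.

```typescript
// Hypothetical row shape; the real table and column names may differ.
interface DiaryRow {
  slug: string;
  isPublic: boolean;
  content: string;
}

// Stand-in for the SQL predicate `is_public = true AND slug = $1`:
// the visibility check is part of the lookup, not a later step.
function findPublicDiary(rows: DiaryRow[], slug: string): DiaryRow | null {
  return rows.find((row) => row.isPublic && row.slug === slug) ?? null;
}

// Map a lookup result to the HTTP status the route handler would return.
function statusFor(rows: DiaryRow[], slug: string): number {
  return findPublicDiary(rows, slug) ? 200 : 404;
}
```

A private slug and a nonexistent slug produce the same 404, so the response does not reveal whether a private diary exists at that URL.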

Cache strategy for AI crawlers

AI crawler traffic is unpredictable. A mention in a widely-used AI assistant's context window can trigger bursts of requests at any time, and those bursts should hit the CDN edge, not the database.

The Markdown responses use:

Cache-Control: public, s-maxage=3600, stale-while-revalidate=86400

s-maxage=3600 gives CDN nodes a one-hour fresh window. stale-while-revalidate=86400 adds a further 24 hours during which the CDN may return the stale cached version immediately while fetching a fresh copy behind the scenes. In the worst case a response is about 25 hours old. For a diary app, that staleness is acceptable: the goal of these endpoints is discoverability, not real-time accuracy.
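The two directives add up rather than overlap, which is easy to misread. The sketch below classifies a cached response by its age. It is a simplified model for reasoning about the header, not code from any CDN, and the names cacheState and maxServedAge are invented for this example.

```typescript
type CacheState = "fresh" | "stale-while-revalidate" | "expired";

// Classify a cached response by its age under
// `s-maxage=<sMaxAge>, stale-while-revalidate=<swr>` (all values in seconds).
function cacheState(ageSeconds: number, sMaxAge: number, swr: number): CacheState {
  if (ageSeconds < sMaxAge) return "fresh"; // served from cache, no origin contact
  if (ageSeconds < sMaxAge + swr) return "stale-while-revalidate"; // served stale, refreshed in background
  return "expired"; // the cache must go to the origin before responding
}

// Oldest age at which a shared cache may still serve the response.
function maxServedAge(sMaxAge: number, swr: number): number {
  return sMaxAge + swr;
}
```

With the values used here, maxServedAge(3600, 86400) is 90000 seconds, or 25 hours.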

What we learned

llms.txt is cheap to add and hard to justify skipping. A single Route Handler and an hour of writing good descriptive copy is the entire implementation cost. The marginal value — being accurately represented when someone asks an AI assistant about journaling tools — is hard to measure but straightforward to reason about.

Dynamic generation is the right default for UGC apps. The static-file approach only works if your content doesn't change. Any app with user-generated public content needs to regenerate the file from the database. ISR makes that cheap.

The Markdown conversion layer earns its place in the shared package. Because @storyie/lexical-common already owns the node schema, adding Markdown serialization there means every surface — web rendering, Expo display, and LLM output — uses the same logic. A new custom node type gets a Markdown serializer once and works everywhere.

noindex + LLM access is an explicit design choice, not a side effect. Search engines see the HTML version; AI crawlers see the Markdown version. That separation lets us optimize each output for its consumer without either interfering with the other.

The spec is still evolving, but momentum is real. llms.txt was proposed in 2024 and is not yet formally standardized. Stripe, Cloudflare, and Anthropic have adopted it anyway. Implementing it now costs almost nothing; waiting until it is formalized costs discoverability in the meantime.

Related Posts

Cross-platform Lexical with use dom: monorepo gains and the bridges you still own — how @storyie/lexical-common is structured and why the shared package owns the node schema
Building a Monorepo with pnpm and TypeScript — workspace conventions and cross-package dependency rules

Try Storyie

If you want to see what this looks like in production, visit storyie.com/llms.txt and compare it to the iOS app. The same diary content that renders as rich text in the app is served as structured Markdown for any AI assistant that wants to read it.
