#004 Learning

I spent a day researching SEO instead of shipping code. Here's why that was the right call.

robots.txt: 404. Sitemap: pointing to a domain I don't own. AI crawlers: couldn't see me. I stopped shipping code and spent a day reading documentation. Best decision all week.

Four days in. Site works. Three Field Notes published, nine tools documented, Contentful delivering content, Vercel deploying it. The machine is running.

So naturally I did the thing that feels completely wrong when you're building in public. I stopped building. I opened a bunch of documentation tabs and started reading.

Here's what I found and why I'm glad I looked before I shipped anything else.


Why I stopped shipping

I googled my own site. The result looked great. Right title, right description. Google had already re-crawled since I set up the Contentful SEO entry yesterday. Nice.

Then I ran two curls and the nice feeling disappeared.

curl -s https://www.fieldnotes-ai.com/robots.txt returned a 404. Not a text file. A full HTML error page. Every crawler hitting the site gets zero guidance on what to do.

curl -s https://www.fieldnotes-ai.com/sitemap.xml returned valid XML, which was encouraging until I actually read the URLs. Every single one pointed to https://fieldnotes.ai. No hyphen. No www. A domain I don't even own. My sitemap was giving Google directions to someone else's house.
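That second check is easy to automate. Here's a minimal sketch, assuming the sitemap uses plain <loc> elements. findForeignSitemapUrls is a hypothetical helper, not something from the site's repo, and the /field-notes/001 path in the sample is made up for illustration.

```typescript
// Hypothetical helper: list every <loc> URL in a sitemap whose origin
// differs from the origin the site is actually served from.
function findForeignSitemapUrls(sitemapXml: string, expectedOrigin: string): string[] {
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const foreign: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    const loc = match[1];
    // URL.origin is scheme + host (+ port), so a missing "www." or hyphen counts.
    if (loc && new URL(loc).origin !== expectedOrigin) {
      foreign.push(loc);
    }
  }
  return foreign;
}

// Sample input mirroring the problem described above (second path is invented).
const sampleSitemap = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://fieldnotes.ai/</loc></url>
  <url><loc>https://www.fieldnotes-ai.com/field-notes/001</loc></url>
</urlset>`;

const wrongUrls = findForeignSitemapUrls(sampleSitemap, "https://www.fieldnotes-ai.com");
```

With the sample above, wrongUrls holds only the https://fieldnotes.ai/ entry. An empty array means every URL points at the right origin.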

Field Note #003 was missing from the sitemap entirely. The <head> still had <meta name="generator" content="v0.app"/> in it, basically wearing a sign that says "I was scaffolded." And the OG image? My 512x512 favicon. So every time someone shares a link on X or LinkedIn, the preview card looks like a broken thumbnail.

The site ranks fine for its own name. But "fine" felt like a low bar for a site that's supposed to demonstrate that I know what I'm doing.

Time to actually learn what I'm doing.


Reading my own company's SEO guide

I work at Contentful. I advise SI partners on how to implement it. And I had never read our official SEO guide end to end.

Yeah, I know. I fixed that today.

It's a 9-chapter resource from our Practice Architect and SEO Lead. The most useful takeaway: if you have multiple page types, create a dedicated SEO Metadata content type and reference it from each page type. Our documentation calls this a "fixed assembly" pattern. Define fields once, reference everywhere, keep things clean.

My Contentful space already follows this pattern (set it up in Field Note #001). But comparing what my SEO content type has today against what the guide recommends, I noticed two missing fields: SEO Title and Meta Description.

These are different from OG Title and OG Description, which I did have. Quick explanation because this confused me for a second:

SEO Title and Meta Description control what Google shows in search results. The blue link text and the grey snippet underneath. OG Title and OG Description control what shows up when someone shares your URL on X, LinkedIn, or Slack. Different audiences, different platforms, different optimization goals. You want both pairs, with the OG fields falling back to the SEO fields when left empty.
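The fallback rule in that last sentence fits in a few lines. This is a sketch of the rule only, and the field names are assumptions for illustration, not the actual Contentful field IDs. It also doesn't say anything about where in the stack the fallback gets resolved.

```typescript
// Field names are assumptions for illustration, not real Contentful field IDs.
interface SeoFields {
  seoTitle: string;
  metaDescription: string;
  ogTitle?: string;
  ogDescription?: string;
}

interface ResolvedMeta {
  title: string;
  description: string;
  ogTitle: string;
  ogDescription: string;
}

// Treat missing or whitespace-only values as "left empty".
function firstNonEmpty(preferred: string | undefined, fallback: string): string {
  return preferred !== undefined && preferred.trim() !== "" ? preferred : fallback;
}

// Search fields pass through; social fields fall back to the search fields.
function resolveMeta(fields: SeoFields): ResolvedMeta {
  return {
    title: fields.seoTitle,
    description: fields.metaDescription,
    ogTitle: firstNonEmpty(fields.ogTitle, fields.seoTitle),
    ogDescription: firstNonEmpty(fields.ogDescription, fields.metaDescription),
  };
}
```

An entry with only seoTitle and metaDescription filled in gets the same text for search results and share cards. Filling in ogTitle later overrides just the share card.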

My content type had the social sharing half but not the search engine half. Fixing that is step one of the implementation plan.


What the SEO industry actually says

I went beyond our own guide and into what the two biggest names in SEO tooling recommend.

If you've ever done keyword research or run a technical audit on a website, you've probably used Ahrefs or Semrush. They're the industry standard. I spent a few hours reading their guides on meta tags, title tags, meta descriptions, Open Graph, robots directives, canonical URLs, and structured data.

They agree on almost everything. Where they disagree is actually kind of fun: meta description length. Ahrefs says up to 160 characters. Semrush says aim for 105 because Google truncates earlier on mobile. The move: put the important stuff in the first 105 characters, use the rest as bonus space that might get cut.
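To make "important stuff first, bonus space after" concrete, here's a small sketch that splits a description at those two limits. The numbers are the rules of thumb from the guides, not guarantees about where Google truncates, and checkMetaDescription is a hypothetical helper.

```typescript
// Rules of thumb from the two guides, not guarantees about truncation.
const MOBILE_SAFE_CHARS = 105;
const DESKTOP_MAX_CHARS = 160;

interface DescriptionCheck {
  likelyVisible: string; // first 105 characters
  mayBeCut: string;      // characters 106 to 160
  overflow: number;      // how many characters sit beyond 160
}

// Note: length is counted in UTF-16 code units, which is close enough
// for plain-text descriptions.
function checkMetaDescription(description: string): DescriptionCheck {
  return {
    likelyVisible: description.slice(0, MOBILE_SAFE_CHARS),
    mayBeCut: description.slice(MOBILE_SAFE_CHARS, DESKTOP_MAX_CHARS),
    overflow: Math.max(0, description.length - DESKTOP_MAX_CHARS),
  };
}
```

If likelyVisible reads as a complete pitch on its own, the description survives either truncation point.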

Both are crystal clear that meta keywords are dead. Google hasn't looked at them since 2009. Semrush says Bing might actually penalize you for using them. My Contentful SEO type has a meta keywords field. It's staying but I'm not losing sleep over it.

For Open Graph, both say four tags are required: og:title, og:url, og:image, and og:type. The image should be 1200x630 pixels. Mine is a 512x512 favicon. Cool cool cool.
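Those rules are simple enough to turn into a check. A minimal sketch, assuming the tags and image dimensions have already been collected somehow: the OgInput shape and auditOpenGraph are hypothetical names for illustration.

```typescript
// The four tags both guides treat as required.
const REQUIRED_OG_TAGS = ["og:title", "og:url", "og:image", "og:type"];

// Recommended share-image size from the paragraph above.
const OG_IMAGE_WIDTH = 1200;
const OG_IMAGE_HEIGHT = 630;

// Hypothetical shape: tag name -> content value, plus known image dimensions.
interface OgInput {
  tags: { [property: string]: string };
  imageWidth: number;
  imageHeight: number;
}

// Returns a list of human-readable problems; an empty list means all good.
function auditOpenGraph(input: OgInput): string[] {
  const problems: string[] = [];
  for (const tag of REQUIRED_OG_TAGS) {
    const value = input.tags[tag];
    if (value === undefined || value.trim() === "") {
      problems.push(`missing ${tag}`);
    }
  }
  if (input.imageWidth !== OG_IMAGE_WIDTH || input.imageHeight !== OG_IMAGE_HEIGHT) {
    problems.push(
      `image is ${input.imageWidth}x${input.imageHeight}, expected ${OG_IMAGE_WIDTH}x${OG_IMAGE_HEIGHT}`
    );
  }
  return problems;
}
```

Run against a page with all four tags but a 512x512 favicon as the image, it reports the image size as the only problem.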

On structured data, both say JSON-LD is the way to go. For blog posts, use BlogPosting schema with headline, author, dates. It's not a ranking factor directly, but it gets you richer search result displays which means better click-through rates.
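For reference, a BlogPosting block with just those properties is small. This is a sketch under assumptions: the PostInfo shape is hypothetical and would need mapping from whatever the CMS returns, and the output is meant to go inside a script tag of type application/ld+json.

```typescript
// Input shape is an assumption; map it from whatever the CMS returns.
interface PostInfo {
  headline: string;
  authorName: string;
  datePublished: string; // ISO 8601 date string
  dateModified: string;  // ISO 8601 date string
}

// Builds the JSON text for a <script type="application/ld+json"> element.
function buildBlogPostingJsonLd(post: PostInfo): string {
  const data = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    headline: post.headline,
    author: { "@type": "Person", name: post.authorName },
    datePublished: post.datePublished,
    dateModified: post.dateModified,
  };
  // Escape "<" so a headline can never close the surrounding <script> tag.
  return JSON.stringify(data).replace(/</g, "\\u003c");
}
```

The escaping step is a defensive choice: the escaped text still parses back to the same JSON, but content from the CMS can't break out of the script element.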


AI crawlers are way more interesting than I thought

Okay, this was the rabbit hole I didn't plan on. And it turned out to be the most useful research of the day.

Every major AI company now runs multiple web crawlers. Not one bot. Multiple. Each with a different job. And you can control them separately.

OpenAI runs three: GPTBot for model training, OAI-SearchBot for ChatGPT search indexing, and ChatGPT-User for real-time fetches when someone asks ChatGPT to look something up.

Anthropic mirrors this with ClaudeBot (training), Claude-SearchBot (search quality), and Claude-User (user queries). All three respect robots.txt.

Perplexity uses PerplexityBot for indexing and Perplexity-User for live retrieval.

Here's the thing I didn't realize: I can block the training bots while allowing the search bots. One robots.txt file gives me granular control. My content gets retrieved when someone searches in ChatGPT or Claude (think of it like RAG, retrieval-augmented generation, not permanent training), but it doesn't get baked into future model weights. That's exactly what I want.
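Here's roughly what that split looks like as a generated file. A minimal sketch covering only the user agents named above, not the final rule set: buildRobotsTxt is a hypothetical helper, and the grouping reflects my reading of which bots train and which retrieve.

```typescript
// User agents named above, grouped by the job each one does.
const TRAINING_BOTS = ["GPTBot", "ClaudeBot"];
const SEARCH_AND_USER_BOTS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "Claude-SearchBot",
  "Claude-User",
  "PerplexityBot",
  "Perplexity-User",
];

function buildRobotsTxt(sitemapUrl: string): string {
  const lines: string[] = [];
  // Default group: any crawler not named below may fetch the whole site.
  lines.push("User-agent: *", "Allow: /", "");
  // One group for the training crawlers: several User-agent lines share a rule.
  for (const bot of TRAINING_BOTS) lines.push(`User-agent: ${bot}`);
  lines.push("Disallow: /", "");
  // One group for search and live-retrieval crawlers.
  for (const bot of SEARCH_AND_USER_BOTS) lines.push(`User-agent: ${bot}`);
  lines.push("Allow: /", "");
  lines.push(`Sitemap: ${sitemapUrl}`);
  return lines.join("\n") + "\n";
}

const robotsTxt = buildRobotsTxt("https://www.fieldnotes-ai.com/sitemap.xml");
```

In a Next.js App Router project the same rules could also be expressed through an app/robots.ts file. The plain-string version keeps the sketch independent of the framework.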

Google is the odd one out. There's no separate Google AI search bot. Allow Googlebot, and your content is eligible for AI Overviews automatically. Google-Extended only controls Gemini training. You can't currently appear in regular Google search while opting out of AI Overviews. Regulators are pushing on that, but for now it's all or nothing.

One thing that's easy to miss: OpenAI's crawlers don't run JavaScript. If your content only exists after client-side rendering, those crawlers can't see it. Like, at all. Next.js App Router with server components sends fully rendered HTML by default, so I'm fine here.
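A quick way to sanity-check this is to look at the raw HTML response and ask whether the text is already there. The sketch below is a rough approximation of what a crawler that doesn't run JavaScript would read. The regex-based stripping is a shortcut rather than a real HTML parser, and both sample pages are made up.

```typescript
// Rough approximation of what a non-JavaScript crawler reads: drop <script>
// and <style> blocks, strip the remaining tags, collapse whitespace.
function visibleTextWithoutJs(rawHtml: string): string {
  return rawHtml
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function isVisibleWithoutJs(rawHtml: string, phrase: string): boolean {
  return visibleTextWithoutJs(rawHtml).indexOf(phrase) !== -1;
}

// Server-rendered page: the text is already in the HTML response.
const serverRendered = "<html><body><h1>Field Note #004</h1></body></html>";
// Client-rendered page: the text only exists inside a script that must run.
const clientRendered =
  '<html><body><div id="root"></div><script>render("Field Note #004")</script></body></html>';
```

Checking both samples for "Field Note #004" gives true for the server-rendered page and false for the client-rendered one.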

But here's what I keep thinking about. Google does render JavaScript, and it indexes the rendered version of a page. ChatGPT search reportedly uses Bing's index as part of its search pipeline. So even if OpenAI's own crawlers can't run my JavaScript, couldn't ChatGPT still reach my content through a search index that has already done the rendering? Does the JavaScript rendering limitation even matter if the big search engines have already done the rendering work and made that content available?

I don't have a confirmed answer on this. But if it's true, it means the distinction between "AI bots can't render JS" and "your content is invisible to AI" might be more nuanced than people think. Something to dig into more.


Emerging standards: what's real and what's hype

I looked into the newer proposals for making websites AI-friendly. Here's the honest version.

llms.txt is the most interesting. It's a Markdown file at your site root that gives AI systems a curated overview of what's on your site. Think table of contents, but for machines. Proposed by Jeremy Howard in September 2024. Anthropic's docs site has one. Vercel has one.

But here's the thing: no major LLM provider has confirmed their crawlers actually read llms.txt files. Semrush covered this, and Google's John Mueller compared it to the deprecated keywords meta tag. It costs almost nothing to implement though, so I'm treating it as a cheap bet on where the ecosystem might go.
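"Almost nothing" is fair. Here's a sketch of a generator that follows the layout in the llms.txt proposal as I understand it: an H1 title, a blockquote summary, then H2 sections of Markdown links. The input types and buildLlmsTxt are hypothetical names for illustration.

```typescript
// Hypothetical input shape for one linked page.
interface LlmsLink {
  title: string;
  url: string;
  note?: string;
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

// Layout: H1 title, blockquote summary, then H2 sections that each hold
// a list of Markdown links with an optional note after a colon.
function buildLlmsTxt(siteName: string, summary: string, sections: LlmsSection[]): string {
  const out: string[] = [`# ${siteName}`, "", `> ${summary}`, ""];
  for (const section of sections) {
    out.push(`## ${section.heading}`, "");
    for (const link of section.links) {
      const note = link.note ? `: ${link.note}` : "";
      out.push(`- [${link.title}](${link.url})${note}`);
    }
    out.push("");
  }
  return out.join("\n");
}
```

Feeding it the list of published Field Notes would produce the whole file, so it can be regenerated whenever a new note goes live.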

ai.txt exists in at least three competing proposals. None with real adoption. Skipping entirely.

The real one to watch is the IETF AIPREF working group. Chartered in January 2025, it's building an actual internet standard for AI content preferences, with drafts co-authored by engineers from Google and Mozilla. When that ships, it'll matter. It hasn't shipped yet.

My strategy: implement what's proven (robots.txt with AI crawler rules, JSON-LD, server-side rendering). Add the cheap bets (llms.txt). Skip the rest until it solidifies.


The plan

No code shipped today. Zero PRs. The site looks exactly the same as yesterday.

But I went from "I think my SEO is probably fine" to having a concrete implementation plan for three Claude Code sessions:

Session 1 touches the Contentful content model. Adding SEO Title and Meta Description to the SEO type. Adding author name and Twitter handle to Global Settings. Adding a Last Updated field to Field Notes. Small changes, big downstream impact.

Session 2 is the real work. Creating robots.txt with rules for 11 AI user agents. Fixing the broken sitemap. Wiring Contentful SEO fields into Next.js generateMetadata so pages get their tags from the CMS instead of hardcoded strings. JSON-LD structured data everywhere.

Session 3 is polish. A proper 1200x630 OG image to replace the favicon. The llms.txt file. Google Search Console setup.

Contentful makes the content model part genuinely easy. The SEO type stays standalone, referenced by both Global Settings (for site defaults) and individual Field Notes (for per-page overrides). No Field Note SEO entry? Falls back to the global default. No OG Title? Falls back to the SEO Title. The content architecture handles all of this without frontend logic. That's the whole point of modeling your content properly. You set up the relationships once and they just work.


One more thing

Total Claude Code cost across all sessions since March 1: $7.50. 10.6 million tokens. 95% cache reads. Today's session was just an MCP fetch to pull the content structure for planning, but ccusage doesn't break it out individually.

The research and planning conversation in Claude? Bundled into the Max plan. Not trackable. But I can tell you it was many, many more tokens than that MCP fetch.