Purpose
The SEO Black Hole
You’ve built a beautiful portfolio with 50 projects, 30 tutorials, and 10 categories. You deploy to production. Then you check Google Search Console and see… nothing. Your pages aren’t being indexed.
The issue: Search engines don’t know your pages exist. Crawlers can discover pages by following links, but that is slow and misses anything that is not linked prominently. A sitemap closes the gap: it is an XML file that lists every URL on your site, optionally with metadata such as last-modified date, priority, and update frequency.
You could manually create sitemap.xml:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://yoursite.com/projects/project-1</loc>
<priority>0.8</priority>
</url>
<url>
<loc>https://yoursite.com/projects/project-2</loc>
<priority>0.8</priority>
</url>
<!-- ... 48 more projects -->
</urlset>
But this creates three problems:
- Maintenance Nightmare: Every new project requires manually editing XML
- Stale Data: Last modified dates are wrong or missing
- Human Error: One typo breaks the entire sitemap
The Core Problem: Content is dynamic (stored in Markdown files), but sitemaps are static (XML files). We need to generate sitemaps programmatically from our content at build time.
API Routes with Content Collection Queries
The code we’re analyzing (src/pages/sitemap.xml.ts) implements a dynamic sitemap generator that:
- Queries all content collections (projects, tutorials, categories)
- Filters out draft content
- Generates XML with proper priorities and change frequencies
- Includes last modified dates from frontmatter
- Serves the result as an API endpoint
This is the same pattern used by:
- WordPress (automatic sitemap generation)
- Next.js (next-sitemap plugin)
- Gatsby (gatsby-plugin-sitemap)
Understanding Build-Time Data Fetching
This tutorial demonstrates three advanced concepts:
- API Routes: Creating endpoints that return non-HTML responses
- XML Generation: Programmatically building valid XML documents
- SEO Optimization: Understanding search engine crawling behavior
🔵 Deep Dive: Sitemaps are part of the Sitemaps Protocol (sitemaps.org), a standard supported by Google, Bing, Yahoo, and other search engines. A well-formed sitemap helps crawlers discover new and updated pages sooner, although how much it helps depends on the site and the search engine.
Prerequisites & Tooling
Knowledge Base
Required:
- TypeScript/JavaScript basics
- Understanding of XML structure
- Familiarity with Astro Content Collections
- Basic SEO concepts (what search engines do)
Helpful:
- Experience with API routes
- Understanding of HTTP headers
- Knowledge of sitemap protocols
Environment
From the project’s configuration:
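// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://jasontran.pages.dev', // Required: this becomes `site` in the GET handler
  integrations: [sitemap()], // Official @astrojs/sitemap integration (emits sitemap-index.xml, separate from the custom route)
});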
Key Concepts:
- API Route: A file in src/pages/ that exports a function instead of a component
- GET Handler: Function that handles HTTP GET requests
- Content Collections: Astro’s type-safe content management system
- XML: Extensible Markup Language for structured data
Testing Your Sitemap
# Build the site
npm run build
# Preview locally
npm run preview
# Visit the sitemap
curl http://localhost:4321/sitemap.xml
# Validate XML
xmllint --noout dist/sitemap.xml # Linux/Mac (checks the file written by the build)
# Or use online validators: https://www.xml-sitemaps.com/validate-xml-sitemap.html
High-Level Architecture
Sitemap Generation Flow
graph TB
A[Build Process Starts] --> B[Astro Processes sitemap.xml.ts]
B --> C[GET Handler Executes]
C --> D[Query Content Collections]
D --> E[getCollection: projects]
D --> F[getCollection: tutorials]
D --> G[getCollection: categories]
E --> H[Filter Drafts]
F --> H
G --> H
H --> I[Generate XML String]
I --> J[Static Pages URLs]
I --> K[Project URLs + Dates]
I --> L[Tutorial URLs + Dates]
I --> M[Category URLs]
J --> N[Combine All URLs]
K --> N
L --> N
M --> N
N --> O[Return Response with XML Headers]
O --> P[sitemap.xml Available at Build]
style C fill:#a855f7
style I fill:#10b981
style O fill:#f59e0b
The Phone Book
Think of a sitemap as a phone book for search engines:
| Phone Book | Sitemap |
|---|---|
| Names & numbers | URLs |
| Alphabetical order | Priority ranking |
| "Updated 2024" | Last modified dates |
| Business vs. Residential | Page types (static vs. dynamic) |
| Yellow pages sections | Categories |
When a search engine visits your site, it first checks the phone book (sitemap) to understand:
- What pages exist
- Which are most important
- When they were last updated
- How often they change
The Three-Phase Architecture
Phase 1: Data Collection (Build Time)
├─ Query all content collections
├─ Filter out drafts
└─ Extract metadata (dates, slugs)
Phase 2: XML Generation (String Building)
├─ Create XML header
├─ Add static pages
├─ Add dynamic content pages
└─ Close XML structure
Phase 3: Response Delivery (HTTP)
├─ Set Content-Type header
├─ Return XML string
└─ Cache at CDN edge
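The three phases can be sketched as three small functions. This is a minimal illustration, assuming a hypothetical in-memory `ContentEntry` shape in place of Astro's content collections; the names `collectEntries`, `buildXml`, and `toResponseParts` are made up for this sketch and do not come from the repository. XML escaping is left out here to keep the phases visible.

```typescript
// Hypothetical stand-in for a content collection entry (not Astro's real type).
interface ContentEntry {
  id: string;
  draft?: boolean;
  publishDate: Date;
}

interface SitemapUrl {
  loc: string;
  lastmod?: string;
}

// Phase 1: data collection. Drop drafts and extract the metadata we need.
function collectEntries(siteUrl: string, section: string, entries: ContentEntry[]): SitemapUrl[] {
  return entries
    .filter((entry) => !entry.draft)
    .map((entry) => ({
      loc: `${siteUrl}${section}/${entry.id}`,
      lastmod: entry.publishDate.toISOString(),
    }));
}

// Phase 2: XML generation. Header, one <url> per entry, closing tag.
function buildXml(urls: SitemapUrl[]): string {
  const body = urls
    .map((u) => {
      const lastmod = u.lastmod ? `<lastmod>${u.lastmod}</lastmod>` : '';
      return `<url><loc>${u.loc}</loc>${lastmod}</url>`;
    })
    .join('');
  return `<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${body}</urlset>`;
}

// Phase 3: response delivery. The body plus the header a crawler needs.
function toResponseParts(xml: string): { body: string; headers: Record<string, string> } {
  return { body: xml, headers: { 'Content-Type': 'application/xml; charset=utf-8' } };
}

const sampleEntries: ContentEntry[] = [
  { id: 'simple-router', publishDate: new Date(Date.UTC(2024, 0, 15)) },
  { id: 'work-in-progress', draft: true, publishDate: new Date(Date.UTC(2024, 1, 1)) },
];
const sampleUrls = collectEntries('https://example.com/', 'projects', sampleEntries);
const sampleResponse = toResponseParts(buildXml(sampleUrls));
```

Keeping the phases separate makes each one testable without running a build: only the first phase would need to touch Astro's APIs.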
The Implementation
Defining the API Route
Naive Approach: Static XML File
<!-- public/sitemap.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://yoursite.com/</loc>
</url>
</urlset>
Why This Fails: Every new project requires manually editing this file. No automation.
Refined Solution (From Repo):
// src/pages/sitemap.xml.ts
import type { APIRoute } from 'astro';
export const GET: APIRoute = async ({ site }) => {
// This function runs at BUILD TIME
// It generates the sitemap dynamically
// URL.toString() ends with "/", so the fallback needs a trailing slash too
const siteUrl = site?.toString() || 'https://yoursite.com/';
// ... generate XML
return new Response(xmlString, {
headers: {
'Content-Type': 'application/xml; charset=utf-8',
},
});
};
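🔴 Danger: The filename must be sitemap.xml.ts (or .js). Astro strips the final .ts extension and uses the rest as the output path, so the route is served at /sitemap.xml. The Content-Type header on the Response is what marks the body as XML.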
Querying Content Collections
The Challenge: Get all published content from multiple collections.
import { getCollection } from 'astro:content';
// Get all published projects (filter out drafts)
const projects = await getCollection('projects', ({ data }) => !data.draft);
// Get all published tutorials
const tutorials = await getCollection('tutorials', ({ data }) => !data.draft);
// Get all categories (no draft field)
const categories = await getCollection('category');
Key Insights:
- Filter Function: The second argument to getCollection is a predicate:
  ({ data }) => !data.draft // Equivalent to: (entry) => entry.data.draft !== true
- Type Safety: TypeScript knows the shape of data based on your schema:
  projects[0].data.publishDate // Date (type-safe!)
  projects[0].data.title // string
  projects[0].id // string (filename without extension)
- Async Queries: getCollection is async because it reads from disk
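To make the predicate's behavior concrete, the sketch below applies the same `!data.draft` test to a plain array. The `Entry` shape and the sample data are hypothetical; the point is that an entry with no draft field at all counts as published.

```typescript
// Hypothetical entry shape; real entries come from getCollection().
interface Entry {
  id: string;
  data: { title: string; draft?: boolean };
}

// The same test the sitemap passes to getCollection as its second argument.
const isPublished = ({ data }: Entry): boolean => !data.draft;

const entries: Entry[] = [
  { id: 'published-explicitly', data: { title: 'A', draft: false } },
  { id: 'draft-post', data: { title: 'B', draft: true } },
  { id: 'no-draft-field', data: { title: 'C' } }, // undefined draft => published
];

const publishedIds = entries.filter(isPublished).map((entry) => entry.id);
// publishedIds is ['published-explicitly', 'no-draft-field']
```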
Defining Static Pages
Configuration Object Pattern:
const staticPages = [
{ url: '', changefreq: 'weekly', priority: 1.0 },
{ url: 'projects', changefreq: 'weekly', priority: 0.9 },
{ url: 'tutorials', changefreq: 'weekly', priority: 0.9 },
{ url: 'terminal', changefreq: 'monthly', priority: 0.8 },
{ url: 'about', changefreq: 'monthly', priority: 0.7 },
{ url: 'contact', changefreq: 'monthly', priority: 0.7 },
{ url: 'freelance', changefreq: 'monthly', priority: 0.8 },
];
Understanding the Fields:
- url: Path relative to site root (empty string = homepage)
- changefreq: How often the page changes
  - always: Changes every time it’s accessed (e.g., live data)
  - hourly: News sites
  - daily: Blogs
  - weekly: Project listings
  - monthly: About pages
  - yearly: Legal pages
  - never: Archived content
- priority: Relative importance (0.0 to 1.0)
  - 1.0: Homepage
  - 0.8-0.9: Main sections
  - 0.5-0.7: Individual pages
  - 0.0-0.4: Low-priority pages
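🔵 Deep Dive: changefreq is a hint, not a directive. Search engines may use it to tune crawl frequency but are free to ignore it. Google, for one, documents that it ignores both changefreq and priority and relies on lastmod when that value is consistently accurate.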
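Because these fields are plain strings and numbers, a typo such as `changefreq: 'fortnightly'` would pass straight into the XML. The sketch below shows one way to catch that before generation. The names `validateStaticPage` and `formatPriority` are hypothetical, and the leading-slash rule is an assumption based on how this tutorial joins `siteUrl` and `url`.

```typescript
// The seven values the Sitemaps protocol allows for <changefreq>.
const CHANGE_FREQS = ['always', 'hourly', 'daily', 'weekly', 'monthly', 'yearly', 'never'] as const;

interface PageConfig {
  url: string;
  changefreq: string;
  priority: number;
}

// Returns a list of human-readable problems; an empty list means the page is valid.
function validateStaticPage(page: PageConfig): string[] {
  const problems: string[] = [];
  if (!(CHANGE_FREQS as readonly string[]).includes(page.changefreq)) {
    problems.push(`unknown changefreq "${page.changefreq}"`);
  }
  // Written as a negated range check so that NaN is rejected as well.
  if (!(page.priority >= 0 && page.priority <= 1)) {
    problems.push(`priority ${page.priority} is outside 0.0-1.0`);
  }
  // Assumption: siteUrl already ends with "/", so a leading slash would double it.
  if (page.url.startsWith('/')) {
    problems.push(`url "${page.url}" should not start with a slash`);
  }
  return problems;
}

// Render with one decimal so that 1 becomes "1.0" instead of "1".
const formatPriority = (priority: number): string => priority.toFixed(1);

const homepageProblems = validateStaticPage({ url: '', changefreq: 'weekly', priority: 1.0 });
const brokenProblems = validateStaticPage({ url: '/about', changefreq: 'fortnightly', priority: 1.5 });
// homepageProblems is empty; brokenProblems has three entries
```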
Generating XML for Static Pages
Template Literal Pattern:
const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${staticPages.map(page => `
<url>
<loc>${siteUrl}${page.url}</loc>
<changefreq>${page.changefreq}</changefreq>
<priority>${page.priority}</priority>
</url>`).join('')}
</urlset>`;
Why Template Literals?
- Readability: XML structure is visible
- Interpolation: Easy to inject variables
- Multiline: No string concatenation
Alternative: XML Builder Library
// Using a library like 'xmlbuilder2'
import { create } from 'xmlbuilder2';
const root = create({ version: '1.0', encoding: 'UTF-8' })
.ele('urlset', { xmlns: 'http://www.sitemaps.org/schemas/sitemap/0.9' });
staticPages.forEach(page => {
root.ele('url')
.ele('loc').txt(`${siteUrl}${page.url}`).up()
.ele('changefreq').txt(page.changefreq).up()
.ele('priority').txt(page.priority.toString()).up();
});
const sitemap = root.end({ prettyPrint: true });
Comparison:
| Approach | Pros | Cons |
|---|---|---|
| Template Literals | Simple, no dependencies | Manual escaping, harder to validate |
| XML Builder | Type-safe, auto-escaping | Extra dependency, more verbose |
For sitemaps (simple structure, trusted data), template literals are sufficient.
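A middle ground between the two approaches is a small helper that keeps template literals but centralizes escaping and optional fields. The sketch below is illustrative only; `UrlEntry`, `escapeXml`, and `renderUrl` are hypothetical names, not code from the repo.

```typescript
interface UrlEntry {
  loc: string;
  lastmod?: Date;
  changefreq?: string;
  priority?: number;
}

// Escape the five characters XML reserves; "&" goes first to avoid double-escaping.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Render one <url> element, skipping any field that was not provided.
function renderUrl(entry: UrlEntry): string {
  const parts = [`<loc>${escapeXml(entry.loc)}</loc>`];
  if (entry.lastmod) parts.push(`<lastmod>${entry.lastmod.toISOString()}</lastmod>`);
  if (entry.changefreq) parts.push(`<changefreq>${escapeXml(entry.changefreq)}</changefreq>`);
  if (entry.priority !== undefined) parts.push(`<priority>${entry.priority.toFixed(1)}</priority>`);
  return `<url>${parts.join('')}</url>`;
}

const rendered = renderUrl({ loc: 'https://example.com/search?q=a&b=c', priority: 0.5 });
// rendered === '<url><loc>https://example.com/search?q=a&amp;b=c</loc><priority>0.5</priority></url>'
```

With a helper like this, each collection's block reduces to `entries.map(renderUrl).join('')`, and escaping cannot be forgotten in one place but applied in another.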
Adding Dynamic Content Pages
Projects with Last Modified Dates:
${projects.map(project => `
<url>
<loc>${siteUrl}projects/${project.id}</loc>
<lastmod>${project.data.publishDate.toISOString()}</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>`).join('')}
Key Details:
- URL Construction: ${siteUrl}projects/${project.id}
  - project.id is the filename without extension
  - For src/content/projects/simple-router.mdx, id = "simple-router"
- Date Formatting: toISOString()
  - Converts Date to ISO 8601 format: "2024-01-15T10:30:00.000Z"
  - This is a valid W3C Datetime value, the format <lastmod> expects (a plain YYYY-MM-DD date is also accepted)
- Priority Logic: Projects get 0.8 (high priority, but below main sections)
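The `<lastmod>` tag accepts W3C Datetime values, which cover both a full timestamp and a date-only form. The sketch below shows both, derived from the same `Date`; the helper names are hypothetical, and the date-only version uses the UTC calendar date.

```typescript
// Full timestamp, as used in the sitemap code in this section.
const toFullTimestamp = (date: Date): string => date.toISOString();

// Date-only form (YYYY-MM-DD), sliced from the UTC timestamp.
const toDateOnly = (date: Date): string => date.toISOString().slice(0, 10);

const published = new Date(Date.UTC(2024, 0, 15, 10, 30, 0));
// toFullTimestamp(published) === '2024-01-15T10:30:00.000Z'
// toDateOnly(published) === '2024-01-15'
```

The date-only form may be the more honest choice when frontmatter records only a day, since the time portion would otherwise be an artifact of parsing.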
Tutorials (Similar Pattern):
${tutorials.map(tutorial => `
<url>
<loc>${siteUrl}tutorials/${tutorial.id}</loc>
<lastmod>${tutorial.data.publishDate.toISOString()}</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>`).join('')}
Categories (No Last Modified):
${categories.map(category => `
<url>
<loc>${siteUrl}category/${category.data.slug || category.id}</loc>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>`).join('')}
🔴 Danger: Notice category.data.slug || category.id. This handles cases where the category has a custom slug. Always provide fallbacks for optional fields.
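The choice of `||` over `??` matters for this fallback. A minimal sketch of the difference, using a hypothetical `Category` shape:

```typescript
interface Category {
  id: string;
  data: { slug?: string };
}

// `||` falls back on any falsy slug, including the empty string.
const slugWithOr = (c: Category): string => c.data.slug || c.id;

// `??` falls back only when slug is null or undefined.
const slugWithNullish = (c: Category): string => c.data.slug ?? c.id;

const custom: Category = { id: 'web-dev', data: { slug: 'web' } };
const missing: Category = { id: 'web-dev', data: {} };
const empty: Category = { id: 'web-dev', data: { slug: '' } };
// slugWithOr(empty) === 'web-dev', but slugWithNullish(empty) === ''
```

For URL segments `||` is arguably the safer operator, because an empty slug would otherwise produce a path that ends in `category/`.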
Handling Edge Cases in the Repo
The Repo’s Approach (With Bug):
${categories.filter(c => c.data.type === 'project').map(category => `
<url>
<loc>${siteUrl}projects/category/${category.data.slug || category.id}</loc>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>`).join('')}
${categories.filter(c => c.data.type === 'project').map(category => `
<url>
<loc>${siteUrl}freelance/${category.data.slug || category.id}</loc>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>`).join('')}
Issue: The code filters for c.data.type === 'project', but the category schema doesn’t define a type field. This will return empty arrays.
Fixed Version:
// Remove the filter or add 'type' to category schema
${categories.map(category => `
<url>
<loc>${siteUrl}category/${category.data.slug || category.id}</loc>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>`).join('')}
Returning the Response
return new Response(sitemap, {
headers: {
'Content-Type': 'application/xml; charset=utf-8',
},
});
Critical Headers:
- Content-Type: application/xml: Tells browsers/crawlers this is XML
- charset=utf-8: Ensures proper encoding for international characters
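Optional Headers (Production Enhancement; these take effect only when the route is rendered on demand, because a static build saves just the response body to disk):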
return new Response(sitemap, {
headers: {
'Content-Type': 'application/xml; charset=utf-8',
'Cache-Control': 'public, max-age=3600', // Cache for 1 hour
'X-Robots-Tag': 'noindex', // Don't index the sitemap itself
},
});
Complete Implementation
Here’s the full sitemap generator from the repository (with fixes):
import type { APIRoute } from 'astro';
import { getCollection } from 'astro:content';
export const GET: APIRoute = async ({ site }) => {
// URL.toString() ends with "/", so the fallback needs a trailing slash too
const siteUrl = site?.toString() || 'https://yoursite.com/';
// Get all published content
const projects = await getCollection('projects', ({ data }) => !data.draft);
const tutorials = await getCollection('tutorials', ({ data }) => !data.draft);
const categories = await getCollection('category');
// Static pages with priorities
const staticPages = [
{ url: '', changefreq: 'weekly', priority: 1.0 },
{ url: 'projects', changefreq: 'weekly', priority: 0.9 },
{ url: 'tutorials', changefreq: 'weekly', priority: 0.9 },
{ url: 'terminal', changefreq: 'monthly', priority: 0.8 },
{ url: 'about', changefreq: 'monthly', priority: 0.7 },
{ url: 'contact', changefreq: 'monthly', priority: 0.7 },
{ url: 'freelance', changefreq: 'monthly', priority: 0.8 },
];
// Generate sitemap XML
const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${staticPages.map(page => `
<url>
<loc>${siteUrl}${page.url}</loc>
<changefreq>${page.changefreq}</changefreq>
<priority>${page.priority}</priority>
</url>`).join('')}
${projects.map(project => `
<url>
<loc>${siteUrl}projects/${project.id}</loc>
<lastmod>${project.data.publishDate.toISOString()}</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>`).join('')}
${tutorials.map(tutorial => `
<url>
<loc>${siteUrl}tutorials/${tutorial.id}</loc>
<lastmod>${tutorial.data.publishDate.toISOString()}</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>`).join('')}
${categories.map(category => `
<url>
<loc>${siteUrl}category/${category.data.slug || category.id}</loc>
<changefreq>weekly</changefreq>
<priority>0.7</priority>
</url>`).join('')}
</urlset>`;
return new Response(sitemap, {
headers: {
'Content-Type': 'application/xml; charset=utf-8',
},
});
};
Under the Hood
Build-Time Execution
When Does This Run?
npm run build
↓
Astro processes all pages
↓
Finds sitemap.xml.ts
↓
Executes GET function
↓
Queries content collections (reads from disk)
↓
Generates XML string
↓
Writes to dist/sitemap.xml
↓
File is served statically
Performance Characteristics:
For a site with 100 projects + 50 tutorials (rough, illustrative estimates rather than measurements):
- Content collection queries: ~50ms
- XML string generation: ~10ms
- File write: ~5ms
- Total: ~65ms (one-time build cost)
Once built, the sitemap is a static file served from CDN with zero runtime cost.
Memory Efficiency
String Concatenation Analysis:
${projects.map(project => `...`).join('')}
What Happens:
- map() creates an array of strings: ["<url>...</url>", "<url>...</url>", ...]
- join('') concatenates them into one string
Memory Usage:
- 100 projects × ~200 bytes per URL = ~20KB
- Temporary array: ~20KB
- Final string: ~20KB
- Peak memory: ~40KB
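The arithmetic above can be checked with a quick measurement. This sketch counts characters rather than bytes (the two match for ASCII-only output, and the engine's real memory use will differ), and it assumes a hypothetical `renderEntry` that mirrors the project template.

```typescript
// Build one <url> block the way the project template literal does.
const renderEntry = (siteUrl: string, id: string): string => `
<url>
<loc>${siteUrl}projects/${id}</loc>
<lastmod>2024-01-15T10:30:00.000Z</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>`;

interface SizeReport {
  arrayChars: number;  // total size of the temporary array from map()
  joinedChars: number; // size of the final string from join('')
  peakChars: number;   // both exist at once right after join('')
}

function measure(count: number): SizeReport {
  const pieces = Array.from({ length: count }, (_, i) =>
    renderEntry('https://example.com/', `project-${i}`)
  );
  const arrayChars = pieces.reduce((sum, piece) => sum + piece.length, 0);
  const joinedChars = pieces.join('').length;
  return { arrayChars, joinedChars, peakChars: arrayChars + joinedChars };
}

const report = measure(100);
const perEntry = report.joinedChars / 100; // in the same ballpark as the ~200 bytes assumed above
```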
Alternative (Streaming):
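For very large sites (10,000+ pages) that serve the sitemap on demand, consider streaming:
import type { APIRoute } from 'astro';
import { getCollection } from 'astro:content';

export const GET: APIRoute = async ({ site }) => {
  const encoder = new TextEncoder(); // Response streams carry bytes, not strings
  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      controller.enqueue(encoder.encode(
        '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
      ));
      const projects = await getCollection('projects');
      for (const project of projects) {
        controller.enqueue(encoder.encode(`<url><loc>${site}projects/${project.id}</loc></url>`));
      }
      controller.enqueue(encoder.encode('</urlset>'));
      controller.close();
    }
  });
  return new Response(stream, {
    headers: { 'Content-Type': 'application/xml; charset=utf-8' }
  });
};
This avoids building the whole document as one string. It is not truly constant-memory, because getCollection still loads every entry, and a static build gains nothing from it since the output is written to disk once.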
XML Escaping
The Hidden Danger:
<loc>${siteUrl}projects/${project.id}</loc>
What if project.id contains special XML characters?
project.id = "my-project-&-tutorial"
Result: <loc>...my-project-&-tutorial</loc> // INVALID XML!
Proper Escaping:
function escapeXml(unsafe: string): string {
  return unsafe
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}
<loc>${escapeXml(siteUrl)}projects/${escapeXml(project.id)}</loc>
🔴 Danger: The repo code doesn’t escape XML. This works because Astro’s content collection IDs are filename-based (alphanumeric + hyphens), but it’s a latent bug.
Edge Cases & Pitfalls
Missing Site URL
Problem: site is undefined if not configured in astro.config.mjs.
Current Behavior: Falls back to a hard-coded placeholder domain, so the build succeeds but every URL in the sitemap points at the wrong site.
Better Approach: Fail fast during build:
export const GET: APIRoute = async ({ site }) => {
if (!site) {
throw new Error('site URL must be configured in astro.config.mjs');
}
const siteUrl = site.toString();
// ...
};
Duplicate URLs
Problem: Two content entries that resolve to the same path produce duplicate URLs. The same ID in different collections is not a problem, because the paths differ:
projects/my-guide.mdx → /projects/my-guide
tutorials/my-guide.mdx → /tutorials/my-guide
The real risk is two files in one collection that differ only by extension:
projects/my-guide.mdx → /projects/my-guide
projects/my-guide.md → /projects/my-guide // DUPLICATE!
Solution: Validate uniqueness:
const allUrls = new Set<string>();
projects.forEach(project => {
const url = `${siteUrl}projects/${project.id}`;
if (allUrls.has(url)) {
throw new Error(`Duplicate URL: ${url}`);
}
allUrls.add(url);
});
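The same idea can be extended to every collection at once and made to report all collisions rather than stopping at the first. A minimal sketch, with `findDuplicateUrls` as a hypothetical helper name:

```typescript
// Returns every URL that appears more than once, in first-seen order.
function findDuplicateUrls(urls: string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const url of urls) {
    if (seen.has(url)) duplicates.add(url);
    seen.add(url);
  }
  return Array.from(duplicates);
}

const candidateUrls = [
  'https://example.com/projects/my-guide',
  'https://example.com/tutorials/my-guide', // same id, different path: fine
  'https://example.com/projects/my-guide',  // true duplicate
];
const duplicateUrls = findDuplicateUrls(candidateUrls);
// duplicateUrls is ['https://example.com/projects/my-guide']
```

Throwing when the returned list is non-empty would then fail the build with a complete list of offenders.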
Invalid Dates
Problem: publishDate.toISOString() throws if the date is invalid.
---
publishDate: "not a date"
---
Protection: Zod schema validation catches this at build time, but a runtime check is cheap insurance. An invalid Date is still an instance of Date, so the check must test getTime() as well:
${projects.map(project => {
  const date = project.data.publishDate;
  const isValid = date instanceof Date && !Number.isNaN(date.getTime());
  const lastmod = isValid
    ? date.toISOString()
    : new Date().toISOString(); // Fallback to now
  return `<url>
<loc>${siteUrl}projects/${project.id}</loc>
<lastmod>${lastmod}</lastmod>
</url>`;
}).join('')}
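Falling back to the current time tells crawlers the page has just changed, which may not be what you want. Since `<lastmod>` is optional in the Sitemaps protocol, an alternative is to omit the tag when the date is unusable. A minimal sketch, with `lastmodTag` as a hypothetical helper:

```typescript
// Returns a <lastmod> tag for a valid Date, or an empty string otherwise.
function lastmodTag(value: unknown): string {
  if (value instanceof Date && !Number.isNaN(value.getTime())) {
    return `<lastmod>${value.toISOString()}</lastmod>`;
  }
  return '';
}

const validTag = lastmodTag(new Date(Date.UTC(2024, 0, 15)));
const invalidTag = lastmodTag(new Date('not a date')); // Invalid Date: getTime() is NaN
const wrongTypeTag = lastmodTag('2024-01-15');         // a string, not a Date
// validTag === '<lastmod>2024-01-15T00:00:00.000Z</lastmod>'; the other two are ''
```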
Trailing Slashes
Problem: Inconsistent trailing slashes confuse search engines.
https://yoursite.com/projects // No slash
https://yoursite.com/projects/ // With slash
These are treated as different URLs by search engines.
Solution: Normalize in config:
// astro.config.mjs
export default defineConfig({
site: 'https://jasontran.pages.dev',
trailingSlash: 'never', // or 'always' or 'ignore'
});
Then ensure sitemap matches:
const normalizeUrl = (url: string) => {
// Remove trailing slash if trailingSlash: 'never'
return url.replace(/\/$/, '');
};
<loc>${normalizeUrl(`${siteUrl}projects/${project.id}`)}</loc>
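A slightly more general version can take the policy as a parameter, so the sitemap follows whichever `trailingSlash` value the config uses. This is a sketch under the assumption that only 'always' and 'never' need handling; `buildUrl` is a hypothetical name.

```typescript
type TrailingSlashPolicy = 'always' | 'never';

// Join a base URL and a path, then apply the trailing-slash policy.
function buildUrl(base: string, path: string, policy: TrailingSlashPolicy): string {
  const cleanBase = base.replace(/\/+$/, '');      // drop trailing slashes from the base
  const cleanPath = path.replace(/^\/+|\/+$/g, ''); // drop slashes on both ends of the path
  const joined = `${cleanBase}/${cleanPath}`.replace(/\/+$/, '');
  return policy === 'always' ? `${joined}/` : joined;
}

const withoutSlash = buildUrl('https://example.com/', 'projects/simple-router', 'never');
const withSlash = buildUrl('https://example.com/', 'projects/simple-router', 'always');
// withoutSlash === 'https://example.com/projects/simple-router'
// withSlash === 'https://example.com/projects/simple-router/'
```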
Forgetting robots.txt
Problem: Sitemap exists, but search engines don’t know where to find it.
Solution: Create public/robots.txt:
User-agent: *
Allow: /
Sitemap: https://jasontran.pages.dev/sitemap.xml
This tells crawlers where the sitemap is located.
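If hard-coding the domain in a static file is a concern, the same endpoint pattern used for the sitemap could generate robots.txt from the configured site URL. The sketch below covers only the string-building part; `buildRobotsTxt` is a hypothetical helper, and wiring it into a route such as src/pages/robots.txt.ts with a text/plain Response is an assumption, not something the repo does.

```typescript
// Build robots.txt content whose Sitemap line always matches the given site URL.
function buildRobotsTxt(siteUrl: string): string {
  const base = siteUrl.endsWith('/') ? siteUrl : `${siteUrl}/`;
  return ['User-agent: *', 'Allow: /', `Sitemap: ${base}sitemap.xml`, ''].join('\n');
}

const robotsTxt = buildRobotsTxt('https://jasontran.pages.dev');
// robotsTxt ends with 'Sitemap: https://jasontran.pages.dev/sitemap.xml\n'
```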
Not Submitting to Search Engines
Problem: Sitemap exists, but you never told Google about it.
Solution: Submit to search consoles:
- Google Search Console: https://search.google.com/search-console
  - Add property → Verify ownership → Sitemaps → Submit sitemap URL
- Bing Webmaster Tools: https://www.bing.com/webmasters
  - Similar process
- Automatic Discovery: Add to <head>:
  <link rel="sitemap" type="application/xml" href="/sitemap.xml" />
Conclusion
Skills Acquired
You’ve learned:
- API Routes: Creating non-HTML endpoints in Astro
- XML Generation: Programmatically building valid XML documents
- Content Queries: Fetching and filtering content collections
- SEO Optimization: Understanding search engine crawling behavior
- Build-Time Generation: Computing data once at build time for static serving
The Proficiency Marker: Most developers use sitemap plugins without understanding how they work. You now understand sitemaps as programmatically generated indexes that bridge the gap between dynamic content and search engine expectations. This mental model transfers to:
- RSS feed generation
- API documentation generation (OpenAPI/Swagger)
- Static site generation patterns
- Build-time optimization strategies
Extending the Sitemap
Adding Image Sitemaps (declare xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" on <urlset> first):
${projects.map(project => `
<url>
<loc>${siteUrl}projects/${project.id}</loc>
<lastmod>${project.data.publishDate.toISOString()}</lastmod>
<image:image>
<image:loc>${project.data.coverImage}</image:loc>
<image:title>${escapeXml(project.data.title)}</image:title>
</image:image>
</url>`).join('')}
Adding Video Sitemaps (declare xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" on <urlset> first):
${projects.filter(p => p.data.videoUrls).map(project => `
<url>
<loc>${siteUrl}projects/${project.id}</loc>
<video:video>
<video:thumbnail_loc>${project.data.coverImage}</video:thumbnail_loc>
<video:title>${escapeXml(project.data.title)}</video:title>
<video:description>${escapeXml(project.data.description)}</video:description>
<video:content_loc>${project.data.videoUrls[0]}</video:content_loc>
</video:video>
</url>`).join('')}
Adding News Sitemaps (declare xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" on <urlset> first):
// For time-sensitive content
${recentPosts.map(post => `
<url>
<loc>${siteUrl}blog/${post.id}</loc>
<news:news>
<news:publication>
<news:name>Your Site Name</news:name>
<news:language>en</news:language>
</news:publication>
<news:publication_date>${post.data.publishDate.toISOString()}</news:publication_date>
<news:title>${escapeXml(post.data.title)}</news:title>
</news:news>
</url>`).join('')}
Next Challenge: Implement a sitemap index that splits a large sitemap into multiple files. The Sitemap Protocol limits a single sitemap file to 50,000 URLs and 50 MB uncompressed, so larger sites must split.
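As a starting point for that challenge, the sketch below splits a URL list into chunks and builds the index document that points at each child file. The child file naming (sitemap-0.xml, sitemap-1.xml, ...) and the helper names are assumptions; writing the child files and serving them is left open.

```typescript
const MAX_URLS_PER_SITEMAP = 50000; // per-file limit from the Sitemaps protocol

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Build the index document that references each child sitemap file.
// Assumes siteUrl ends with "/" and children are named sitemap-<n>.xml.
function buildSitemapIndex(siteUrl: string, chunkCount: number): string {
  const entries = Array.from({ length: chunkCount }, (_, i) =>
    `<sitemap><loc>${siteUrl}sitemap-${i}.xml</loc></sitemap>`
  ).join('');
  return `<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${entries}</sitemapindex>`;
}

const allPageUrls = Array.from({ length: 120000 }, (_, i) => `https://example.com/page-${i}`);
const urlGroups = chunk(allPageUrls, MAX_URLS_PER_SITEMAP);
const indexXml = buildSitemapIndex('https://example.com/', urlGroups.length);
// urlGroups has 3 chunks: 50,000 + 50,000 + 20,000 URLs
```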