n8n XML Sitemap Automation: Parse, Index, and Monitor Your SEO at Scale
Your sitemap is the roadmap search engines use to discover your content, but most SEO professionals never verify if Google is actually following it. For agencies managing dozens of client sites, this blind spot compounds into thousands of potentially unindexed pages, missed competitor content changes, and ranking opportunities slipping through the cracks.
Manual sitemap monitoring does not scale. Checking each client’s sitemap for errors, tracking competitor content updates, and submitting URLs to indexing APIs one by one consumes hours that should go toward strategy. The data exists in structured XML format, but extracting actionable intelligence from it requires automation.
The Agency Sitemap Problem
Consider the typical digital agency scenario:
- 50+ client websites each with their own sitemap
- Competitors publishing daily while you check manually once a month
- Indexing delays costing clients rankings on time-sensitive content
- Broken links accumulating in sitemaps nobody monitors
- No systematic way to track what changed across your portfolio
Search engines provide APIs to accelerate indexing. Sitemaps follow a standard format. The infrastructure for automation exists. What’s missing is the workflow to connect these pieces.
What You’ll Learn
This guide covers everything SEO professionals need to automate sitemap operations at agency scale:
- Sitemap parsing fundamentals including nested sitemap indexes and gzipped files
- Google Indexing API workflows with service account setup and quota management
- Bing IndexNow integration for multi-engine indexing in a single workflow
- Competitor sitemap monitoring to detect content changes automatically
- Multi-client health checks that validate URLs across your entire portfolio
- Agency-scale patterns for managing 50+ sites from centralized workflows
- Complete workflow examples ready to import and customize
Understanding XML Sitemaps
Before automating, you need to understand what you’re parsing. XML sitemaps follow the Sitemap Protocol standard, which defines a structured format search engines universally support.
Sitemap Structure
A basic sitemap contains URL entries with metadata:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/page-2</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
| Element | Required | Purpose |
|---|---|---|
| <loc> | Yes | Full URL of the page |
| <lastmod> | No | Last modification date (ISO 8601 format) |
| <changefreq> | No | Expected change frequency hint |
| <priority> | No | Relative importance (0.0 to 1.0) |
For automation, <loc> and <lastmod> matter most. The <loc> gives you URLs to process, while <lastmod> lets you filter for recently changed content.
Sitemap Index Files
Large sites split their sitemap across multiple files using a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-14</lastmod>
  </sitemap>
</sitemapindex>
Your workflows need to detect whether you’re dealing with a <urlset> (actual URLs) or a <sitemapindex> (references to other sitemaps). This distinction determines your processing logic.
Why Automation Matters for SEO
According to Google’s documentation, sitemaps help search engines discover pages they might otherwise miss through regular crawling, but submission alone does not guarantee indexing.
Automation enables:
- Faster indexing by submitting new content to indexing APIs immediately
- Proactive monitoring that catches broken URLs before they impact rankings
- Competitive intelligence through systematic tracking of competitor content
- Portfolio-wide visibility across all client sites from one dashboard
- Consistent execution that doesn’t depend on remembering to check manually
Parsing Sitemaps with n8n
The foundation of every sitemap workflow is fetching and converting XML to a format n8n can process. This pattern applies whether you’re indexing, monitoring, or auditing.
Basic Sitemap Parser Workflow
Manual Trigger → HTTP Request → XML to JSON → Split Out → Process
Step 1: Fetch the Sitemap
Use the HTTP Request node to retrieve the sitemap:
- Method: GET
- URL: https://example.com/sitemap.xml
- Response Format: Text (not JSON)
The response comes back as an XML string in the data field.
Step 2: Convert XML to JSON
Add the XML node to transform the structure:
- Mode: XML to JSON
- Property Name: data
- Explicit Array: On (keeps structure consistent)
The converted output looks like:
{
  "urlset": {
    "$": { "xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9" },
    "url": [
      {
        "loc": ["https://example.com/page-1"],
        "lastmod": ["2024-01-15"]
      },
      {
        "loc": ["https://example.com/page-2"],
        "lastmod": ["2024-01-10"]
      }
    ]
  }
}
Step 3: Extract URLs
Use a Code node to flatten the structure:
// Get all incoming items from previous node
const items = $input.all();
const sitemap = items[0].json;
// Initialize empty array to store extracted URLs
let urls = [];
// Check which type of sitemap we received
if (sitemap.urlset) {
// Regular sitemap: extract each URL and its last modified date
// Note: XML converts to arrays, so we access [0] for the first value
urls = sitemap.urlset.url.map(entry => ({
loc: entry.loc[0],
lastmod: entry.lastmod ? entry.lastmod[0] : null
}));
} else if (sitemap.sitemapindex) {
// Sitemap index: extract links to child sitemaps (not actual page URLs)
// Mark these with type: 'sitemap' so we know to fetch them next
urls = sitemap.sitemapindex.sitemap.map(entry => ({
loc: entry.loc[0],
lastmod: entry.lastmod ? entry.lastmod[0] : null,
type: 'sitemap'
}));
}
// Return each URL as a separate item for downstream processing
return urls.map(url => ({ json: url }));
This code handles both regular sitemaps and sitemap index files, outputting clean objects with loc and lastmod fields.
Handling Sitemap Index Files
When you encounter a sitemap index, you need to recursively fetch each child sitemap. Here’s the pattern:
HTTP Request (main sitemap) → XML to JSON → Code (detect type) →
IF sitemap index → Loop → HTTP Request (child) → XML to JSON → Merge
IF urlset → Continue processing
Detection logic in Code node:
// Get the parsed sitemap from previous node
const sitemap = $input.first().json;
// Detect type: sitemapindex exists only in index files
const isSitemapIndex = Boolean(sitemap.sitemapindex);
// Pass through original data plus detection results
return [{
json: {
...sitemap, // Keep all original data
isSitemapIndex, // Flag for IF node branching
childSitemaps: isSitemapIndex // List of child sitemap URLs to fetch
? sitemap.sitemapindex.sitemap.map(s => s.loc[0])
: []
}
}];
Use an IF node to branch based on isSitemapIndex, then loop through child sitemaps when needed.
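Once the child sitemaps have been fetched and converted to JSON, their URL entries need to be merged back into a single list. Here is a minimal sketch of that merge step, assuming it runs in a Code node placed after the child-sitemap XML to JSON node and that each incoming item holds one parsed child sitemap:
// Merge the <urlset> entries from every child sitemap into one flat URL list
const merged = [];
for (const item of $input.all()) {
  const urlset = item.json.urlset;
  if (!urlset || !urlset.url) continue; // skip items that are not url sets
  for (const entry of urlset.url) {
    merged.push({
      loc: entry.loc[0],
      lastmod: entry.lastmod ? entry.lastmod[0] : null
    });
  }
}
// One item per URL for downstream processing
return merged.map(url => ({ json: url }));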
Processing Gzipped Sitemaps
Many large sites serve compressed sitemaps (.xml.gz). The HTTP Request node handles this automatically when you:
- Request the .xml.gz URL directly
- Set Response Format to Text
- n8n decompresses automatically based on Content-Encoding headers
If automatic decompression fails, check whether the server sends the proper Content-Encoding header; some CDNs misconfigure compression.
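If the file arrives still compressed, one fallback is to fetch it as binary and decompress it in a Code node. This is only a sketch: it assumes a self-hosted instance where the Code node may require Node's built-in zlib (governed by the NODE_FUNCTION_ALLOW_BUILTIN environment variable) and that the HTTP Request node was set to return the response as a file:
// Decompress a gzipped sitemap that was fetched as binary data
const zlib = require('zlib');
// Read the binary property written by the HTTP Request node (default property name: "data")
const buffer = await this.helpers.getBinaryDataBuffer(0, 'data');
const xml = zlib.gunzipSync(buffer).toString('utf-8');
// Hand the decompressed XML string downstream in the same shape as a text response
return [{ json: { data: xml } }];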
Complete Basic Parser Workflow
Here’s a ready-to-import workflow for basic sitemap parsing:
{
  "name": "Basic Sitemap Parser",
  "nodes": [
    {
      "parameters": {},
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "position": [250, 300]
    },
    {
      "parameters": {
        "url": "={{ $json.sitemapUrl || 'https://example.com/sitemap.xml' }}",
        "options": {
          "response": { "response": { "responseFormat": "text" } }
        }
      },
      "name": "Fetch Sitemap",
      "type": "n8n-nodes-base.httpRequest",
      "position": [450, 300]
    },
    {
      "parameters": {
        "mode": "xmlToJson",
        "options": {}
      },
      "name": "XML to JSON",
      "type": "n8n-nodes-base.xml",
      "position": [650, 300]
    },
    {
      "parameters": {
        "jsCode": "const sitemap = $input.first().json;\nlet urls = [];\n\nif (sitemap.urlset) {\n urls = sitemap.urlset.url.map(entry => ({\n loc: entry.loc[0],\n lastmod: entry.lastmod ? entry.lastmod[0] : null\n }));\n}\n\nreturn urls.map(url => ({ json: url }));"
      },
      "name": "Extract URLs",
      "type": "n8n-nodes-base.code",
      "position": [850, 300]
    }
  ],
  "connections": {
    "Manual Trigger": { "main": [[{ "node": "Fetch Sitemap" }]] },
    "Fetch Sitemap": { "main": [[{ "node": "XML to JSON" }]] },
    "XML to JSON": { "main": [[{ "node": "Extract URLs" }]] }
  }
}
Google Indexing API Automation
The Google Indexing API lets you notify Google about new or updated URLs, accelerating the crawl and index process. For agencies, this means faster visibility for client content.
Important Limitation
Google restricts the Indexing API to specific content types:
- JobPosting structured data pages
- BroadcastEvent with VideoObject
- Livestream structured data
For general web pages, the API still accepts submissions, but Google may deprioritize non-qualifying URLs. Many SEO professionals report success with broader content, but results vary.
Setting Up Google Cloud Service Account
Step 1: Create a Google Cloud Project
- Go to Google Cloud Console
- Create a new project or select existing
- Enable the Indexing API from the API Library
Step 2: Create Service Account
- Navigate to IAM & Admin > Service Accounts
- Click Create Service Account
- Name it (e.g., “n8n-indexing”)
- Grant no specific roles (not needed)
- Click Done
Step 3: Generate Key
- Click on your new service account
- Go to Keys tab
- Click Add Key > Create new key
- Choose JSON format
- Download and secure the key file
Step 4: Add to Search Console
- Open Google Search Console
- Select the property you want to index
- Go to Settings > Users and permissions
- Click Add user
- Enter the service account email (from the JSON key file)
- Set permission to Owner
Building the Indexing Workflow
Schedule Trigger → Fetch Sitemap → XML to JSON → Filter Recent →
Google Indexing API → Log Results
Credential Setup in n8n:
- Create a new Google Service Account credential
- Paste the entire JSON key file content
- Set scopes to include https://www.googleapis.com/auth/indexing
HTTP Request for Indexing API:
Method: POST
URL: https://indexing.googleapis.com/v3/urlNotifications:publish
Authentication: Google Service Account
Headers:
Content-Type: application/json
Body:
{
"url": "{{ $json.loc }}",
"type": "URL_UPDATED"
}
The type field accepts:
- URL_UPDATED: New or updated content
- URL_DELETED: Removed content
Handling the 200/Day Quota
Google limits free accounts to 200 publish requests per day. For agencies with large portfolios, this requires smart prioritization.
Strategy 1: Filter by lastmod
Only submit URLs modified in the last 24-48 hours:
// Get all URLs from previous node
const items = $input.all();
// Calculate cutoff: only URLs modified in last 2 days
const cutoffDate = new Date();
cutoffDate.setDate(cutoffDate.getDate() - 2);
// Filter to recent URLs only
return items.filter(item => {
// Skip URLs without lastmod date
if (!item.json.lastmod) return false;
// Compare modification date against cutoff
const lastmod = new Date(item.json.lastmod);
return lastmod >= cutoffDate;
});
Strategy 2: Track Submitted URLs
Use n8n’s Static Data or an external database to track what’s been submitted, and have a downstream node write timestamps back to submittedUrls after each successful API call so future runs skip those URLs:
// Access persistent storage that survives between workflow runs
const staticData = $getWorkflowStaticData('global');
const submitted = staticData.submittedUrls || {}; // {url: timestamp}
const now = Date.now();
const sevenDaysAgo = now - (7 * 24 * 60 * 60 * 1000); // 7 days in ms
// Get all URLs from previous node
const items = $input.all();
// Only keep URLs we haven't submitted recently
const toSubmit = items.filter(item => {
const url = item.json.loc;
const lastSubmitted = submitted[url];
// Include if never submitted OR submitted over 7 days ago
return !lastSubmitted || lastSubmitted < sevenDaysAgo;
});
// Limit to 200 to stay within Google's daily quota
return toSubmit.slice(0, 200);
Strategy 3: Prioritize by Page Type
Submit high-value pages first:
const items = $input.all();
// Assign importance scores based on URL patterns and freshness
const scored = items.map(item => {
let score = 0;
const url = item.json.loc;
// Higher scores = higher priority for indexing
if (url.includes('/product/')) score += 10; // Product pages: high value
if (url.includes('/blog/')) score += 5; // Blog posts: medium value
if (url.includes('/landing/')) score += 8; // Landing pages: high value
// Boost recently modified content
if (item.json.lastmod) {
const age = Date.now() - new Date(item.json.lastmod);
if (age < 86400000) score += 15; // 86400000ms = 24 hours
}
return { ...item, score };
});
// Sort highest scores first, take top 200 for quota
return scored
.sort((a, b) => b.score - a.score)
.slice(0, 200)
.map(item => ({ json: item.json }));
Error Handling
The Indexing API returns specific error codes:
| Code | Meaning | Action |
|---|---|---|
| 200 | Success | Log and continue |
| 403 | Not authorized | Check Search Console ownership |
| 429 | Quota exceeded | Stop processing, retry tomorrow |
| 400 | Invalid URL | Skip and log for review |
Add error handling with a Code node after the HTTP Request:
// Get response from Google Indexing API
const response = $input.first().json;
// Check if API returned an error
if (response.error) {
const errorCode = response.error.code;
// 429 = quota exceeded, stop entire workflow
if (errorCode === 429) {
throw new Error('Quota exceeded - stopping workflow');
}
// Other errors: log and continue to next URL
return [{
json: {
status: 'error',
code: errorCode,
message: response.error.message,
url: $('Previous Node').first().json.loc // Reference URL from earlier node
}
}];
}
// Success: extract useful response data
return [{
json: {
status: 'success',
url: $('Previous Node').first().json.loc,
// Optional chaining (?.) safely handles missing nested properties
notifyTime: response.urlNotificationMetadata?.latestUpdate?.notifyTime
}
}];
Bing IndexNow Integration
IndexNow is an open protocol supported by Bing, Yandex, and other search engines. Unlike Google’s API, IndexNow has no daily quota limits and works with any content type.
How IndexNow Works
- Generate an API key (any unique string)
- Host the key file at your domain root
- Submit URLs via simple HTTP request
- Participating search engines receive the notification
Setting Up IndexNow
Step 1: Generate API Key
Create a unique key (32+ alphanumeric characters):
// Generate in a Code node or with any random-string generator.
// Note: the global crypto object requires a recent Node.js runtime;
// otherwise fall back to Node's crypto.randomBytes(16).toString('hex').
const key = Array.from(crypto.getRandomValues(new Uint8Array(16)))
  .map(b => b.toString(16).padStart(2, '0'))
  .join('');
Step 2: Host Key File
Create a text file named {your-key}.txt containing just the key, and host it at:
https://yourdomain.com/{your-key}.txt
Step 3: Submit URLs
POST https://api.indexnow.org/indexnow
Content-Type: application/json
{
"host": "yourdomain.com",
"key": "your-api-key",
"urlList": [
"https://yourdomain.com/page-1",
"https://yourdomain.com/page-2"
]
}
Combined Google + Bing Workflow
Submit to both services in parallel for maximum coverage:
Schedule → Fetch Sitemap → XML to JSON → Filter Recent →
├── Google Indexing API
└── IndexNow API
→ Merge Results → Log
IndexNow HTTP Request:
Method: POST
URL: https://api.indexnow.org/indexnow
Headers:
Content-Type: application/json
Body:
{
"host": "{{ $('Config').first().json.domain }}",
"key": "{{ $('Config').first().json.indexNowKey }}",
"urlList": {{ JSON.stringify($input.all().map(i => i.json.loc)) }}
}
IndexNow accepts batches of up to 10,000 URLs per request, making it ideal for large sites.
Multi-Client IndexNow
For agencies, maintain a configuration node with client domains and keys:
const clients = [
{ domain: 'client1.com', indexNowKey: 'key1...', sitemapUrl: 'https://client1.com/sitemap.xml' },
{ domain: 'client2.com', indexNowKey: 'key2...', sitemapUrl: 'https://client2.com/sitemap.xml' },
// ... more clients
];
return clients.map(client => ({ json: client }));
Loop through clients and submit each to IndexNow with their respective keys.
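A Code node can also shape the per-client IndexNow payload before a single HTTP Request node posts it. A minimal sketch, assuming each incoming item carries domain, indexNowKey, and a urls array collected earlier in the workflow (the field names are illustrative):
// Build one IndexNow payload per client
return $input.all().map(item => ({
  json: {
    host: item.json.domain,
    key: item.json.indexNowKey,
    // keyLocation is optional in the IndexNow protocol; it points engines at the hosted key file
    keyLocation: `https://${item.json.domain}/${item.json.indexNowKey}.txt`,
    // IndexNow caps a single request at 10,000 URLs
    urlList: (item.json.urls || []).slice(0, 10000)
  }
}));
Each resulting item can then be sent as the JSON body of the IndexNow HTTP Request.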
Competitor Sitemap Monitoring
Tracking competitor content changes gives agencies strategic intelligence. When a competitor publishes new pages targeting your client’s keywords, you want to know immediately.
Building a Competitor Monitor
The core pattern:
- Fetch competitor sitemap
- Compare against previous snapshot
- Identify new, modified, and removed URLs
- Alert on significant changes
Storage Options for Snapshots:
- n8n Static Data: Simple, limited to workflow context
- Google Sheets: Easy to view and share
- Airtable/Notion: Rich filtering and views
- PostgreSQL/MySQL: Scalable for large datasets
Detection Workflow
Schedule (daily) → Fetch Competitor Sitemap → XML to JSON →
Code (compare with previous) → IF changes → Slack Alert
Comparison Logic:
// Get persistent storage for cross-run comparison
const staticData = $getWorkflowStaticData('global');
// Extract current sitemap URLs into simple objects
const currentUrls = $input.all().map(item => ({
loc: item.json.loc,
lastmod: item.json.lastmod
}));
// Load last run's snapshot (empty array if first run)
const previousUrls = staticData.competitorSnapshot || [];
// Create Maps for fast lookup: url -> lastmod date
const previousMap = new Map(previousUrls.map(u => [u.loc, u.lastmod]));
const currentMap = new Map(currentUrls.map(u => [u.loc, u.lastmod]));
// Find NEW: URLs in current that weren't in previous
const newUrls = currentUrls.filter(u => !previousMap.has(u.loc));
// Find REMOVED: URLs in previous that aren't in current
const removedUrls = previousUrls.filter(u => !currentMap.has(u.loc));
// Find MODIFIED: URLs that exist in both but lastmod changed
const modifiedUrls = currentUrls.filter(u => {
const prev = previousMap.get(u.loc);
return prev && prev !== u.lastmod; // Exists AND date changed
});
// Save current snapshot for next workflow run
staticData.competitorSnapshot = currentUrls;
// Return summary with all detected changes
return [{
json: {
competitor: $('Config').first().json.competitorDomain,
timestamp: new Date().toISOString(),
totalUrls: currentUrls.length,
newUrls: newUrls,
removedUrls: removedUrls,
modifiedUrls: modifiedUrls,
hasChanges: newUrls.length > 0 || removedUrls.length > 0 || modifiedUrls.length > 0
}
}];
Alert Configuration
Send meaningful alerts, not noise:
const changes = $input.first().json;
if (!changes.hasChanges) {
return []; // No alert needed
}
// Format for Slack
const blocks = [];
if (changes.newUrls.length > 0) {
blocks.push({
type: 'section',
text: {
type: 'mrkdwn',
text: `*New Pages (${changes.newUrls.length}):*\n${changes.newUrls.slice(0, 5).map(u => `• ${u.loc}`).join('\n')}${changes.newUrls.length > 5 ? `\n...and ${changes.newUrls.length - 5} more` : ''}`
}
});
}
if (changes.modifiedUrls.length > 0) {
blocks.push({
type: 'section',
text: {
type: 'mrkdwn',
text: `*Updated Pages (${changes.modifiedUrls.length}):*\n${changes.modifiedUrls.slice(0, 5).map(u => `• ${u.loc}`).join('\n')}`
}
});
}
return [{
json: {
channel: '#seo-alerts',
text: `Competitor changes detected: ${changes.competitor}`,
blocks: blocks
}
}];
Building Intelligence Dashboards
For deeper analysis, store competitor data in Google Sheets:
| Date | Competitor | New URLs | Removed URLs | Modified URLs | Total Pages |
|---|---|---|---|---|---|
| 2024-01-15 | competitor1.com | 5 | 0 | 12 | 1,234 |
| 2024-01-14 | competitor1.com | 2 | 1 | 8 | 1,229 |
Over time, this reveals:
- Content velocity (how often they publish)
- Content strategy (what sections grow)
- Pruning patterns (what they remove)
Link this data to your client reporting for strategic recommendations.
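A small Code node can flatten the change summary from the comparison step into one row per run before appending it with the Google Sheets node. A sketch, using column names that match the table above (adjust them to your sheet):
// Flatten the competitor change summary into a single spreadsheet row
const changes = $input.first().json;
return [{
  json: {
    Date: changes.timestamp.split('T')[0],
    Competitor: changes.competitor,
    'New URLs': changes.newUrls.length,
    'Removed URLs': changes.removedUrls.length,
    'Modified URLs': changes.modifiedUrls.length,
    'Total Pages': changes.totalUrls
  }
}];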
Multi-Client SEO Health Automation
Agencies need systematic monitoring across their entire client portfolio. Instead of checking sites one by one, build workflows that surface issues proactively.
URL Validation Workflow
Check that all sitemap URLs return 200 status:
Schedule (weekly) → Loop Clients → Fetch Sitemap → Extract URLs →
Loop URLs → HTTP Request (HEAD) → Filter Non-200 → Aggregate → Report
Efficient URL Checking:
Use HEAD requests to minimize bandwidth:
Method: HEAD
URL: {{ $json.loc }}
Options:
Follow Redirects: true
Timeout: 10000
Ignore Response Data: true
Aggregate Issues by Client:
// Get all URL check results from previous node
const items = $input.all();
const issues = {}; // Will hold: { clientName: [issue1, issue2, ...] }
// Group non-200 responses by client
items.forEach(item => {
const client = item.json.client;
const status = item.json.statusCode;
// Only track problematic URLs (not 200 OK)
if (status !== 200) {
// Initialize array for this client if first issue
if (!issues[client]) issues[client] = [];
issues[client].push({
url: item.json.url,
status: status,
error: item.json.error || null
});
}
});
// Convert to array format: one item per client with their issues
return Object.entries(issues).map(([client, urls]) => ({
json: { client, issues: urls, count: urls.length }
}));
Client Reporting Templates
Generate weekly SEO health reports:
const clientData = $input.first().json;
const report = {
client: clientData.client,
reportDate: new Date().toISOString().split('T')[0],
metrics: {
totalUrls: clientData.totalUrls,
indexedUrls: clientData.indexedUrls,
brokenUrls: clientData.brokenUrls.length,
newPages: clientData.newPages.length,
updatedPages: clientData.updatedPages.length
},
issues: clientData.brokenUrls.map(u => ({
url: u.url,
status: u.status,
severity: u.status === 404 ? 'high' : 'medium'
})),
recommendations: []
};
// Generate recommendations based on data
if (report.metrics.brokenUrls > 0) {
report.recommendations.push({
priority: 'high',
action: `Fix ${report.metrics.brokenUrls} broken URLs in sitemap`,
impact: 'Improves crawl efficiency and user experience'
});
}
if (report.metrics.newPages > 10) {
report.recommendations.push({
priority: 'medium',
action: 'Submit new pages to Google Indexing API',
impact: 'Accelerates indexing of recent content'
});
}
return [{ json: report }];
Tracking Page Removal and Redirects
Monitor for unintentional content loss:
const staticData = $getWorkflowStaticData('global');
const clientId = $json.clientId;
const currentUrls = new Set($input.all().map(i => i.json.loc));
const previousUrls = staticData[`${clientId}_urls`] || []; // static data persists as JSON, so store arrays (not Sets)
// URLs that disappeared
const removedUrls = [...previousUrls].filter(url => !currentUrls.has(url));
// Check if removed URLs redirect or 404
const toCheck = removedUrls.map(url => ({ json: { url, type: 'removed' } }));
// Update stored URLs
staticData[`${clientId}_urls`] = [...currentUrls];
return toCheck;
Follow up with HTTP requests to determine whether removed URLs (a classification sketch follows this list):
- Return 404 (content deleted)
- Redirect 301 (intentional move)
- Redirect 302 (temporary, which might signal an issue)
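The classification itself can live in a Code node after the follow-up HTTP Request. A rough sketch, assuming redirect following is disabled and the request is configured to return status codes (the statusCode field name is an assumption; match it to your node's output):
// Classify what happened to URLs that disappeared from the sitemap
return $input.all().map(item => {
  const status = item.json.statusCode;
  let verdict = 'unknown';
  if (status === 404 || status === 410) verdict = 'content deleted';
  else if (status === 301 || status === 308) verdict = 'permanent redirect (intentional move)';
  else if (status === 302 || status === 307) verdict = 'temporary redirect (review)';
  else if (status === 200) verdict = 'still live but dropped from the sitemap';
  return { json: { url: item.json.url, status, verdict } };
});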
Agency-Scale Patterns
Managing sitemap operations across 50+ client sites requires architectural decisions that simple workflows do not address.
Centralized Configuration
Store all client data in a single source:
Google Sheets approach:
| Client | Domain | Sitemap URL | IndexNow Key | Google SA | Active |
|---|---|---|---|---|---|
| Acme Corp | acme.com | https://acme.com/sitemap.xml | abc123… | SA JSON | true |
| Beta Inc | beta.io | https://beta.io/sitemap.xml | def456… | SA JSON | true |
Code node to load config:
// Get all rows from Google Sheets
const sheets = $input.all();
// Filter to only active clients (skip paused/churned)
const activeClients = sheets.filter(row => row.json.Active === 'true');
// Transform to clean config objects for downstream nodes
return activeClients.map(row => ({
json: {
client: row.json.Client,
domain: row.json.Domain,
sitemapUrl: row.json['Sitemap URL'], // Bracket notation for spaces in column names
indexNowKey: row.json['IndexNow Key']
}
}));
Batching for Large Sitemaps
Sites with 10,000+ URLs need chunked processing. See our batch processing guide for detailed patterns.
Simple batching:
// Get all URLs from previous node
const items = $input.all();
const batchSize = 100; // URLs per batch
const batches = [];
// Split items into chunks of batchSize
for (let i = 0; i < items.length; i += batchSize) {
batches.push(items.slice(i, i + batchSize));
}
// Return each batch as a separate item with metadata
return batches.map((batch, index) => ({
json: {
batchIndex: index, // 0, 1, 2, ...
batchSize: batch.length, // Actual size (last batch may be smaller)
urls: batch.map(item => item.json.loc) // Just the URLs for this batch
}
}));
Scheduling Strategies
Balance thoroughness with resource usage:
| Operation | Recommended Frequency | Rationale |
|---|---|---|
| Google Indexing API | Daily (new content) | Quota limits, focus on fresh content |
| IndexNow submission | On content publish | No quota, immediate notification |
| Competitor monitoring | Daily | Catch changes quickly |
| Health checks (200 status) | Weekly | URLs don’t break frequently |
| Full sitemap diff | Weekly | Detect structural changes |
Stagger client processing:
Don’t run all 50 clients at the same time. Use scheduling patterns:
// Distribute clients across hours
const clients = $input.all();
const baseHour = 6; // Start at 6 AM
return clients.map((client, index) => ({
json: {
...client.json,
scheduledHour: (baseHour + Math.floor(index / 10)) % 24,
scheduledMinute: (index % 10) * 6 // Spread across the hour
}
}));
Rate Limiting Across Clients
When hitting external APIs across multiple clients, respect rate limits. Our API rate limiting guide covers this in depth.
Simple delay between clients:
Add a Wait node between client processing loops, or use Code node delays:
// Add delay between API calls
await new Promise(resolve => setTimeout(resolve, 1000)); // 1 second delay
Troubleshooting
Common XML Parsing Errors
“Non-whitespace before first tag”
The sitemap has content before the XML declaration. Common causes:
- BOM (Byte Order Mark) at file start
- Server adding headers or error messages
- Caching plugin prepending content
Solution:
// Get raw XML string from previous node
let xml = $json.data;
// Remove BOM (Byte Order Mark) - invisible character some editors add
if (xml.charCodeAt(0) === 0xFEFF) {
xml = xml.slice(1);
}
// Remove leading/trailing whitespace
xml = xml.trim();
// Find where actual XML starts (might have junk before it)
const xmlStart = xml.indexOf('<?xml'); // XML declaration
const tagStart = xml.indexOf('<'); // First tag
const start = xmlStart >= 0 ? xmlStart : tagStart;
// Remove any content before the XML starts
if (start > 0) {
xml = xml.slice(start);
}
// Return cleaned XML for processing
return [{ json: { data: xml } }];
“Unexpected close tag”
Malformed XML with mismatched tags. Usually means:
- Server error page returned instead of sitemap
- Sitemap generator bug
- Incomplete response due to timeout
Check with Code node:
const response = $json;
// Verify we got XML, not an error page
if (!response.data.includes('<urlset') && !response.data.includes('<sitemapindex')) {
throw new Error('Response does not appear to be a sitemap');
}
return [$input.first()];
API Quota Management
Google Indexing API 429 errors
Track submissions and stop before hitting limits:
// Access persistent storage for quota tracking
const staticData = $getWorkflowStaticData('global');
const today = new Date().toISOString().split('T')[0]; // "2024-01-15" format
// Reset counter when date changes (new day = fresh quota)
if (staticData.quotaDate !== today) {
staticData.quotaDate = today;
staticData.quotaUsed = 0;
}
// Calculate how many submissions we can still make today
const remaining = 200 - staticData.quotaUsed;
const urls = $input.all().slice(0, remaining); // Take only what quota allows
// Update our running count
staticData.quotaUsed += urls.length;
// Stop workflow if no quota left
if (urls.length === 0) {
throw new Error('Daily quota exhausted');
}
return urls;
Network Timeout Issues
Large sitemaps or slow servers cause timeouts. Solutions (node settings are sketched after this list):
- Increase timeout: Set HTTP Request timeout to 30-60 seconds
- Retry logic: Use n8n’s built-in retry on failure
- Head request first: Check if URL responds before full fetch
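For reference, the first two options above correspond roughly to these settings (exact option names vary slightly between n8n versions):
HTTP Request node options:
  Timeout: 60000
Node settings:
  Retry On Fail: true
  Max Tries: 3
  Wait Between Tries (ms): 5000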
For persistent issues, see our timeout troubleshooting guide.
When to Get Help
Sitemap automation at scale involves:
- Multiple API integrations with different authentication methods
- Data pipeline architecture for large URL volumes
- Error handling for unreliable external services
- Client-specific customizations and edge cases
Our workflow development services can build production-ready sitemap automation tailored to your agency’s needs. For architectural guidance on complex monitoring systems, explore our consulting services.
Use our workflow debugger to troubleshoot parsing issues, and the expression validator when building complex data transformations.
Frequently Asked Questions
How do I handle sitemaps with thousands of URLs without hitting memory limits?
Large sitemaps can exceed n8n’s memory capacity, especially when processing 50,000+ URLs with full metadata. The solution is chunked processing.
Approach 1: Process in batches
Instead of loading all URLs into memory, process them in chunks:
// Get parsed sitemap with all URLs
const sitemap = $input.first().json;
const allUrls = sitemap.urlset.url;
const chunkSize = 1000; // Process 1000 URLs at a time
// Get current chunk index (0 on first run)
const chunkIndex = $('Config').first().json.currentChunk || 0;
// Extract just this chunk of URLs
const chunk = allUrls.slice(
chunkIndex * chunkSize, // Start index
(chunkIndex + 1) * chunkSize // End index
);
// Check if there are more chunks to process
const hasMore = (chunkIndex + 1) * chunkSize < allUrls.length;
return [{
json: {
urls: chunk,
currentChunk: chunkIndex,
nextChunk: hasMore ? chunkIndex + 1 : null, // null when done
totalChunks: Math.ceil(allUrls.length / chunkSize)
}
}];
Use a loop-back pattern with an IF node to process the next chunk until complete.
Approach 2: Stream with pagination
Some sites support paginated sitemaps. Check if parameters like ?page=1 work and fetch incrementally.
Approach 3: External processing
For truly massive sitemaps (100,000+ URLs), consider preprocessing with a dedicated script and storing results in a database that n8n queries in manageable chunks.
Why does Google show my sitemap has errors but my workflow parses it fine?
Google’s sitemap validator is stricter than most XML parsers. Common discrepancies:
URL encoding issues: Your workflow sees <loc>https://site.com/café</loc> fine, but Google expects <loc>https://site.com/caf%C3%A9</loc> with proper URL encoding.
Mixed protocols: If your sitemap contains both HTTP and HTTPS URLs, or URLs that don’t match your Search Console property, Google flags errors.
Namespace requirements: Google validates against the exact Sitemap Protocol schema. Missing or incorrect xmlns declarations cause validation failures that parsers ignore.
Solution: After parsing, add validation:
const url = $json.loc;
const errors = []; // Collect all validation issues
// Test 1: Is it a valid URL structure?
try {
new URL(url); // Throws if invalid
} catch {
errors.push('Invalid URL format');
}
// Test 2: Are special characters properly encoded?
// If encoding then decoding changes the URL, it has issues
if (url !== encodeURI(decodeURI(url))) {
errors.push('URL encoding issues');
}
// Test 3: Is it using HTTPS? (Google prefers HTTPS)
const expectedProtocol = 'https:';
if (!url.startsWith(expectedProtocol)) {
errors.push('Non-HTTPS URL');
}
// Return URL with validation results
return [{ json: { url, errors, valid: errors.length === 0 } }];
Can I use the Google Indexing API for any website or only specific types?
Officially, Google restricts the Indexing API to pages with JobPosting, BroadcastEvent (with VideoObject), or livestream structured data. These are time-sensitive content types where fast indexing matters most.
In practice: Many SEO professionals use the API for general content and report mixed results. Google may:
- Accept the submission but not prioritize crawling
- Index the page at normal speed rather than accelerated
- Ignore submissions entirely for non-qualifying pages
Recommendation for agencies:
- Use the Indexing API for qualifying client pages (job boards, event sites)
- Use IndexNow for all other content (no restrictions)
- Submit to Google Search Console programmatically as a fallback
- Monitor actual indexing in Search Console to verify effectiveness
Never rely solely on the Indexing API for critical content. Combine with IndexNow and ensure proper technical SEO fundamentals.
How often should I run sitemap monitoring workflows?
Frequency depends on the use case and acceptable latency:
| Monitoring Type | Recommended Frequency | Rationale |
|---|---|---|
| Own site indexing | On publish + daily catchup | Submit new content immediately, daily for any missed |
| Competitor tracking | Daily | Balance between awareness and resource usage |
| Health checks (broken URLs) | Weekly | URLs rarely break spontaneously |
| Full structural audit | Monthly | Catch gradual issues like bloat or orphaned sections |
| Critical client sites | Every 6 hours | Higher priority justifies more resources |
Cost considerations: Each workflow run consumes n8n executions and potentially API quotas. Running competitor monitoring every hour across 50 competitors means 1,200 executions daily just for that workflow.
Start conservative: Begin with daily monitoring and increase frequency only when you’ve proven the value and have capacity.
What’s the difference between submitting to Google via Indexing API vs Search Console?
Both notify Google about URLs, but they work differently:
Indexing API:
- Direct API call with OAuth authentication
- Designed for high-priority, time-sensitive content
- 200 submissions/day quota (free tier)
- Immediate acknowledgment
- Best for: Job postings, live events, breaking news
Search Console sitemap submission:
- Submit entire sitemap for Google to crawl
- No per-URL quota (sitemap-level)
- Google decides crawl priority and timing
- Can take hours to days for new content
- Best for: Bulk updates, ongoing maintenance
Programmatic Search Console submission:
You can submit sitemaps via the Search Console API:
PUT https://www.googleapis.com/webmasters/v3/sites/{siteUrl}/sitemaps/{feedpath}
This tells Google “my sitemap changed, please re-crawl” without per-URL limits.
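In n8n this can be a single HTTP Request node reusing the service account credential, provided its scopes also include https://www.googleapis.com/auth/webmasters. A sketch of the configuration (the property names inside the expressions are assumptions):
Method: PUT
URL: https://www.googleapis.com/webmasters/v3/sites/{{ encodeURIComponent($json.siteUrl) }}/sitemaps/{{ encodeURIComponent($json.sitemapUrl) }}
Authentication: Google Service Account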
Optimal strategy: Combine both approaches:
- Submit qualifying pages to Indexing API for fast indexing
- Submit all other pages to IndexNow
- Ping Search Console when sitemap updates significantly
- Let normal crawling handle the rest