n8n Web Scraping: Extract, Clean, and Process Website Data
The data you need is sitting right there on websites, but getting it into your systems feels impossible.
Competitor prices update daily. Lead lists live in business directories. Industry news scatters across dozens of sources. Product catalogs change without warning. And you’re stuck copying and pasting like it’s 1999.
The Manual Data Trap
Every hour spent copying website data into spreadsheets is an hour wasted. Manual collection introduces errors: typos, missed updates, inconsistent formatting.
But the real cost isn’t time.
It’s the decisions you can’t make because you don’t have current data. By the time you manually check competitor prices, they’ve already changed. By the time you copy that lead list, half the contacts are stale.
The alternatives aren’t great either:
- Hiring developers for custom scrapers costs thousands
- Third-party scraping services charge per request
- Most “no-code” tools lock you into their platforms
Here’s the truth: Web scraping automation captures data when it changes, formats it consistently, and delivers it where you need it. No human in the loop means no delays, no errors, and no limits on scale.
Why n8n for Web Scraping
Traditional web scraping requires programming. Python with BeautifulSoup. Node.js with Puppeteer. Code-heavy approaches that need developers to build and maintain.
n8n offers something different:
- Visual workflow building shows exactly how data flows
- Native HTTP and HTML nodes handle most scraping without external services
- Community nodes add headless browser support when needed
- Self-hosting option means no per-request costs at any scale
- 400+ integrations route scraped data anywhere
Whether you’re monitoring prices, generating leads, or aggregating content, n8n provides the building blocks for the entire pipeline.
What You’ll Learn
This guide covers everything for production-ready scraping:
- Core scraping patterns using HTTP Request and HTML nodes
- Static website extraction with CSS selectors and pagination handling
- Dynamic content strategies for JavaScript-rendered pages
- Anti-bot protection bypass techniques that actually work
- AI-powered extraction when traditional selectors fail
- Data cleaning workflows for production-quality output
- Complete real-world examples you can adapt immediately
- Troubleshooting techniques for common scraping failures
By the end, you’ll know exactly which approach to use for any website and how to build reliable, maintainable scraping automations.
Web Scraping Fundamentals in n8n
Every n8n scraping workflow follows a predictable pattern. Understanding this foundation makes building new scrapers straightforward.
The Three-Node Pattern
Most scraping workflows are built around three core nodes, followed by processing and storage:
Trigger → HTTP Request → HTML Extract → Process/Store
Trigger starts the workflow. Schedule triggers run on a timer. Webhook triggers respond to external events. Manual triggers work for testing.
HTTP Request fetches the webpage. It downloads the raw HTML just like your browser does, but without rendering JavaScript or loading images.
HTML Extract parses the HTML and pulls out specific data using CSS selectors. It transforms messy HTML into clean JSON you can work with.
Everything after extraction is processing: cleaning data, transforming formats, filtering duplicates, and routing to destinations like databases, spreadsheets, or APIs.
Static vs. Dynamic Content
This distinction determines which approach you’ll use:
| Content Type | How to Identify | Approach |
|---|---|---|
| Static HTML | View Page Source (Ctrl+U) shows the data | HTTP Request + HTML node |
| JavaScript-rendered | Data appears in browser but not in page source | Headless browser or API discovery |
| API-backed | Network tab shows JSON requests | Call the API directly |
To check a website:
- Open the page in your browser
- Press Ctrl+U (or Cmd+Option+U on Mac) to view source
- Search for the data you want to extract
- If it’s there, use the standard approach
- If it’s not, the site renders content with JavaScript
Most websites still serve static HTML for their main content. JavaScript rendering is more common on web applications (dashboards, SPAs) than on content sites, e-commerce, or directories.
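You can also run this check from inside n8n. After an HTTP Request node fetches the page, a minimal Code node sketch like the one below reports whether a known piece of text appears in the raw HTML (the data field is the property used throughout this guide for the HTTP Request output, and the target string is a placeholder):
// Sketch: confirm the target text exists in the raw, un-rendered HTML
const html = $input.first().json.data || '';
const target = 'Add to cart'; // placeholder: text you expect to see on the page
return [{
  json: {
    containsTarget: html.includes(target),
    htmlLength: html.length
  }
}];
If containsTarget comes back false while the text is visible in your browser, treat the page as JavaScript-rendered.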
Node Combinations for Different Scenarios
| Scenario | Nodes | Notes |
|---|---|---|
| Simple page scrape | HTTP Request → HTML | Works for most static sites |
| Multiple pages | Schedule → Loop → HTTP → HTML → Aggregate | Add delays between requests |
| Login required | HTTP Request (auth) → HTTP Request (page) → HTML | Manage session cookies |
| JavaScript site | HTTP to headless service → HTML | Use Browserless, ScrapingBee, etc. |
| API available | HTTP Request (API endpoint) | Skip HTML parsing entirely |
| AI extraction | HTTP Request → AI node | When structure is unpredictable |
For detailed configuration of these nodes, see our HTTP Request node guide and HTML node documentation.
Static Website Scraping
Let’s build a complete scraping workflow step by step. We’ll scrape product information from a catalog page.
Step 1: Fetch the Page
Add an HTTP Request node:
- Set Method to GET
- Enter your target URL
- Click Test step to verify it returns HTML
The response contains the raw HTML in the data field.
Pro tip: Add request headers to appear more like a real browser:
| Header | Value |
|---|---|
| User-Agent | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 |
| Accept | text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 |
| Accept-Language | en-US,en;q=0.5 |
These headers prevent many basic bot-detection systems from blocking your requests.
Step 2: Extract Data with CSS Selectors
Add an HTML node after HTTP Request:
- Set Operation to “Extract HTML Content”
- Set Source Data to JSON
- Set JSON Property to data (or whatever field contains your HTML)
- Add extraction values for each piece of data you need
Example extraction configuration:
| Key | CSS Selector | Return Value | Return Array |
|---|---|---|---|
| productNames | .product-card h2 | Text | Yes |
| prices | .product-card .price | Text | Yes |
| links | .product-card a | Attribute (href) | Yes |
| images | .product-card img | Attribute (src) | Yes |
Enable “Return Array” when extracting multiple items of the same type. Without it, you only get the first match.
Step 3: Combine Parallel Arrays
The HTML node returns parallel arrays. To create structured product objects, add a Code node:
const items = $input.all();
const data = items[0].json;
// Combine parallel arrays into product objects
const products = data.productNames.map((name, index) => ({
json: {
name: name,
price: data.prices[index],
url: data.links[index],
image: data.images[index]
}
}));
return products;
Each product becomes a separate n8n item, ready for further processing or storage.
Step 4: Handle Pagination
Most sites split results across multiple pages. Handle pagination with a loop:
Option 1: Known page count
Use the Loop node with a fixed iteration count, building URLs like /products?page=1, /products?page=2, etc. (a URL-generation sketch follows the options below).
Option 2: Follow “Next” links
Extract the next page URL from each page and loop until no more pages exist.
Option 3: HTTP Request pagination
The HTTP Request node has built-in pagination support. Configure it under Options → Pagination to automatically follow page links.
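As a sketch of Option 1 (the base URL and page count below are placeholders), a Code node can emit one item per page URL; an HTTP Request node that follows it with {{ $json.url }} as its URL will then fetch each page in turn:
// Sketch: generate one item per page URL for a known page count
const baseUrl = 'https://example.com/products'; // placeholder base URL
const pageCount = 5; // assumed known number of pages
return Array.from({ length: pageCount }, (_, i) => ({
  json: { url: `${baseUrl}?page=${i + 1}` }
}));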
Important: Add delays between requests using the Wait node. Hammering a server with rapid requests gets you blocked. Start with 1-2 seconds between requests and adjust based on the site’s response.
For large-scale scraping with rate limiting, see our API rate limits guide.
Step 5: Clean and Store Data
Before storing, clean the extracted data. Common cleanup tasks:
const items = $input.all();
return items.map(item => {
const data = item.json;
return {
json: {
// Remove currency symbols, convert to number
price: parseFloat(data.price.replace(/[^0-9.]/g, '')),
// Clean whitespace
name: data.name.trim(),
// Make URLs absolute
url: data.url.startsWith('http')
? data.url
: `https://example.com${data.url}`,
// Add metadata
scrapedAt: new Date().toISOString()
}
};
});
Then route to your destination: Google Sheets, Airtable, PostgreSQL, or any of n8n’s 400+ integrations.
For complex data transformation patterns, see our data transformation guide.
Dynamic Content and JavaScript-Rendered Sites
When HTTP Request returns HTML without your target data, you’re dealing with a JavaScript-rendered site. The content loads after initial page load through JavaScript execution.
Why Standard HTTP Fails
The HTTP Request node downloads raw HTML. It doesn’t:
- Execute JavaScript
- Wait for AJAX requests to complete
- Render the page visually
Modern web applications (React, Vue, Angular) often render content entirely through JavaScript. The initial HTML is just a shell that says “load this JavaScript file,” and the actual content appears only after that JavaScript runs.
Solution 1: Find the Underlying API
Many JavaScript sites fetch data from APIs. You can call these APIs directly, bypassing JavaScript rendering entirely.
How to find the API:
- Open the page in Chrome
- Press F12 to open DevTools
- Go to the Network tab
- Filter by “Fetch/XHR”
- Refresh the page and interact with it
- Look for JSON responses containing your target data
Once you find the API endpoint, use HTTP Request to call it directly. This is faster and more reliable than any rendering approach.
Example: A product page might fetch data from https://api.example.com/products/123. Call that endpoint directly instead of scraping the HTML.
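Because the endpoint already returns JSON, the follow-up work is mapping rather than parsing. Here is a hedged sketch of a Code node placed after that HTTP Request; the field names (products, title, price, url) are assumptions about the response shape and need to match the real API:
// Sketch: map a discovered API response into one n8n item per product
// Field names are assumptions; inspect the real JSON and adjust
const response = $input.first().json;
return (response.products || []).map(product => ({
  json: {
    name: product.title,
    price: product.price,
    url: product.url
  }
}));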
Solution 2: Headless Browser Services
When no API exists, use a headless browser service. These services run a real browser, execute JavaScript, and return the fully-rendered HTML.
Popular options:
| Service | How It Works | Pricing Model |
|---|---|---|
| Browserless | Hosted Puppeteer/Playwright | Per-minute or flat rate |
| ScrapingBee | Managed scraping API | Per-request |
| Firecrawl | AI-powered extraction | Per-request |
| Apify | Full scraping platform | Per-compute-unit |
Integration pattern:
- Use HTTP Request to call the service’s API
- Send your target URL in the request
- Receive rendered HTML in the response
- Parse with the HTML node as normal
// Example request parameters for a rendering service (exact names vary by provider)
{
"url": "https://javascript-heavy-site.com/page",
"render_js": true,
"wait": 2000
}
The service handles JavaScript execution, and you get back HTML that contains all the data.
Solution 3: Community Nodes
The n8n community has built nodes for browser automation:
- Puppeteer nodes provide direct headless Chrome control
- Playwright nodes offer cross-browser automation
Search the n8n community nodes for current options. Installation requires self-hosted n8n with community nodes enabled.
Comparison: Which Approach to Use
| Approach | Best For | Drawbacks |
|---|---|---|
| API Discovery | Sites with clean APIs | Requires investigation time |
| Headless Services | Occasional dynamic scraping | Per-request costs |
| Community Nodes | High-volume dynamic scraping | Requires self-hosting, more setup |
| Code + Puppeteer | Maximum control | Requires programming knowledge |
Start with API discovery. It’s faster, cheaper, and more reliable. Only move to headless rendering when no API exists.
Bypassing Anti-Bot Protection
Websites invest in blocking automated access. Understanding these protections helps you work around them ethically.
Why Sites Block Scrapers
Legitimate reasons exist:
- Prevent server overload from aggressive crawling
- Protect against content theft and competitive scraping
- Reduce costs from serving non-human traffic
- Enforce terms of service
When scraping, respect the site’s resources. Slow down requests. Don’t scrape content you don’t have rights to use. Check if the site offers an official API.
Headers and User-Agent Configuration
The simplest protection checks User-Agent headers. Requests without browser-like headers get blocked immediately.
Configure realistic headers in HTTP Request:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Rotate User-Agent strings occasionally to avoid fingerprinting. Maintain a list of current browser User-Agents and select randomly.
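One way to do that, as a sketch, is a Code node that picks a User-Agent at random and passes it downstream; the HTTP Request node's User-Agent header can then read it with an expression such as {{ $json.userAgent }}. The strings below are examples only; keep your own list current:
// Sketch: pick a random User-Agent and attach it to each item
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0'
];
const userAgent = userAgents[Math.floor(Math.random() * userAgents.length)];
return $input.all().map(item => ({
  json: { ...item.json, userAgent }
}));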
Rate Limiting and Delays
Aggressive request patterns trigger rate limiting faster than anything else.
Best practices:
- Add Wait nodes between requests (1-5 seconds minimum)
- Randomize delays slightly to appear more human (a sketch follows this list)
- Respect Retry-After headers when rate limited
- Use batch processing to spread requests over time
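A minimal sketch for randomized delays: a Code node computes a delay between 1 and 5 seconds, and a Wait node after it reads the value with an expression such as {{ $json.delaySeconds }} in its Wait Amount field:
// Sketch: compute a random 1-5 second delay for a following Wait node
const delaySeconds = Math.round((1 + Math.random() * 4) * 10) / 10;
return $input.all().map(item => ({
  json: { ...item.json, delaySeconds }
}));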
For rate limiting patterns, see our batch processing guide.
Proxy Rotation
Some sites block by IP address. Proxy rotation sends each request from a different IP.
Proxy options:
| Type | Use Case | Considerations |
|---|---|---|
| Datacenter proxies | Basic protection bypass | Cheap but often detected |
| Residential proxies | Moderate protection | More expensive, higher success |
| Mobile proxies | Strong protection | Most expensive, highest success |
Configure proxies in HTTP Request under Options → Proxy.
Many headless browser services include proxy rotation in their API, simplifying setup.
When to Use Managed Services
For sites with strong protection (Cloudflare, PerimeterX, DataDome), managed scraping services are often the only practical solution.
These services:
- Maintain proxy pools
- Handle CAPTCHA solving
- Manage browser fingerprints
- Rotate all detectable parameters
The per-request cost is justified when:
- You’re scraping high-protection sites
- Volume is low enough that costs remain reasonable
- Building and maintaining your own infrastructure isn’t worth the effort
Respecting robots.txt
Check robots.txt before scraping any site. This file at /robots.txt declares which paths crawlers should avoid.
While robots.txt isn’t legally binding, respecting it:
- Demonstrates good faith
- Reduces chance of IP blocks
- Avoids scraping pages that might cause issues
Read more about the standard at robotstxt.org.
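If you want a quick check inside a workflow, the rough sketch below parses Disallow rules after an HTTP Request node fetches /robots.txt. It deliberately ignores user-agent groups, wildcards, and Allow rules, so treat it as a first-pass signal rather than a complete parser; the data field and the path are assumptions to adjust:
// Naive sketch: flag a path that matches any Disallow rule in robots.txt
const robotsTxt = $input.first().json.data || '';
const pathToCheck = '/products'; // placeholder: path you plan to scrape
const disallowed = robotsTxt
  .split('\n')
  .filter(line => line.toLowerCase().startsWith('disallow:'))
  .map(line => line.split(':')[1].trim())
  .filter(rule => rule.length > 0);
const blocked = disallowed.some(rule => pathToCheck.startsWith(rule));
return [{ json: { pathToCheck, blocked, disallowed } }];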
AI-Powered Web Scraping
Traditional scraping relies on CSS selectors. When page structure is unpredictable or constantly changing, AI extraction offers an alternative.
When AI Beats CSS Selectors
| Scenario | CSS Selectors | AI Extraction |
|---|---|---|
| Consistent page structure | Better | Overkill |
| Structure varies by page | Fragile | Robust |
| Dynamic class names | Requires maintenance | Handles automatically |
| Unstructured text content | Can’t parse | Natural language capable |
| Adding new sites frequently | One selector set per site | One prompt handles all |
AI extraction shines when you’re scraping many different sites or when the site structure changes frequently.
Integration Pattern
Combine HTTP Request with AI nodes:
HTTP Request → HTML (extract raw content) → AI Node → Process
Example prompt for product extraction:
Extract the following information from this product page HTML:
- Product name
- Price (as a number without currency symbol)
- Description (first paragraph only)
- Availability status
Return as JSON:
{
"name": "...",
"price": 29.99,
"description": "...",
"inStock": true
}
HTML:
{{ $json.data }}
The AI parses the content and returns structured data regardless of the specific HTML structure.
Cost Considerations
AI extraction costs more than CSS selectors:
- Each extraction requires an API call to your AI provider
- Token usage scales with HTML size
- Latency is higher than local parsing
Optimize costs:
- Strip unnecessary HTML before sending (remove scripts, styles, navigation); see the sketch after this list
- Use faster/cheaper models for simple extractions
- Batch multiple extractions when possible
- Cache results to avoid re-processing identical pages
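A minimal sketch of the first point: a Code node that strips scripts, styles, navigation, and comments with regular expressions before the HTML reaches the AI node. Regex-based stripping is crude but usually enough to cut token usage noticeably; the data field name is an assumption:
// Sketch: crude HTML slimming before sending content to an AI node
const html = $input.first().json.data || '';
const slimmedHtml = html
  .replace(/<script[\s\S]*?<\/script>/gi, '') // drop scripts
  .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop styles
  .replace(/<nav[\s\S]*?<\/nav>/gi, '')       // drop navigation
  .replace(/<!--[\s\S]*?-->/g, '')            // drop HTML comments
  .replace(/\s+/g, ' ')                       // collapse whitespace
  .trim();
return [{ json: { slimmedHtml, originalLength: html.length, slimmedLength: slimmedHtml.length } }];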
For more on AI integration patterns, see our AI Agent guide.
Hybrid Approach
Combine CSS selectors for consistent elements and AI for variable content:
// In Code node after HTML extraction
const structured = {
// CSS-extracted (reliable structure)
title: $json.title,
price: $json.price,
// AI will extract from remaining HTML
rawDescription: $json.descriptionHtml
};
return [{ json: structured }];
Then send rawDescription to an AI node for intelligent parsing while keeping reliable CSS extraction for stable elements.
Data Cleaning and Transformation
Raw scraped data is messy. Websites weren’t designed for machine consumption. Cleaning transforms raw extractions into production-quality data.
Common Data Issues
| Issue | Example | Solution |
|---|---|---|
| Whitespace | " Product Name " | .trim() |
| Currency symbols | "$29.99" | Regex replacement |
| Inconsistent formats | "Dec 24, 2025" vs "2025-12-24" | Date parsing |
| HTML entities | "Tom &amp; Jerry" | Entity decoding |
| Relative URLs | "/product/123" | Prepend base URL |
| Missing values | null or undefined | Default values |
Code Node Cleanup Patterns
const items = $input.all();
return items.map(item => {
const raw = item.json;
return {
json: {
// Clean text
name: (raw.name || '').trim(),
// Parse price
price: parseFloat((raw.price || '0').replace(/[^0-9.]/g, '')),
// Parse date
date: DateTime.fromFormat(raw.dateText, 'MMM d, yyyy').toISO(),
// Make URL absolute
url: raw.url?.startsWith('http')
? raw.url
: `https://example.com${raw.url}`,
// Default for missing values
category: raw.category || 'Uncategorized',
// Decode HTML entities
description: raw.description
?.replace(/&amp;/g, '&')
?.replace(/&lt;/g, '<')
?.replace(/&gt;/g, '>'),
// Add metadata
scrapedAt: new Date().toISOString(),
source: 'website-scraper'
}
};
});
Deduplication
When scraping regularly, avoid storing duplicates. Use the Remove Duplicates node or implement in Code:
const items = $input.all();
const seen = new Set();
const unique = [];
for (const item of items) {
const key = item.json.url; // Or product ID, etc.
if (!seen.has(key)) {
seen.add(key);
unique.push(item);
}
}
return unique;
Validation
Filter out invalid records before they reach your database:
const items = $input.all();
return items.filter(item => {
const data = item.json;
// Must have name and valid price
if (!data.name || data.name.length < 2) return false;
if (isNaN(data.price) || data.price <= 0) return false;
// Must have valid URL
if (!data.url?.startsWith('http')) return false;
return true;
});
For more transformation patterns, see our comprehensive data transformation guide.
Production-Ready Scraping Workflows
Theory is useful. Real-world examples show how these concepts combine.
Example 1: E-commerce Price Monitoring
Use case: Track competitor prices daily and alert when they change significantly.
Workflow structure:
Schedule Trigger (daily)
→ Loop (for each competitor URL)
→ HTTP Request
→ HTML Extract (price, availability)
→ Code (clean data)
→ Aggregate
→ Compare Datasets (vs yesterday)
→ Filter (significant changes only)
→ Slack/Email notification
→ Database update
Key configuration:
- Schedule for off-peak hours (early morning)
- 2-second delays between requests per domain
- Compare with previous day’s data using Compare Datasets node
- Alert only when price changes exceed a threshold (e.g., 5%); a sketch follows this list
- Store historical data for trend analysis
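The threshold check itself can live in a small Code node. In the sketch below, oldPrice and newPrice are assumed field names coming out of the comparison step; rename them to match your workflow:
// Sketch: keep only items whose price moved by more than 5%
const threshold = 0.05;
return $input.all().filter(item => {
  const { oldPrice, newPrice } = item.json;
  if (!oldPrice || !newPrice) return false;
  return Math.abs(newPrice - oldPrice) / oldPrice > threshold;
});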
Common pitfalls:
- Price changes during sales events aren’t meaningful alerts
- Currency/format changes can break parsing
- Product availability should be tracked alongside price
Example 2: Lead Generation from Business Directories
Use case: Build prospect lists by scraping business directories.
Workflow structure:
Schedule Trigger (weekly)
→ HTTP Request (search results page)
→ HTML Extract (business links)
→ Loop (for each business)
→ HTTP Request (business detail page)
→ HTML Extract (name, phone, email, address)
→ Code (validate & clean)
→ Aggregate
→ Remove Duplicates
→ Filter (has email OR has phone)
→ CRM/Spreadsheet update
Key configuration:
- Respect rate limits strictly on directories (they block aggressively)
- Validate email format before storing (see the sketch after this list)
- Deduplicate against existing contacts
- Add source attribution for compliance
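For the email check, a hedged sketch is shown below. It blanks out malformed addresses rather than dropping the lead, since the workflow keeps contacts that have either an email or a phone; the pattern is deliberately loose and only catches obviously invalid values:
// Sketch: normalize emails and blank out obviously malformed ones
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
return $input.all().map(item => {
  const email = (item.json.email || '').trim().toLowerCase();
  return {
    json: { ...item.json, email: emailPattern.test(email) ? email : null }
  };
});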
Common pitfalls:
- Many directories block scrapers aggressively
- Contact information may be outdated
- Some directories have terms prohibiting scraping
- Always verify leads before outreach
Example 3: Content Aggregation with AI Summary
Use case: Monitor industry news sources and create a daily digest.
Workflow structure:
Schedule Trigger (daily)
→ Loop (for each news source)
→ HTTP Request
→ HTML Extract (article links, titles, dates)
→ Aggregate all articles
→ Filter (today's articles only)
→ Loop (for each article)
→ HTTP Request (full article)
→ HTML Extract (content)
→ AI Summarization (3-sentence summary)
→ Aggregate
→ Generate HTML Template (digest format)
→ Send Email
Key configuration:
- Extract publication dates to filter recent content only (a sketch follows this list)
- AI prompt: “Summarize in 3 sentences for a business audience”
- Template for readable email digest
- Include source links for full articles
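One sketch of the date filter, using the Luxon DateTime object available in Code nodes; publishedAt is an assumed ISO-formatted field produced by the extraction step:
// Sketch: keep only articles published today (instance timezone)
const today = DateTime.now().toISODate();
return $input.all().filter(item => {
  const published = DateTime.fromISO(item.json.publishedAt || '');
  return published.isValid && published.toISODate() === today;
});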
Common pitfalls:
- Paywalled content won’t be accessible
- AI costs accumulate with many articles
- Summarization may miss nuance
For workflow architecture patterns, see our workflow best practices guide.
Troubleshooting Common Issues
When scraping fails, systematic debugging finds the problem.
Quick Diagnosis Table
| Symptom | Likely Cause | Solution |
|---|---|---|
| Empty extraction results | JavaScript-rendered content | Check page source vs. browser view |
| 403 Forbidden | Missing headers or IP blocked | Add browser headers, use proxy |
| 429 Too Many Requests | Rate limit exceeded | Add delays, reduce frequency |
| Selector returns wrong data | Selector too generic | Make selector more specific |
| Inconsistent results | Dynamic page elements | Use more stable selectors |
| Timeout errors | Page too large or slow | Increase timeout, optimize request |
| Encoding issues | Character set mismatch | Specify encoding in request |
Debugging HTTP Request Failures
Step 1: Enable “Full Response” in Options to see status codes and headers.
Step 2: Check the response body for error messages. Many sites return helpful error descriptions.
Step 3: Test the same URL in your browser. If it works there but not in n8n, the issue is likely headers or bot detection.
Step 4: Compare your request headers with what the browser sends (visible in DevTools Network tab).
Debugging HTML Extraction Failures
Step 1: Pin the HTTP Request output and verify HTML contains your target data.
Step 2: Test your selector in the browser console: document.querySelectorAll('your-selector')
Step 3: If it works in browser but not n8n, the content is probably JavaScript-rendered.
Step 4: Check for dynamic class names that change between page loads.
Debugging Data Transformation
Step 1: Add console.log() statements in Code nodes:
const items = $input.all();
console.log('Input count:', items.length);
console.log('First item:', JSON.stringify(items[0], null, 2));
Check browser console (F12) for output.
Step 2: Process one item at a time before handling arrays.
Step 3: Use try-catch to identify which transformation fails:
const items = $input.all();
return items.map(item => {
  try {
    // Your transformation goes here
    return item;
  } catch (error) {
    console.log('Error:', error.message);
    return { json: { error: error.message, input: item.json } };
  }
});
For complex debugging scenarios, use our workflow debugger tool.
Best Practices for Reliable Scraping
Production scrapers need monitoring, error handling, and maintenance strategies.
Monitor for Structural Changes
Websites change their HTML without warning. When your selectors stop matching, extraction fails silently.
Detection strategies:
- Log extraction counts and alert when they drop to zero (a sketch follows this list)
- Validate extracted data format (prices should be numbers, etc.)
- Run a validation workflow that checks sample extractions daily
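A minimal sketch of the first strategy: a Code node placed after extraction that throws when the expected field is empty, so the workflow fails loudly and any Error Trigger workflow can alert you. The productNames key is an assumption borrowed from the earlier example; use whichever key your HTML node returns:
// Sketch: fail the workflow when extraction unexpectedly returns nothing
const items = $input.all();
const extractedCount = items.reduce(
  (total, item) => total + (item.json.productNames || []).length,
  0
);
if (extractedCount === 0) {
  throw new Error('Extraction returned 0 products - selectors may have changed');
}
return items;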
Implement Error Handling
Use n8n’s error handling features:
- Enable Continue On Fail on HTTP Request nodes
- Add Error Trigger workflows for critical scrapers
- Log failures with full context for debugging
- Send alerts for repeated failures
Log Everything
Maintain records of:
- When each scrape ran
- How many items were extracted
- Any errors encountered
- Data samples for comparison
This historical data helps diagnose issues and prove compliance.
Legal and Ethical Considerations
Web scraping legality varies by jurisdiction and use case. This isn’t legal advice, but general principles:
- Public data is generally accessible
- Terms of Service may restrict scraping
- Rate limiting shows good faith
- Personal data has special restrictions (GDPR, CCPA)
- Copyright applies to content regardless of accessibility
When in doubt, consult legal counsel. Many businesses offer APIs or data partnerships as alternatives to scraping.
Frequently Asked Questions
How do I scrape JavaScript-rendered websites in n8n?
The standard HTTP Request node doesn’t execute JavaScript. You have three options:
1. Find the underlying API. Open browser DevTools (F12), go to Network tab, filter by “Fetch/XHR”, and look for JSON responses as the page loads. If you find the API endpoint, call it directly with HTTP Request. This is the fastest and most reliable approach.
2. Use a headless browser service. Services like Browserless, ScrapingBee, or Firecrawl render pages in real browsers and return the final HTML. Call their API with HTTP Request, then parse the returned HTML with the HTML node.
3. Install community nodes. Puppeteer or Playwright community nodes provide direct browser control. This requires self-hosted n8n with community nodes enabled.
Start with API discovery. It works in the majority of cases and eliminates the complexity and cost of browser rendering.
Can n8n bypass Cloudflare protection?
Native n8n nodes cannot bypass aggressive Cloudflare protection. Cloudflare’s challenge pages require JavaScript execution and sophisticated browser fingerprinting that HTTP Request doesn’t support.
Your options:
- Check for APIs that might not have the same protection
- Use managed scraping services (ScrapingBee, ScraperAPI) that specialize in bypass
- Residential proxies with headless browser services increase success rates
- Contact the site to request API access or whitelisting
Some Cloudflare-protected sites allow access with proper headers and reasonable request rates. Others block everything except known browsers. Test your specific target site.
What’s the difference between using HTTP Request + HTML node versus third-party scraping services?
HTTP Request + HTML node (native approach):
| Pros | Cons |
|---|---|
| Free (no per-request costs) | Can’t handle JavaScript rendering |
| Fast (no external API calls) | No built-in proxy rotation |
| Full control | You manage headers and rate limiting |
| Works offline | No anti-bot bypass |
Third-party scraping services:
| Pros | Cons |
|---|---|
| Handle JavaScript rendering | Per-request costs |
| Built-in proxy rotation | External dependency |
| Anti-bot bypass | Slower (network hop) |
| Managed infrastructure | Less control |
Use native nodes when: Sites serve static HTML, you control request volume, and cost matters.
Use services when: Sites use JavaScript rendering, have anti-bot protection, or you need high reliability without maintaining infrastructure.
How do I handle dynamic class names that change on every page load?
Modern frameworks (React, Next.js, Vue) often generate class names with random suffixes like ProductCard_title__x7Yz9. These change when the site rebuilds.
Solutions:
1. Use partial class matching:
[class^="ProductCard_title"]
[class*="ProductCard"]
2. Find stable identifiers:
- data-testid attributes added for testing
- id attributes
- Semantic HTML elements (article, main, section)
- Schema.org markup (itemtype, itemprop)
3. Use structural selectors:
.product-grid > div:first-child h2
main > section > article p
4. Extract parent element and parse with Code node: Get the outer HTML and use JavaScript string methods or regex to extract data regardless of class names.
5. Use AI extraction: When structure is truly unpredictable, AI can parse content without relying on specific selectors.
Monitor your scrapers regularly. Even stable selectors break when sites redesign.
Is web scraping legal?
Web scraping legality depends on jurisdiction, what you scrape, and how you use it. This is general information, not legal advice.
Generally permissible:
- Scraping publicly accessible data
- Personal use and research
- Aggregating facts (not copyrighted expression)
Potentially problematic:
- Violating Terms of Service (breach of contract)
- Scraping personal data without consent (GDPR, CCPA)
- Copying copyrighted content (copyright infringement)
- Causing server damage through aggressive crawling (computer fraud laws)
- Circumventing access controls (CFAA in US)
Best practices:
- Check robots.txt and Terms of Service
- Use reasonable request rates
- Don’t scrape personal data without consent
- Don’t republish copyrighted content
- Consider official APIs or data partnerships
When scraping is business-critical, consult legal counsel familiar with your jurisdiction and use case.
When to Get Professional Help
Some scraping projects require specialized expertise:
- High-volume enterprise scraping with strict reliability requirements
- Complex anti-bot bypass for protected sites
- Data pipeline architecture connecting scraped data to multiple systems
- Compliance-sensitive scraping requiring legal review
- AI-powered extraction for unstructured content at scale
Our workflow development services build production-ready scraping solutions tailored to your requirements. For architectural guidance on complex automation projects, explore our n8n consulting services.