n8n Web Scraping: Extract, Clean, and Process Website Data
Logic Workflow Team

Tags: n8n, web scraping, data extraction, HTML, automation, tutorial, API

The data you need is sitting right there on websites, but getting it into your systems feels impossible.

Competitor prices update daily. Lead lists live in business directories. Industry news scatters across dozens of sources. Product catalogs change without warning. And you’re stuck copying and pasting like it’s 1999.

The Manual Data Trap

Every hour spent copying website data into spreadsheets is an hour wasted. Manual collection introduces errors: typos, missed updates, inconsistent formatting.

But the real cost isn’t time.

It’s the decisions you can’t make because you don’t have current data. By the time you manually check competitor prices, they’ve already changed. By the time you copy that lead list, half the contacts are stale.

The alternatives aren’t great either:

  • Hiring developers for custom scrapers costs thousands
  • Third-party scraping services charge per request
  • Most “no-code” tools lock you into their platforms

Here’s the truth: Web scraping automation captures data when it changes, formats it consistently, and delivers it where you need it. No human in the loop means no delays, no errors, and no limits on scale.

Why n8n for Web Scraping

Traditional web scraping requires programming. Python with BeautifulSoup. Node.js with Puppeteer. Code-heavy approaches that need developers to build and maintain.

n8n offers something different:

  • Visual workflow building shows exactly how data flows
  • Native HTTP and HTML nodes handle most scraping without external services
  • Community nodes add headless browser support when needed
  • Self-hosting option means no per-request costs at any scale
  • 400+ integrations route scraped data anywhere

Whether you’re monitoring prices, generating leads, or aggregating content, n8n provides the building blocks for the entire pipeline.

What You’ll Learn

This guide covers everything for production-ready scraping:

  • Core scraping patterns using HTTP Request and HTML nodes
  • Static website extraction with CSS selectors and pagination handling
  • Dynamic content strategies for JavaScript-rendered pages
  • Anti-bot protection bypass techniques that actually work
  • AI-powered extraction when traditional selectors fail
  • Data cleaning workflows for production-quality output
  • Complete real-world examples you can adapt immediately
  • Troubleshooting techniques for common scraping failures

By the end, you’ll know exactly which approach to use for any website and how to build reliable, maintainable scraping automations.

Web Scraping Fundamentals in n8n

Every n8n scraping workflow follows a predictable pattern. Understanding this foundation makes building new scrapers straightforward.

The Three-Node Pattern

Most scraping workflows chain three core components behind a trigger:

Trigger → HTTP Request → HTML Extract → Process/Store

Trigger starts the workflow. Schedule triggers run on a timer. Webhook triggers respond to external events. Manual triggers work for testing.

HTTP Request fetches the webpage. It downloads the raw HTML just like your browser does, but without rendering JavaScript or loading images.

HTML Extract parses the HTML and pulls out specific data using CSS selectors. It transforms messy HTML into clean JSON you can work with.

Everything after extraction is processing: cleaning data, transforming formats, filtering duplicates, and routing to destinations like databases, spreadsheets, or APIs.

Static vs. Dynamic Content

This distinction determines which approach you’ll use:

Content Type | How to Identify | Approach
Static HTML | View Page Source (Ctrl+U) shows the data | HTTP Request + HTML node
JavaScript-rendered | Data appears in browser but not in page source | Headless browser or API discovery
API-backed | Network tab shows JSON requests | Call the API directly

To check a website:

  1. Open the page in your browser
  2. Press Ctrl+U (or Cmd+Option+U on Mac) to view source
  3. Search for the data you want to extract
  4. If it’s there, use the standard approach
  5. If it’s not, the site renders content with JavaScript

Most websites still serve static HTML for their main content. JavaScript rendering is more common on web applications (dashboards, SPAs) than on content sites, e-commerce, or directories.

Node Combinations for Different Scenarios

Scenario | Nodes | Notes
Simple page scrape | HTTP Request → HTML | Works for most static sites
Multiple pages | Schedule → Loop → HTTP → HTML → Aggregate | Add delays between requests
Login required | HTTP Request (auth) → HTTP Request (page) → HTML | Manage session cookies
JavaScript site | HTTP to headless service → HTML | Use Browserless, ScrapingBee, etc.
API available | HTTP Request (API endpoint) | Skip HTML parsing entirely
AI extraction | HTTP Request → AI node | When structure is unpredictable

For detailed configuration of these nodes, see our HTTP Request node guide and HTML node documentation.

Static Website Scraping

Let’s build a complete scraping workflow step by step. We’ll scrape product information from a catalog page.

Step 1: Fetch the Page

Add an HTTP Request node:

  1. Set Method to GET
  2. Enter your target URL
  3. Click Test step to verify it returns HTML

The response contains the raw HTML in the data field.

Pro tip: Add request headers to appear more like a real browser:

Header | Value
User-Agent | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Accept | text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language | en-US,en;q=0.5

These headers prevent many basic bot-detection systems from blocking your requests.

Step 2: Extract Data with CSS Selectors

Add an HTML node after HTTP Request:

  1. Set Operation to “Extract HTML Content”
  2. Set Source Data to JSON
  3. Set JSON Property to data (or whatever field contains your HTML)
  4. Add extraction values for each piece of data you need

Example extraction configuration:

Key | CSS Selector | Return Value | Return Array
productNames | .product-card h2 | Text | Yes
prices | .product-card .price | Text | Yes
links | .product-card a | Attribute (href) | Yes
images | .product-card img | Attribute (src) | Yes

Enable “Return Array” when extracting multiple items of the same type. Without it, you only get the first match.

Step 3: Combine Parallel Arrays

The HTML node returns parallel arrays. To create structured product objects, add a Code node:

const items = $input.all();
const data = items[0].json;

// Combine parallel arrays into product objects
const products = data.productNames.map((name, index) => ({
  json: {
    name: name,
    price: data.prices[index],
    url: data.links[index],
    image: data.images[index]
  }
}));

return products;

Each product becomes a separate n8n item, ready for further processing or storage.

Step 4: Handle Pagination

Most sites split results across multiple pages. Handle pagination with a loop:

Option 1: Known page count

Use the Loop node with a fixed iteration count, building URLs like /products?page=1, /products?page=2, etc.
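For a known page count, a Code node can emit one item per page URL for the loop to feed into HTTP Request. Here's a minimal sketch; the base URL and page count are illustrative placeholders:

// Generate one item per page for the loop to process
const baseUrl = 'https://example.com/products'; // illustrative base URL
const pageCount = 10; // set to the known number of pages

const pages = [];
for (let page = 1; page <= pageCount; page++) {
  pages.push({ json: { url: `${baseUrl}?page=${page}` } });
}

return pages;

Each item carries a url field that the HTTP Request node can reference with an expression such as {{ $json.url }}.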

Option 2: Follow “Next” links

Extract the next page URL from each page and loop until no more pages exist.

Option 3: HTTP Request pagination

The HTTP Request node has built-in pagination support. Configure it under Options → Pagination to automatically follow page links.

Important: Add delays between requests using the Wait node. Hammering a server with rapid requests gets you blocked. Start with 1-2 seconds between requests and adjust based on the site’s response.

For large-scale scraping with rate limiting, see our API rate limits guide.

Step 5: Clean and Store Data

Before storing, clean the extracted data. Common cleanup tasks:

const items = $input.all();

return items.map(item => {
  const data = item.json;

  return {
    json: {
      // Remove currency symbols, convert to number
      price: parseFloat(data.price.replace(/[^0-9.]/g, '')),

      // Clean whitespace
      name: data.name.trim(),

      // Make URLs absolute
      url: data.url.startsWith('http')
        ? data.url
        : `https://example.com${data.url}`,

      // Add metadata
      scrapedAt: new Date().toISOString()
    }
  };
});

Then route to your destination: Google Sheets, Airtable, PostgreSQL, or any of n8n’s 400+ integrations.

For complex data transformation patterns, see our data transformation guide.

Dynamic Content and JavaScript-Rendered Sites

When HTTP Request returns HTML without your target data, you’re dealing with a JavaScript-rendered site. The content loads after initial page load through JavaScript execution.

Why Standard HTTP Fails

The HTTP Request node downloads raw HTML. It doesn’t:

  • Execute JavaScript
  • Wait for AJAX requests to complete
  • Render the page visually

Modern web applications (React, Vue, Angular) often render content entirely through JavaScript. The initial HTML is just a shell that says “load this JavaScript file,” and the actual content appears only after that JavaScript runs.

Solution 1: Find the Underlying API

Many JavaScript sites fetch data from APIs. You can call these APIs directly, bypassing JavaScript rendering entirely.

How to find the API:

  1. Open the page in Chrome
  2. Press F12 to open DevTools
  3. Go to the Network tab
  4. Filter by “Fetch/XHR”
  5. Refresh the page and interact with it
  6. Look for JSON responses containing your target data

Once you find the API endpoint, use HTTP Request to call it directly. This is faster and more reliable than any rendering approach.

Example: A product page might fetch data from https://api.example.com/products/123. Call that endpoint directly instead of scraping the HTML.
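Once you've found the endpoint, a short Code node can split the JSON response into individual n8n items. A minimal sketch, assuming the response exposes a products array (the field names are assumptions to adapt):

// After an HTTP Request to the discovered API endpoint,
// split the JSON payload into one n8n item per product
const response = $input.first().json;
const products = response.products || []; // field name is an assumption

return products.map(product => ({
  json: {
    name: product.name,
    price: product.price,
    url: product.url
  }
}));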

Solution 2: Headless Browser Services

When no API exists, use a headless browser service. These services run a real browser, execute JavaScript, and return the fully-rendered HTML.

Popular options:

Service | How It Works | Pricing Model
Browserless | Hosted Puppeteer/Playwright | Per-minute or flat rate
ScrapingBee | Managed scraping API | Per-request
Firecrawl | AI-powered extraction | Per-request
Apify | Full scraping platform | Per-compute-unit

Integration pattern:

  1. Use HTTP Request to call the service’s API
  2. Send your target URL in the request
  3. Receive rendered HTML in the response
  4. Parse with the HTML node as normal

// Example ScrapingBee request body
{
  "url": "https://javascript-heavy-site.com/page",
  "render_js": true,
  "wait": 2000
}

The service handles JavaScript execution, and you get back HTML that contains all the data.

Solution 3: Community Nodes

The n8n community has built nodes for browser automation:

  • Puppeteer nodes provide direct headless Chrome control
  • Playwright nodes offer cross-browser automation

Search the n8n community nodes for current options. Installation requires self-hosted n8n with community nodes enabled.

Comparison: Which Approach to Use

Approach | Best For | Drawbacks
API Discovery | Sites with clean APIs | Requires investigation time
Headless Services | Occasional dynamic scraping | Per-request costs
Community Nodes | High-volume dynamic scraping | Requires self-hosting, more setup
Code + Puppeteer | Maximum control | Requires programming knowledge

Start with API discovery. It’s faster, cheaper, and more reliable. Only move to headless rendering when no API exists.

Bypassing Anti-Bot Protection

Websites invest in blocking automated access. Understanding these protections helps you work around them ethically.

Why Sites Block Scrapers

Legitimate reasons exist:

  • Prevent server overload from aggressive crawling
  • Protect against content theft and competitive scraping
  • Reduce costs from serving non-human traffic
  • Enforce terms of service

When scraping, respect the site’s resources. Slow down requests. Don’t scrape content you don’t have rights to use. Check if the site offers an official API.

Headers and User-Agent Configuration

The simplest protection checks User-Agent headers. Requests without browser-like headers get blocked immediately.

Configure realistic headers in HTTP Request:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Connection: keep-alive

Rotate User-Agent strings occasionally to avoid fingerprinting. Maintain a list of current browser User-Agents and select randomly.
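One way to rotate User-Agents is to pick one at random in a Code node and reference it from the HTTP Request header. A minimal sketch; the list and the header expression are assumptions you'd adapt to your workflow:

// Pick a random User-Agent per execution; reference it from the HTTP Request
// node's User-Agent header with an expression such as {{ $json.userAgent }}
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];

const userAgent = userAgents[Math.floor(Math.random() * userAgents.length)];

return $input.all().map(item => ({
  json: { ...item.json, userAgent }
}));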

Rate Limiting and Delays

Aggressive request patterns trigger rate limiting faster than anything else.

Best practices:

  1. Add Wait nodes between requests (1-5 seconds minimum)
  2. Randomize delays slightly to appear more human (see the sketch after this list)
  3. Respect Retry-After headers when rate limited
  4. Use batch processing to spread requests over time
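To randomize delays (step 2 above), one approach is to compute the wait in a Code node and let the Wait node read it, assuming your Wait node's amount field accepts an expression such as {{ $json.waitSeconds }}. A minimal sketch with illustrative values:

const items = $input.all();

// Base delay of 2 seconds plus up to 3 seconds of jitter (values are illustrative)
const waitSeconds = Math.round((2 + Math.random() * 3) * 10) / 10;

return items.map(item => ({
  json: { ...item.json, waitSeconds }
}));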

For rate limiting patterns, see our batch processing guide.

Proxy Rotation

Some sites block by IP address. Proxy rotation sends each request from a different IP.

Proxy options:

Type | Use Case | Considerations
Datacenter proxies | Basic protection bypass | Cheap but often detected
Residential proxies | Moderate protection | More expensive, higher success
Mobile proxies | Strong protection | Most expensive, highest success

Configure proxies in HTTP Request under Options → Proxy.

Many headless browser services include proxy rotation in their API, simplifying setup.

When to Use Managed Services

For sites with strong protection (Cloudflare, PerimeterX, DataDome), managed scraping services are often the only practical solution.

These services:

  • Maintain proxy pools
  • Handle CAPTCHA solving
  • Manage browser fingerprints
  • Rotate all detectable parameters

The per-request cost is justified when:

  • You’re scraping high-protection sites
  • Volume is low enough that costs remain reasonable
  • Building and maintaining your own infrastructure isn’t worth the effort

Respecting robots.txt

Check robots.txt before scraping any site. This file at /robots.txt declares which paths crawlers should avoid.

While robots.txt isn’t legally binding, respecting it:

  • Demonstrates good faith
  • Reduces chance of IP blocks
  • Avoids scraping pages that might cause issues

Read more about the standard at robotstxt.org.

AI-Powered Web Scraping

Traditional scraping relies on CSS selectors. When page structure is unpredictable or constantly changing, AI extraction offers an alternative.

When AI Beats CSS Selectors

Scenario | CSS Selectors | AI Extraction
Consistent page structure | Better | Overkill
Structure varies by page | Fragile | Robust
Dynamic class names | Requires maintenance | Handles automatically
Unstructured text content | Can’t parse | Natural language capable
Scraping new sites frequently | Selector per site | One prompt handles all

AI extraction shines when you’re scraping many different sites or when the site structure changes frequently.

Integration Pattern

Combine HTTP Request with AI nodes:

HTTP Request → HTML (extract raw content) → AI Node → Process

Example prompt for product extraction:

Extract the following information from this product page HTML:
- Product name
- Price (as a number without currency symbol)
- Description (first paragraph only)
- Availability status

Return as JSON:
{
  "name": "...",
  "price": 29.99,
  "description": "...",
  "inStock": true
}

HTML:
{{ $json.data }}

The AI parses the content and returns structured data regardless of the specific HTML structure.

Cost Considerations

AI extraction costs more than CSS selectors:

  • Each extraction requires an API call to your AI provider
  • Token usage scales with HTML size
  • Latency is higher than local parsing

Optimize costs:

  1. Strip unnecessary HTML before sending (remove scripts, styles, navigation); see the sketch after this list
  2. Use faster/cheaper models for simple extractions
  3. Batch multiple extractions when possible
  4. Cache results to avoid re-processing identical pages
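For the first optimization, stripping scripts, styles, and navigation before the AI call, a Code node with a few regex passes goes a long way. A minimal sketch, assuming the raw HTML sits in a data field:

const items = $input.all();

return items.map(item => {
  let html = item.json.data || ''; // assumes the raw HTML is in the `data` field

  // Drop the bulkiest non-content blocks before sending to the model
  html = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<nav[\s\S]*?<\/nav>/gi, '')
    .replace(/<footer[\s\S]*?<\/footer>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '');

  return { json: { ...item.json, data: html } };
});

Regex-based stripping is crude but cheap; for heavily nested markup, a dedicated HTML-to-text step before the AI node works too.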

For more on AI integration patterns, see our AI Agent guide.

Hybrid Approach

Combine CSS selectors for consistent elements and AI for variable content:

// In Code node after HTML extraction
const structured = {
  // CSS-extracted (reliable structure)
  title: $json.title,
  price: $json.price,

  // AI will extract from remaining HTML
  rawDescription: $json.descriptionHtml
};

return [{ json: structured }];

Then send rawDescription to an AI node for intelligent parsing while keeping reliable CSS extraction for stable elements.

Data Cleaning and Transformation

Raw scraped data is messy. Websites weren’t designed for machine consumption. Cleaning transforms raw extractions into production-quality data.

Common Data Issues

Issue | Example | Solution
Whitespace | "  Product Name  " | .trim()
Currency symbols | "$29.99" | Regex replacement
Inconsistent formats | "Dec 24, 2025" vs "2025-12-24" | Date parsing
HTML entities | "Tom &amp; Jerry" | Entity decoding
Relative URLs | "/product/123" | Prepend base URL
Missing values | null or undefined | Default values

Code Node Cleanup Patterns

const items = $input.all();

return items.map(item => {
  const raw = item.json;

  return {
    json: {
      // Clean text
      name: (raw.name || '').trim(),

      // Parse price
      price: parseFloat((raw.price || '0').replace(/[^0-9.]/g, '')),

      // Parse date (DateTime comes from Luxon, which n8n exposes in Code nodes)
      date: DateTime.fromFormat(raw.dateText, 'MMM d, yyyy').toISO(),

      // Make URL absolute
      url: raw.url?.startsWith('http')
        ? raw.url
        : `https://example.com${raw.url}`,

      // Default for missing values
      category: raw.category || 'Uncategorized',

      // Decode HTML entities
      description: raw.description
        ?.replace(/&amp;/g, '&')
        ?.replace(/&lt;/g, '<')
        ?.replace(/&gt;/g, '>'),

      // Add metadata
      scrapedAt: new Date().toISOString(),
      source: 'website-scraper'
    }
  };
});

Deduplication

When scraping regularly, avoid storing duplicates. Use the Remove Duplicates node or implement in Code:

const items = $input.all();
const seen = new Set();
const unique = [];

for (const item of items) {
  const key = item.json.url; // Or product ID, etc.
  if (!seen.has(key)) {
    seen.add(key);
    unique.push(item);
  }
}

return unique;

Validation

Filter out invalid records before they reach your database:

const items = $input.all();

return items.filter(item => {
  const data = item.json;

  // Must have name and valid price
  if (!data.name || data.name.length < 2) return false;
  if (isNaN(data.price) || data.price <= 0) return false;

  // Must have valid URL
  if (!data.url?.startsWith('http')) return false;

  return true;
});

For more transformation patterns, see our comprehensive data transformation guide.

Production-Ready Scraping Workflows

Theory is useful. Real-world examples show how these concepts combine.

Example 1: E-commerce Price Monitoring

Use case: Track competitor prices daily and alert when they change significantly.

Workflow structure:

Schedule Trigger (daily)
  → Loop (for each competitor URL)
    → HTTP Request
    → HTML Extract (price, availability)
    → Code (clean data)
  → Aggregate
→ Compare Datasets (vs yesterday)
→ Filter (significant changes only)
→ Slack/Email notification
→ Database update

Key configuration:

  • Schedule for off-peak hours (early morning)
  • 2-second delays between requests per domain
  • Compare with previous day’s data using Compare Datasets node
  • Alert only when price changes exceed threshold (e.g., 5%); see the sketch after this list
  • Store historical data for trend analysis
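The threshold check can live in a Code node after Compare Datasets. A minimal sketch, assuming each item carries oldPrice and newPrice fields (the field names and 5% threshold are illustrative):

const items = $input.all();
const THRESHOLD = 0.05; // 5% change; adjust to taste

return items.filter(item => {
  const { oldPrice, newPrice } = item.json; // illustrative field names
  if (!oldPrice || !newPrice) return false;

  // Keep only items whose relative price change exceeds the threshold
  const change = Math.abs(newPrice - oldPrice) / oldPrice;
  return change >= THRESHOLD;
});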

Common pitfalls:

  • Price changes during sales events aren’t meaningful alerts
  • Currency/format changes can break parsing
  • Product availability should be tracked alongside price

Example 2: Lead Generation from Business Directories

Use case: Build prospect lists by scraping business directories.

Workflow structure:

Schedule Trigger (weekly)
  → HTTP Request (search results page)
  → HTML Extract (business links)
  → Loop (for each business)
    → HTTP Request (business detail page)
    → HTML Extract (name, phone, email, address)
    → Code (validate & clean)
  → Aggregate
→ Remove Duplicates
→ Filter (has email OR has phone)
→ CRM/Spreadsheet update

Key configuration:

  • Respect rate limits strictly on directories (they block aggressively)
  • Validate email format before storing (see the sketch after this list)
  • Deduplicate against existing contacts
  • Add source attribution for compliance
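Email validation can be a simple filter in a Code node. A minimal sketch using a basic pattern; it catches obvious junk rather than every RFC edge case:

const items = $input.all();
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

return items.filter(item => {
  // Assumes the extracted address sits in an `email` field
  const email = (item.json.email || '').trim().toLowerCase();
  return emailPattern.test(email);
});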

Common pitfalls:

  • Many directories block scrapers aggressively
  • Contact information may be outdated
  • Some directories have terms prohibiting scraping
  • Always verify leads before outreach

Example 3: Content Aggregation with AI Summary

Use case: Monitor industry news sources and create a daily digest.

Workflow structure:

Schedule Trigger (daily)
  → Loop (for each news source)
    → HTTP Request
    → HTML Extract (article links, titles, dates)
  → Aggregate all articles
→ Filter (today's articles only)
→ Loop (for each article)
  → HTTP Request (full article)
  → HTML Extract (content)
  → AI Summarization (3-sentence summary)
→ Aggregate
→ Generate HTML Template (digest format)
→ Send Email

Key configuration:

  • Extract publication dates to filter recent content only
  • AI prompt: “Summarize in 3 sentences for a business audience”
  • Template for readable email digest
  • Include source links for full articles

Common pitfalls:

  • Paywalled content won’t be accessible
  • AI costs accumulate with many articles
  • Summarization may miss nuance

For workflow architecture patterns, see our workflow best practices guide.

Troubleshooting Common Issues

When scraping fails, systematic debugging finds the problem.

Quick Diagnosis Table

Symptom | Likely Cause | Solution
Empty extraction results | JavaScript-rendered content | Check page source vs. browser view
403 Forbidden | Missing headers or IP blocked | Add browser headers, use proxy
429 Too Many Requests | Rate limit exceeded | Add delays, reduce frequency
Selector returns wrong data | Selector too generic | Make selector more specific
Inconsistent results | Dynamic page elements | Use more stable selectors
Timeout errors | Page too large or slow | Increase timeout, optimize request
Encoding issues | Character set mismatch | Specify encoding in request

Debugging HTTP Request Failures

Step 1: Enable “Full Response” in Options to see status codes and headers.

Step 2: Check the response body for error messages. Many sites return helpful error descriptions.

Step 3: Test the same URL in your browser. If it works there but not in n8n, the issue is likely headers or bot detection.

Step 4: Compare your request headers with what the browser sends (visible in DevTools Network tab).

Debugging HTML Extraction Failures

Step 1: Pin the HTTP Request output and verify HTML contains your target data.

Step 2: Test your selector in the browser console: document.querySelectorAll('your-selector')

Step 3: If it works in browser but not n8n, the content is probably JavaScript-rendered.

Step 4: Check for dynamic class names that change between page loads.

Debugging Data Transformation

Step 1: Add console.log() statements in Code nodes:

const items = $input.all();
console.log('Input count:', items.length);
console.log('First item:', JSON.stringify(items[0], null, 2));

Check browser console (F12) for output.

Step 2: Process one item at a time before handling arrays.

Step 3: Use try-catch to identify which transformation fails:

try {
  // Your transformation
} catch (error) {
  console.log('Error:', error.message);
  return [{ json: { error: error.message, input: item.json }}];
}

For complex debugging scenarios, use our workflow debugger tool.

Best Practices for Reliable Scraping

Production scrapers need monitoring, error handling, and maintenance strategies.

Monitor for Structural Changes

Websites change their HTML without warning. When your selectors stop matching, extraction fails silently.

Detection strategies:

  • Log extraction counts and alert when they drop to zero (see the sketch after this list)
  • Validate extracted data format (prices should be numbers, etc.)
  • Run a validation workflow that checks sample extractions daily
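The first strategy, alerting when extraction counts drop, can be a short Code node that fails the run when nothing comes back, so an Error Trigger workflow picks it up. A minimal sketch; the threshold is illustrative:

const items = $input.all();

// Minimum expected item count; tune to your normal extraction volume
const MIN_EXPECTED = 1;

if (items.length < MIN_EXPECTED) {
  // Throwing fails the execution, which an Error Trigger workflow can catch
  throw new Error(`Extraction returned ${items.length} items; expected at least ${MIN_EXPECTED}`);
}

return items;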

Implement Error Handling

Use n8n’s error handling features:

  1. Enable Continue On Fail on HTTP Request nodes
  2. Add Error Trigger workflows for critical scrapers
  3. Log failures with full context for debugging
  4. Send alerts for repeated failures

Log Everything

Maintain records of:

  • When each scrape ran
  • How many items were extracted
  • Any errors encountered
  • Data samples for comparison

This historical data helps diagnose issues and prove compliance.
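A Code node at the end of the workflow can assemble that run record before appending it to a log sheet or table. A minimal sketch; the field names are illustrative:

const items = $input.all();

return [{
  json: {
    // Illustrative run-summary record; adapt the fields to your log store
    runAt: new Date().toISOString(),
    itemCount: items.length,
    errorCount: items.filter(i => i.json.error).length,
    sample: items[0]?.json ?? null
  }
}];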

Legal and Ethical Considerations

Web scraping legality varies by jurisdiction and use case. This isn’t legal advice, but general principles:

  • Public data is generally accessible
  • Terms of Service may restrict scraping
  • Rate limiting shows good faith
  • Personal data has special restrictions (GDPR, CCPA)
  • Copyright applies to content regardless of accessibility

When in doubt, consult legal counsel. Many businesses offer APIs or data partnerships as alternatives to scraping.

Frequently Asked Questions

How do I scrape JavaScript-rendered websites in n8n?

The standard HTTP Request node doesn’t execute JavaScript. You have three options:

1. Find the underlying API. Open browser DevTools (F12), go to Network tab, filter by “Fetch/XHR”, and look for JSON responses as the page loads. If you find the API endpoint, call it directly with HTTP Request. This is the fastest and most reliable approach.

2. Use a headless browser service. Services like Browserless, ScrapingBee, or Firecrawl render pages in real browsers and return the final HTML. Call their API with HTTP Request, then parse the returned HTML with the HTML node.

3. Install community nodes. Puppeteer or Playwright community nodes provide direct browser control. This requires self-hosted n8n with community nodes enabled.

Start with API discovery. It works in the majority of cases and eliminates the complexity and cost of browser rendering.


Can n8n bypass Cloudflare protection?

Native n8n nodes cannot bypass aggressive Cloudflare protection. Cloudflare’s challenge pages require JavaScript execution and sophisticated browser fingerprinting that HTTP Request doesn’t support.

Your options:

  • Check for APIs that might not have the same protection
  • Use managed scraping services (ScrapingBee, ScraperAPI) that specialize in bypass
  • Residential proxies with headless browser services increase success rates
  • Contact the site to request API access or whitelisting

Some Cloudflare-protected sites allow access with proper headers and reasonable request rates. Others block everything except known browsers. Test your specific target site.


What’s the difference between using HTTP Request + HTML node versus third-party scraping services?

HTTP Request + HTML node (native approach):

Pros | Cons
Free (no per-request costs) | Can’t handle JavaScript rendering
Fast (no external API calls) | No built-in proxy rotation
Full control | You manage headers and rate limiting
Works offline | No anti-bot bypass

Third-party scraping services:

Pros | Cons
Handle JavaScript rendering | Per-request costs
Built-in proxy rotation | External dependency
Anti-bot bypass | Slower (network hop)
Managed infrastructure | Less control

Use native nodes when: Sites serve static HTML, you control request volume, and cost matters.

Use services when: Sites use JavaScript rendering, have anti-bot protection, or you need high reliability without maintaining infrastructure.


How do I handle dynamic class names that change on every page load?

Modern frameworks (React, Next.js, Vue) often generate class names with random suffixes like ProductCard_title__x7Yz9. These change when the site rebuilds.

Solutions:

1. Use partial class matching:

[class^="ProductCard_title"]
[class*="ProductCard"]

2. Find stable identifiers:

  • data-testid attributes added for testing
  • id attributes
  • Semantic HTML elements (article, main, section)
  • Schema.org markup (itemtype, itemprop)

3. Use structural selectors:

.product-grid > div:first-child h2
main > section > article p

4. Extract parent element and parse with Code node: Get the outer HTML and use JavaScript string methods or regex to extract data regardless of class names (see the sketch after this list).

5. Use AI extraction: When structure is truly unpredictable, AI can parse content without relying on specific selectors.
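For option 4, a Code node can pull a value straight out of the raw HTML with a regex, ignoring class names entirely. A minimal sketch, assuming the HTML sits in a data field and the target text lives inside an element whose class merely contains "price":

const html = $input.first().json.data || ''; // assumes raw HTML is in `data`

// Match the text content of any element whose class attribute contains "price"
const match = html.match(/class="[^"]*price[^"]*"[^>]*>([^<]+)</i);

return [{
  json: {
    priceText: match ? match[1].trim() : null
  }
}];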

Monitor your scrapers regularly. Even stable selectors break when sites redesign.


Is web scraping legal?

Web scraping legality depends on jurisdiction, what you scrape, and how you use it. This is general information, not legal advice.

Generally permissible:

  • Scraping publicly accessible data
  • Personal use and research
  • Aggregating facts (not copyrighted expression)

Potentially problematic:

  • Violating Terms of Service (breach of contract)
  • Scraping personal data without consent (GDPR, CCPA)
  • Copying copyrighted content (copyright infringement)
  • Causing server damage through aggressive crawling (computer fraud laws)
  • Circumventing access controls (CFAA in US)

Best practices:

  • Check robots.txt and Terms of Service
  • Use reasonable request rates
  • Don’t scrape personal data without consent
  • Don’t republish copyrighted content
  • Consider official APIs or data partnerships

When scraping is business-critical, consult legal counsel familiar with your jurisdiction and use case.

When to Get Professional Help

Some scraping projects require specialized expertise:

  • High-volume enterprise scraping with strict reliability requirements
  • Complex anti-bot bypass for protected sites
  • Data pipeline architecture connecting scraped data to multiple systems
  • Compliance-sensitive scraping requiring legal review
  • AI-powered extraction for unstructured content at scale

Our workflow development services build production-ready scraping solutions tailored to your requirements. For architectural guidance on complex automation projects, explore our n8n consulting services.
