The data you need is sitting right there on a webpage, but getting it into your workflow feels impossible. Product prices, contact information, article headlines, table data. You can see it in your browser, but extracting it programmatically has always required coding skills or expensive third-party tools.
The HTML node changes that. It transforms raw HTML into structured JSON data using CSS selectors, the same targeting system browsers use to style web pages. Combined with the HTTP Request node for fetching pages, you can scrape data from virtually any website without writing code.
The Web Scraping Challenge
Web scraping sounds simple until you try it:
- Websites structure their HTML differently, with no standard format
- The data you want is buried inside nested elements
- Class names and IDs vary between sites (and sometimes between page loads)
- Some content only appears after JavaScript executes
The HTML node handles the parsing side of this equation. Once you understand CSS selectors and the node’s configuration options, you can extract almost any visible text or attribute from a webpage.
What You’ll Learn
- How to use all three HTML node operations: extraction, template generation, and table conversion
- CSS selector fundamentals for targeting specific page elements
- Finding reliable selectors using browser developer tools
- Handling common extraction failures and edge cases
- Building maintainable scraping workflows that survive website changes
- Real-world examples for e-commerce, content aggregation, and report parsing
When to Use the HTML Node
The HTML node serves three distinct purposes. Understanding which operation you need prevents wasted time.
| Scenario | Operation | Why |
|---|---|---|
| Scrape product prices from a website | Extract HTML Content | CSS selectors target specific elements |
| Pull article titles and links from a news site | Extract HTML Content | Extract multiple values per page |
| Parse table data from an HTML report | Extract HTML Content | Tables are just nested HTML elements |
| Generate dynamic email templates | Generate HTML Template | Merge workflow data into HTML structure |
| Create HTML reports from JSON data | Convert to HTML Table | Automatically formats data as tables |
| Process HTML files from disk or email | Extract HTML Content | Works with binary HTML files too |
Rule of thumb: Use “Extract HTML Content” for web scraping and parsing. Use the other operations for generating HTML output from your workflow data.
When Not to Use the HTML Node
The HTML node has specific limitations:
| Limitation | Alternative |
|---|---|
| JavaScript-rendered content (SPAs) | Headless browser services or community nodes |
| Complex PDF parsing | Extract from File node or OCR services |
| API data (already JSON) | Process JSON directly with Edit Fields |
| Structured XML feeds | XML parsing nodes or Code node |
| Login-protected pages | HTTP Request with session cookies first |
Understanding the Three Operations
The HTML node offers three operations, each serving a different purpose.
Extract HTML Content
This is the primary web scraping operation. It parses HTML content and extracts specific data using CSS selectors.
Input: HTML content (from HTTP Request, file, or previous node)
Output: Extracted values as JSON properties
Use cases:
- Scraping prices, titles, descriptions from websites
- Extracting links and their text content
- Parsing structured data from HTML tables
- Pulling metadata from web pages
Generate HTML Template
This operation creates HTML output by merging workflow data into a template. You write HTML with n8n expressions embedded, and the node renders the final output.
Input: JSON data from previous nodes
Output: Rendered HTML string
Use cases:
- Creating dynamic email bodies
- Generating HTML reports
- Building HTML snippets for further processing
- Creating formatted output for webhooks
Quick example:
```html
<h1>Order Confirmation</h1>
<p>Hi {{ $json.customerName }},</p>
<p>Your order #{{ $json.orderId }} for {{ $json.itemCount }} items
totaling ${{ $json.total }} has been confirmed.</p>
```
The node replaces expressions with actual values from your workflow data.
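For example, given a hypothetical input item of `{ "customerName": "Ada", "orderId": 1042, "itemCount": 3, "total": 89.97 }`, the rendered output would be:
```html
<h1>Order Confirmation</h1>
<p>Hi Ada,</p>
<p>Your order #1042 for 3 items
totaling $89.97 has been confirmed.</p>
```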
Convert to HTML Table
This operation transforms JSON array data into an HTML table format automatically, without writing any HTML.
Input: JSON array (multiple items)
Output: HTML table string with headers based on JSON keys
Use cases:
- Converting spreadsheet-style data to HTML
- Creating simple HTML reports
- Formatting data for email or display
Quick example:
If your input is:
```json
[
  { "name": "Widget A", "price": "$10", "stock": 50 },
  { "name": "Widget B", "price": "$25", "stock": 12 }
]
```
The node outputs a complete HTML table with name, price, and stock columns, ready to embed in emails or reports.
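The generated markup looks roughly like this (the exact attributes and styling n8n applies may vary by version):
```html
<table>
  <thead>
    <tr><th>name</th><th>price</th><th>stock</th></tr>
  </thead>
  <tbody>
    <tr><td>Widget A</td><td>$10</td><td>50</td></tr>
    <tr><td>Widget B</td><td>$25</td><td>12</td></tr>
  </tbody>
</table>
```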
Your First HTML Extraction
Let’s build a complete scraping workflow step by step.
Step 1: Fetch the Web Page
First, use the HTTP Request node to retrieve the page’s HTML:
- Add an HTTP Request node to your workflow
- Set Method to GET
- Enter a URL (for testing, use `https://books.toscrape.com/`)
- Click Test step
The node returns the page’s HTML content in the response body.
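If you inspect the output item, the markup typically lands in a single `data` property, roughly like this (truncated for readability):
```json
{
  "data": "<!DOCTYPE html>\n<html lang=\"en-us\">\n<head>\n<title>All products | Books to Scrape</title>..."
}
```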
Step 2: Add the HTML Node
- Add an HTML node after HTTP Request
- Set Operation to “Extract HTML Content”
- For Source Data, select “JSON”
- Set JSON Property to the field containing your HTML (typically `data` from HTTP Request)
Step 3: Configure Extraction Values
Now define what to extract. Click Add Value in the Extraction Values section:
Extracting a page title:
| Setting | Value |
|---|---|
| Key | pageTitle |
| CSS Selector | h1 |
| Return Value | Text |
Extracting a product price:
| Setting | Value |
|---|---|
| Key | price |
| CSS Selector | .price_color |
| Return Value | Text |
Step 4: Test and Verify
Click Test step. The output should contain your extracted values:
```json
{
  "pageTitle": "All products",
  "price": "£51.77"
}
```
You can now use these values in subsequent nodes with expressions like `{{ $json.pageTitle }}`.
CSS Selectors: The Complete Guide
CSS selectors are patterns that identify HTML elements. The same selectors used to style webpages also work for extraction.
Finding Selectors with Browser DevTools
The fastest way to find the right selector:
- Open the webpage in Chrome (or any browser)
- Right-click the element you want to extract
- Select Inspect to open DevTools
- In the Elements panel, right-click the highlighted HTML
- Choose Copy > Copy selector
This gives you a precise selector for that element. However, auto-generated selectors are often overly specific. You may need to simplify them.
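For example, Copy selector might hand you a long position-based chain when a short class-based selector reaches the same element. The exact chain below is illustrative; the simplified version relies on the page marking prices with a `price_color` class, as books.toscrape.com does:
```css
/* Auto-generated: breaks if any wrapper element moves */
#default > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.product_price > p.price_color

/* Simplified: targets the same element by its meaningful class */
.price_color
```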
Basic Selector Patterns
| Selector | Matches | Example |
|---|---|---|
| `tagname` | All elements of that type | `h1` matches all `<h1>` elements |
| `.classname` | Elements with that class | `.product-title` matches `<div class="product-title">` |
| `#idname` | Element with that ID | `#main-content` matches `<div id="main-content">` |
| `tag.class` | Tag with specific class | `p.description` matches `<p class="description">` |
| `parent child` | Descendant elements | `div p` matches `<p>` inside `<div>` |
| `parent > child` | Direct children only | `ul > li` matches immediate `<li>` children |
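To make these patterns concrete, here is a small hypothetical snippet with matching selectors noted in comments:
```html
<div id="main-content">                         <!-- #main-content -->
  <ul class="product-list">                     <!-- ul, .product-list -->
    <li class="product featured">               <!-- li.featured, ul > li -->
      <p class="description">A fine widget</p>  <!-- p.description, div p -->
    </li>
  </ul>
</div>
```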
Attribute Selectors
Target elements by their attributes:
| Selector | Matches | Use Case |
|---|---|---|
| `[attr]` | Has attribute | `[href]` matches all links |
| `[attr="value"]` | Exact attribute value | `[type="email"]` matches email inputs |
| `[attr^="start"]` | Attribute starts with | `[href^="https"]` matches HTTPS links |
| `[attr$="end"]` | Attribute ends with | `[href$=".pdf"]` matches PDF links |
| `[attr*="contains"]` | Attribute contains | `[class*="price"]` matches classes containing “price” |
Combining Selectors
Build precise selectors by combining patterns:
```css
/* Element with multiple classes */
div.product.featured

/* Element with class inside another element */
article.post h2.title

/* Element with specific attribute inside class */
.product-card a[href*="/product/"]

/* Multiple comma-separated selectors */
h1, h2, h3
```
Pseudo-Selectors (Position-Based)
Select elements by their position:
| Selector | Matches |
|---|---|
| `:first-child` | First child element |
| `:last-child` | Last child element |
| `:nth-child(n)` | nth child (1-indexed) |
| `:nth-child(odd)` | Odd-numbered children |
| `:nth-child(even)` | Even-numbered children |
Example: .product-list li:first-child selects the first product in a list.
Important: Some advanced pseudo-selectors like :nth-child(n+4) may not work as expected in n8n. Test thoroughly before relying on complex selectors.
CSS Selector Quick Reference
| Goal | Selector |
|---|---|
| All paragraphs | p |
| Element by ID | #header |
| Element by class | .product-name |
| All links | a |
| Links whose URL contains “product” | a[href*="product"] |
| First item in list | ul li:first-child |
| All table rows | table tr |
| Specific data attribute | [data-product-id] |
| Multiple selectors | h1, h2, h3 |
For comprehensive CSS selector documentation, see MDN’s CSS Selectors guide.
Extraction Configuration Deep Dive
Understanding every configuration option prevents extraction failures.
Source Data Options
| Option | When to Use |
|---|---|
| JSON | HTML is in a JSON property (most common with HTTP Request) |
| Binary | HTML is in a binary property (file upload, attachment) |
For HTTP Request responses, use JSON and specify the property name containing HTML (usually data).
Return Value Types
This critical setting determines what gets extracted:
| Return Value | Extracts | Example Output |
|---|---|---|
| Text | Text content only (no HTML tags) | "Product Name" |
| HTML | Inner HTML including child elements | "<span>Product</span> Name" |
| Attribute | Value of a specific attribute | "/products/123" (from href) |
Text is the default and works for most extractions. Use Attribute when you need link URLs, image sources, or data attributes.
Attribute extraction example:
| Setting | Value |
|---|---|
| CSS Selector | a.product-link |
| Return Value | Attribute |
| Attribute | href |
Extraction Options
| Option | Effect |
|---|---|
| Trim Values | Removes whitespace from start and end |
| Clean Up Text | Removes line breaks and condenses multiple spaces |
| Return Array | Returns all matches as array instead of first match only |
Return Array is essential when scraping lists. Without it, you only get the first matching element.
Example: To extract all product names on a page:
| Setting | Value |
|---|---|
| Key | productNames |
| CSS Selector | .product-name |
| Return Value | Text |
| Return Array | Enabled |
Output:
```json
{
  "productNames": ["Widget A", "Widget B", "Widget C"]
}
```
Multiple Extraction Values
Add multiple extraction values to pull different data in a single node:
| Key | CSS Selector | Return Value |
|---|---|---|
| `title` | `h1.product-title` | Text |
| `price` | `.current-price` | Text |
| `imageUrl` | `.product-image img` | Attribute (`src`) |
| `description` | `.product-description` | Text |
| `rating` | `.star-rating` | Attribute (`class`) |
All values appear in the same output object.
Real-World Scraping Examples
Example 1: E-commerce Product Scraping
Scenario: Extract product information from an online store.
Workflow:
Schedule Trigger → HTTP Request (product page) → HTML Extract → Edit Fields (clean data) → Airtable/Sheets
HTTP Request:
- URL: `https://store.example.com/products/widget-123`
- Method: GET
HTML Extraction Values:
| Key | CSS Selector | Return Value |
|---|---|---|
| `productName` | `h1.product-title` | Text |
| `currentPrice` | `.price-current` | Text |
| `originalPrice` | `.price-original` | Text |
| `availability` | `.stock-status` | Text |
| `productImage` | `.product-gallery img` | Attribute (`src`) |
Post-processing with Edit Fields:
```
// Clean price (remove currency symbol, convert to number)
{{ parseFloat($json.currentPrice.replace(/[^0-9.]/g, '')) }}

// Calculate discount percentage (run after both prices have been cleaned to numbers)
{{ Math.round((1 - $json.currentPrice / $json.originalPrice) * 100) }}%
```
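If you prefer to keep the cleanup in one place, a Code node (set to run once for all items) can do both steps. This is a sketch assuming the extraction keys from the table above:
```javascript
// Clean both prices and derive the discount in a single pass.
return $input.all().map(item => {
  const toNumber = s => parseFloat(String(s).replace(/[^0-9.]/g, ''));
  const current = toNumber(item.json.currentPrice);
  const original = toNumber(item.json.originalPrice);

  item.json.currentPrice = current;
  item.json.originalPrice = original;
  // Guard against a missing or zero original price
  item.json.discountPercent = original > 0
    ? Math.round((1 - current / original) * 100)
    : 0;
  return item;
});
```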
Example 2: News Article Aggregation
Scenario: Collect headlines and links from news sites for a daily digest.
Workflow:
Schedule Trigger → HTTP Request (news page) → HTML Extract → Split In Batches → Process Each
HTML Extraction:
| Key | CSS Selector | Return Value | Return Array |
|---|---|---|---|
| `headlines` | `article h2 a` | Text | Yes |
| `links` | `article h2 a` | Attribute (`href`) | Yes |
| `timestamps` | `article time` | Attribute (`datetime`) | Yes |
Processing the arrays:
Use a Code node to combine the parallel arrays:
```javascript
const items = $input.all();
const data = items[0].json;

// Combine parallel arrays into article objects
return data.headlines.map((headline, i) => ({
  json: {
    headline: headline,
    url: data.links[i],
    published: data.timestamps[i]
  }
}));
```
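If the three selectors can match different numbers of elements (say, an article without a `<time>` tag), the parallel arrays drift out of alignment. A defensive variant clamps to the shortest array:
```javascript
// Only combine up to the shortest array so one missing field
// can't shift every later article by one position.
const { headlines = [], links = [], timestamps = [] } = $input.first().json;
const count = Math.min(headlines.length, links.length, timestamps.length);

return Array.from({ length: count }, (_, i) => ({
  json: { headline: headlines[i], url: links[i], published: timestamps[i] }
}));
```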
Example 3: Table Data Extraction
Scenario: Parse pricing tables or data grids from web pages.
Workflow:
HTTP Request → HTML Extract (headers) → HTML Extract (rows) → Code (combine) → Output
Extracting table headers:
| Key | CSS Selector | Return Value | Return Array |
|---|---|---|---|
| `headers` | `table thead th` | Text | Yes |
Extracting table rows:
| Key | CSS Selector | Return Value | Return Array |
|---|---|---|---|
| `rowData` | `table tbody tr` | HTML | Yes |
Then use a Code node to parse row data into structured objects using the headers.
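A minimal sketch of that Code node, assuming the `headers` and `rowData` keys from the tables above and simple tables without nested markup in the cells:
```javascript
const { headers, rowData } = $input.first().json;

return rowData.map(rowHtml => {
  // Grab each <td>'s contents, then strip any remaining tags.
  const cells = [...rowHtml.matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
    .map(m => m[1].replace(/<[^>]+>/g, '').trim());

  // Zip the header names with the cell values.
  const row = {};
  headers.forEach((h, i) => { row[h] = cells[i] ?? null; });
  return { json: row };
});
```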
Example 4: Generating HTML Email Templates
Scenario: Create personalized HTML emails from workflow data.
Setup:
- Set Operation to “Generate HTML Template”
- Enter your HTML template with n8n expressions:
```html
<div style="font-family: Arial, sans-serif; max-width: 600px;">
  <h1>Hello {{ $json.firstName }}!</h1>
  <p>Your order #{{ $json.orderId }} has shipped.</p>
  <table style="width: 100%; border-collapse: collapse;">
    <tr>
      <td>Tracking Number:</td>
      <td>{{ $json.trackingNumber }}</td>
    </tr>
    <tr>
      <td>Estimated Delivery:</td>
      <td>{{ $json.estimatedDelivery }}</td>
    </tr>
  </table>
  <p>Thank you for your business!</p>
</div>
```
The output is a rendered HTML string ready for email nodes.
Handling JavaScript-Rendered Content
A critical limitation: the HTTP Request node fetches raw HTML before JavaScript executes. Modern websites using React, Vue, Angular, or other frameworks often render content client-side.
Symptoms of JavaScript-Rendered Content
- Your selector works in the browser but returns nothing in n8n
- Inspecting the HTTP Request output shows minimal HTML
- The page shows a loading spinner or “Enable JavaScript” message
- Content appears after a delay when you load the page manually
Solutions
1. Check for API endpoints
Many JavaScript sites fetch data from APIs. Open browser DevTools Network tab, filter by “Fetch/XHR”, and look for JSON responses. You may be able to call these APIs directly with HTTP Request.
2. Use headless browser services
Services like Browserless, ScrapingBee, or ScrapFly render pages in real browsers and return the final HTML.
3. Community nodes
Search the n8n community for nodes that support JavaScript rendering, such as those integrating with Puppeteer or Playwright.
4. External scraping APIs
Third-party scraping services handle JavaScript rendering and anti-bot measures for you.
For most static websites and server-rendered pages, the standard HTTP Request + HTML combination works perfectly.
CSS Selector Troubleshooting
When extraction fails, systematic debugging finds the problem.
Common Errors and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Selector returns empty/null | Element doesn’t exist in raw HTML | Check if content is JavaScript-rendered |
| Selector works in browser, fails in n8n | Different HTML structure or JS-rendered | Compare HTTP Request output with browser source |
| Only first element extracted | Return Array disabled | Enable “Return Array” option |
| Wrong element selected | Selector too generic | Make selector more specific |
| Extraction includes unwanted text | Selector matches parent element | Target more specific child element |
| Special characters break selector | Unescaped characters in selector | Escape special characters or use attribute selector |
Dynamic Class Names (Next.js, React)
Modern frameworks often generate class names with random suffixes:
```html
<div class="ProductCard_container__a1B2c">
```
The `a1B2c` part changes on every deployment, breaking your selector.
Solutions:
- Use partial attribute matching:
```css
[class^="ProductCard_container"]
[class*="ProductCard_container"]
```
- Find stable identifiers:
Look for data- attributes, IDs, or semantic HTML that doesn’t change:
```css
[data-testid="product-card"]
article[itemtype*="Product"]
```
- Use structural selectors:
Target elements by their position in stable parent structures:
```css
.products-grid > div:first-child
main article:nth-child(2)
```
Spaces in Class Names
HTML elements can have multiple classes separated by spaces:
```html
<div class="product featured sale">
```
To match this element, use any single class:
```css
.product
.featured
.sale
```
Or combine them (element must have all):
```css
.product.featured.sale
```
Common mistake: using `.product featured` (with a space). The space makes it a descendant selector: it looks for a `<featured>` tag inside an element with class `product`, so it matches nothing.
Selector Maintenance Strategies
- Prefer IDs over classes: IDs are typically more stable
- Use data attributes: `[data-product-id]` is often added intentionally and stays stable
- Avoid generated class names: Skip classes that look like random strings
- Test with multiple pages: Ensure selectors work across different page variations
- Document your selectors: Add comments explaining what each selector targets
- Monitor for failures: Set up error handling to alert you when extractions fail
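For that last point, a simple guard in a Code node placed after the extraction will fail the execution loudly, so an error workflow can alert you. The `productNames` key here is a hypothetical output from an earlier HTML node:
```javascript
const { productNames } = $input.first().json;

// An empty result usually means the site changed and the selector broke.
if (!Array.isArray(productNames) || productNames.length === 0) {
  throw new Error('HTML extraction returned no products: check the CSS selector');
}
return $input.all();
```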
For debugging expression issues, our workflow debugger tool can help identify problems.
Pro Tips and Best Practices
1. Always Test Selectors in Browser First
Before configuring the HTML node:
- Open the target page in your browser
- Press F12 to open DevTools
- Go to Console tab
- Run `document.querySelectorAll('your-selector')` in the console
- Verify the correct elements are selected
This catches selector errors before they reach your workflow.
2. Compare Browser Source vs HTTP Request
Sometimes the browser shows different HTML than HTTP Request returns:
- View Page Source (Ctrl+U) shows the raw HTML, similar to HTTP Request
- DevTools Elements panel shows the DOM after JavaScript modifications
Always compare your selectors against the raw source, not the rendered DOM.
3. Handle Missing Data Gracefully
Not every page will have every element. Use expressions to handle missing values:
```
{{ $json.price || 'Price not available' }}
{{ $json.rating ?? 0 }}
```
4. Respect Rate Limits
When scraping multiple pages:
- Add Wait nodes between requests
- Use the Split In Batches node
- Check if the site has robots.txt restrictions
- Consider using our rate limiting strategies
5. Set Proper Headers to Avoid Blocks
Some websites block requests that lack browser-like headers. In your HTTP Request node, add these headers under Options > Headers:
| Header | Value |
|---|---|
| `User-Agent` | `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36` |
| `Accept` | `text/html,application/xhtml+xml` |
This mimics a real browser request and prevents many basic blocks.
6. Build Incrementally
Start with one extraction value. Test it. Add the next. This isolates problems and prevents debugging complex configurations.
7. Use Meaningful Keys
Instead of value1, value2, use descriptive keys like productPrice, productTitle. This makes downstream processing clearer.
8. Combine with Code Node for Complex Parsing
When CSS selectors alone aren’t enough, extract raw HTML and parse it in a Code node:
```javascript
const html = $json.rawHtml;

// Use regex for complex extraction
const priceMatch = html.match(/\$(\d+\.\d{2})/);
const price = priceMatch ? parseFloat(priceMatch[1]) : null;

return { price };
```
For expression validation, try our expression validator tool.
When to Get Help
Some scraping scenarios require specialized expertise:
- Anti-bot protection: Cloudflare, reCAPTCHA, or other blocking mechanisms
- Session-based content: Pages requiring login or complex cookies
- Large-scale scraping: Thousands of pages with rate limiting concerns
- Data transformation: Complex restructuring of scraped data
Our workflow development services can build production-ready scraping solutions. For architectural guidance, explore our n8n consulting services.
Frequently Asked Questions
Why does my CSS selector work in Chrome DevTools but return nothing in n8n?
This disconnect is almost always caused by JavaScript-rendered content. When you inspect a page in Chrome, you see the DOM after JavaScript has executed. When HTTP Request fetches the page, it gets the raw HTML before any JavaScript runs.
To verify: Press Ctrl+U in Chrome to view the raw source code. Search for the text or element you’re trying to extract. If it doesn’t exist in the source but appears in the DOM, the content is JavaScript-rendered.
Solutions:
- Check if the site has a public API that returns the data as JSON
- Use a headless browser service that executes JavaScript
- Look for community nodes that support JavaScript rendering
For static pages, ensure your selector exactly matches the HTML structure. Copy a selector from DevTools, but verify it against the raw source.
How do I scrape content that only appears after JavaScript loads?
The standard HTTP Request node cannot execute JavaScript. You have several options:
API discovery: Open DevTools Network tab, filter by “Fetch/XHR”, and watch the requests as the page loads. Many JavaScript sites load data from APIs that you can call directly.
Headless browser services: Services like Browserless, ScrapingBee, or Apify render pages in real browsers and return the final HTML. Use HTTP Request to call their APIs.
Server-side rendering detection: Some sites serve full HTML to search engine bots. Try adding a User-Agent header that mimics Googlebot.
The best solution depends on your specific target site. Start by investigating whether an API exists.
What is the difference between Text, HTML, and Attribute return values?
These options determine what data gets extracted from matched elements:
Text returns the visible text content only, stripping all HTML tags:
```html
<p>Hello <strong>world</strong></p>
```
Returns: `"Hello world"`
HTML returns the inner HTML including all child elements and tags:
```html
<p>Hello <strong>world</strong></p>
```
Returns: `"Hello <strong>world</strong>"`
Attribute returns the value of a specific attribute you specify:
```html
<a href="/products/123" class="link">View Product</a>
```
With Attribute set to `href`, returns: `"/products/123"`
Use Text for readable content (titles, descriptions, prices). Use Attribute for URLs, image sources, data attributes, or class names.
How do I handle websites where class names change with each deployment?
Modern frameworks like Next.js, Nuxt, and others often generate class names with hash suffixes (`Button_primary__x7Yz9`) that change when the site is rebuilt.
Strategy 1: Partial class matching
Use CSS attribute selectors that match the beginning of the class:
[class^="Button_primary"]
[class*="ProductCard_title"]
Strategy 2: Find stable identifiers
Look for elements that developers add intentionally:
- `data-testid` attributes (added for testing)
- `id` attributes
- Semantic HTML5 elements (`article`, `main`, `nav`)
- Schema.org markup (`itemtype`, `itemprop`)
Strategy 3: Structural selectors
If the page structure is stable even when classes change:
```css
.product-grid > div > h2
main > section:first-child p
```
Strategy 4: XPath alternative
For very complex cases, extract the HTML and use a Code node with a DOM parsing library to find elements by text content or position.
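On a self-hosted instance where external modules are allowed (for example, `NODE_FUNCTION_ALLOW_EXTERNAL=cheerio`), a Code node along these lines can find elements by text content. The `rawHtml` key and the "Widget" text are assumptions for illustration:
```javascript
const cheerio = require('cheerio');
const $ = cheerio.load($input.first().json.rawHtml);

// Select by text content instead of an unstable class name.
const title = $('h2')
  .filter((i, el) => $(el).text().includes('Widget'))
  .first()
  .text()
  .trim();

return [{ json: { title } }];
```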
Monitor your scrapers regularly, as websites can change structure at any time.
Can I extract multiple different elements with different selectors in one node?
Yes. The HTML node supports multiple extraction values in a single operation. Click Add Value to add additional extractions.
Each extraction value has its own:
- Key (output property name)
- CSS Selector
- Return Value type
- Options
All extracted values appear in the same output object:
```json
{
  "title": "Product Name",
  "price": "$29.99",
  "imageUrl": "/images/product.jpg",
  "description": "A great product..."
}
```
For extractions that return arrays (with “Return Array” enabled), each array is a separate property:
```json
{
  "titles": ["Product 1", "Product 2", "Product 3"],
  "prices": ["$10", "$20", "$30"],
  "links": ["/p/1", "/p/2", "/p/3"]
}
```
To combine these parallel arrays into structured objects, use a Code node after extraction. This pattern is covered in the real-world examples section above.