Extract Data from PDFs in n8n: Text, Tables, and Scanned Documents
PDFs are the black boxes of automation. Critical business data sits trapped inside invoices, contracts, reports, and forms, completely invisible to your n8n workflows. Your automation can receive these files, move them around, even store them, but actually reading what's inside? That's where most workflows hit a wall.
The n8n community forums overflow with PDF extraction questions. Users download files from APIs, receive them as email attachments, or pull them from cloud storage. The Extract from File node seems like the obvious solution. They configure it, run the workflow, and get… empty results. Or garbage text. Or everything except the table data they actually needed.
The PDF Problem
Not all PDFs are created equal, and this single fact is behind the vast majority of extraction failures:
| PDF Type | How It's Created | Extraction Challenge |
|---|---|---|
| Text-based | Digitally created (Word, Google Docs, software exports) | Easy to extract, basic tools work |
| Image-based | Scanned paper documents | Appears as pictures, requires OCR |
| Mixed | Digital text with embedded images, charts, signatures | Partial extraction, may need multiple approaches |
That "simple" PDF invoice your client sends might be a scanned document, an image exported from their accounting software, or a digitally-generated file. Each requires a different extraction approach. Using the wrong method returns empty results or unusable text.
The Decision You Need to Make
Before writing a single node, answer this question: What type of PDF are you working with?
Quick test: Open your PDF and try to select text with your cursor. If you can highlight individual words, it's text-based. If selecting grabs the entire page like an image, it's image-based or mixed.
| Your PDF Type | Recommended Method | Complexity |
|---|---|---|
| Text-based, simple layout | Native Extract from File | Low |
| Image-based (scanned) | OCR services | Medium |
| Tables, forms, complex layouts | AI vision models | Medium-High |
| Mixed documents | Combination approach | High |
What You'll Learn
- How to extract text from standard PDFs using n8n's built-in Extract from File node
- When and how to use OCR services for scanned documents
- AI vision patterns for extracting tables, forms, and complex layouts
- Complete workflow examples for invoice processing and batch document handling
- Troubleshooting the most common PDF extraction errors
- A decision framework for choosing the right approach
Understanding PDF Types Before You Extract
Jumping straight into extraction without understanding your source documents leads to wasted time and broken workflows. This section saves you from the trial-and-error approach that frustrates most users.
Text-Based PDFs
Created by: Word processors, Google Docs, software exports, âPrint to PDFâ functions
Characteristics:
- Text is selectable and copyable
- Search (Ctrl+F) works within the document
- Usually smaller file sizes
- Clean, consistent formatting
Extraction success rate: High with native tools
Text-based PDFs store actual character data. When you "see" the letter A, the file contains the character A, not a picture of the letter. This makes extraction straightforward.
Image-Based PDFs
Created by: Document scanners, photo-to-PDF apps, fax machines, screenshot captures
Characteristics:
- Text cannot be selected
- Search doesn't find any content
- Often larger file sizes
- May show scan artifacts, slight angles, or shadows
Extraction success rate: Zero with native tools, requires OCR
Image-based PDFs are essentially pictures wrapped in a PDF container. The "text" you see is pixels, not characters. The Extract from File node reads character data, so it returns nothing useful from these documents.
Mixed PDFs
Created by: Combining digital documents with scanned signatures, embedding charts or images, some invoice generators
Characteristics:
- Some text is selectable, some isnât
- Search finds partial content
- Inconsistent behavior across pages
Extraction success rate: Partial with native tools
Mixed PDFs present the hardest challenge. You might extract 80% of the text perfectly while missing critical data embedded in images. Invoice totals trapped in a scanned signature block, form fields rendered as graphics, or tables saved as images all create blind spots.
How to Identify Your PDF Type
The Selection Test:
- Open your PDF in any viewer (browser, Preview, Adobe Reader)
- Click and drag across text
- Observe the behavior:
- Individual words highlight = text-based
- Entire page/region highlights as one block = image-based
- Mixed highlighting = mixed PDF
The Search Test:
- Press Ctrl+F (Cmd+F on Mac)
- Search for a word you can see on the page
- Results:
- Found = text is extractable
- Not found = image-based or text in images
Run both tests on multiple pages. Some PDFs switch between text and image pages.
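If you want the workflow itself to make this decision, one approach is to run Extract from File first and branch on how much text comes back: scanned PDFs return empty or near-empty strings. A minimal sketch for an n8n Code node (the `classifyPdfText` helper name and the 20-character threshold are illustrative choices, not built-ins):

```javascript
// Heuristic check on Extract from File output: scanned PDFs come back
// empty or nearly empty, so very short text suggests an image-based file.
// The 20-character threshold is an illustrative starting point.
function classifyPdfText(extractedText) {
  const text = (extractedText || '').trim();
  if (text.length === 0) return 'image-based';
  if (text.length < 20) return 'likely-image-based';
  return 'text-based';
}

// In n8n, an IF or Switch node would branch on this value and send
// 'image-based' documents down the OCR path instead.
console.log(classifyPdfText(''));                            // image-based
console.log(classifyPdfText('Invoice #123 Total: $50.00'));  // text-based
```

This catches the most common failure mode (empty results from scans) before bad data flows downstream.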
Method 1: Native PDF Extraction with Extract from File
For text-based PDFs, n8n's built-in extraction works well. This method requires no external services, costs nothing, and runs entirely within your n8n instance.
When to Use Native Extraction
Good candidates:
- Digitally-generated reports and exports
- Software-created invoices (QuickBooks, Xero, FreshBooks PDFs)
- Documents saved from Word, Google Docs, or similar
- API responses that return PDF reports
Poor candidates:
- Scanned documents (will return empty)
- PDFs with tables you need as structured data (returns unstructured text)
- Forms where layout matters
- Image-heavy documents
Step-by-Step Workflow
The basic pattern connects a file source to the extraction node:
Trigger → Get PDF (HTTP Request/Gmail/etc.) → Extract from File → Process Data
1. Get the PDF file
Using HTTP Request for files from URLs:
- Set Method to GET
- Enter the PDF URL
- Response automatically handles binary data
Using Gmail for email attachments:
- Configure Gmail trigger or node
- Attachments appear as binary properties like attachment_0
Using Read/Write Files from Disk (self-hosted only):
- Set Operation to "Read File(s) From Disk"
- Enter the file path
2. Add Extract from File node
- Set Operation to "Extract from PDF"
- Set Binary Property to match your source (usually "data", or "attachment_0" for email)
- Run the node
3. Process the extracted text
The output is a single string containing all text from the PDF. For the official parameter reference, see the n8n documentation.
Parsing Extracted Text with Code
Raw PDF text extraction returns an unstructured string. To extract specific values, use the Code node with regular expressions:
// The extracted PDF text arrives as a single string
const text = $json.text;
// Extract invoice number using regex pattern
// Looks for "Invoice #:" or "Invoice Number:" followed by digits
const invoiceMatch = text.match(/Invoice\s*(?:#|Number)?:?\s*(\d+)/i);
const invoiceNumber = invoiceMatch ? invoiceMatch[1] : null;
// Extract total amount
// Handles formats like "$1,234.56" or "Total: 1234.56"
const totalMatch = text.match(/Total:?\s*\$?([\d,]+\.?\d*)/i);
const total = totalMatch
? parseFloat(totalMatch[1].replace(/,/g, ''))
: null;
// Extract date (various formats)
const dateMatch = text.match(/Date:?\s*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})/i);
const date = dateMatch ? dateMatch[1] : null;
return {
invoiceNumber,
total,
date,
rawText: text // Keep for debugging
};
For more JavaScript patterns, see our Code node recipes guide.
Limitations of Native Extraction
Native extraction has hard limits you need to understand:
What it extracts:
- Paragraph text and headings
- Text in form fields
- Text from tables (but not table structure)
What it cannot extract:
- Images or graphics
- Text embedded in images
- Table structure (rows/columns)
- Form field labels vs values as structured data
Common failure scenarios:
- Scanned PDFs - Returns empty or garbage characters
- PDFs with tables - Text extracted but row/column relationships lost
- Password-protected PDFs - Extraction fails
- Image-heavy layouts - Missing critical information
When you hit these limits, move to OCR or AI vision methods.
Method 2: OCR for Scanned Documents
Optical Character Recognition (OCR) converts images of text into actual text characters. For scanned documents, faxes, or photo-captured paperwork, OCR is mandatory.
When to Use OCR
Required for:
- Scanned paper documents
- Photos of documents
- Faxed documents converted to PDF
- Any PDF where text selection doesn't work
Advantages over native extraction:
- Works on image-based content
- Modern OCR handles various fonts and qualities
- Many services provide additional features (layout detection, confidence scores)
AI-Enhanced OCR Services
Modern OCR services use AI to improve accuracy beyond traditional character recognition. These services connect to n8n via the HTTP Request node.
Common providers:
| Service | Strengths | Pricing Model |
|---|---|---|
| Mistral | High accuracy, markdown output, page splitting | Per-page |
| Google Cloud Document AI | Enterprise features, form extraction | Per-page |
| AWS Textract | Table extraction, form parsing | Per-page |
| Azure Form Recognizer | Pre-built models for invoices, receipts | Per-page |
Generic OCR API Pattern:
Get PDF → Convert to Base64 → HTTP Request (OCR API) → Parse Response
The base64 conversion step is often required because OCR APIs expect the file content encoded as a string rather than raw binary. For details on base64 encoding, see the MDN Base64 documentation.
Example HTTP Request configuration for OCR APIs:
// In a Code node before HTTP Request, prepare the payload
const binaryData = await this.helpers.getBinaryDataBuffer(0, 'data'); // item index first, then property name
const base64Content = binaryData.toString('base64');
return {
json: {
file_content: base64Content,
file_type: 'pdf'
}
};
Then configure HTTP Request:
- Method: POST
- URL: Your OCR service endpoint
- Body Content Type: JSON
- Body: Reference the prepared payload
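As a concrete sketch of that configuration, here is one way to assemble the request in code. The endpoint URL, header names, and body fields are placeholders following the generic payload above; every OCR provider defines its own schema, so check your provider's docs:

```javascript
// Assemble a generic OCR API request from a base64-encoded PDF.
// Endpoint, header names, and body fields are placeholders; match
// them to your provider's documented schema.
function buildOcrRequest(base64Content, endpoint, apiKey) {
  return {
    method: 'POST',
    url: endpoint,
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`, // many providers use a bearer token
    },
    body: JSON.stringify({
      file_content: base64Content,
      file_type: 'pdf',
    }),
  };
}

// 'JVBERi0x...' is the base64 prefix of a PDF header, used as a stand-in
const req = buildOcrRequest('JVBERi0x...', 'https://ocr.example.com/v1/parse', 'sk-test');
console.log(req.method); // POST
```

In practice you would map these fields onto the HTTP Request node's Method, URL, Headers, and Body parameters rather than calling the API from code.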
Self-Hosted OCR with Tesseract
For privacy-sensitive documents or air-gapped environments, Tesseract OCR provides open-source text recognition. Community nodes bring Tesseract into n8n.
When to use self-hosted:
- Sensitive documents that cannot leave your infrastructure
- High volume processing where per-page costs add up
- Environments without internet access
- Complete control over the OCR process
Setup considerations:
- Install a Tesseract community node (search n8n community nodes for "tesseract" or "ocr")
- The node accepts images or PDFs
- For PDFs, pages are processed individually
Confidence threshold tuning:
OCR returns confidence scores indicating how certain it is about each character. Tune thresholds based on your needs:
- Below 95%: Usually unusable, high error rate
- 95-97%: Acceptable for non-critical data
- 97.5-98.5%: Sweet spot for most applications
- Above 99%: High confidence, likely correct
// Filter OCR results by confidence
const results = $json.ocrResults;
const highConfidence = results.filter(r => r.confidence > 0.975);
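Building on that filter, you can route low-confidence pages to a manual-review queue instead of silently discarding them. A sketch using the threshold ranges above (the values are starting points, not fixed rules):

```javascript
// Split OCR results into auto-accept, manual review, and discard buckets.
// Default thresholds mirror the ranges discussed above; tune per workload.
function routeByConfidence(results, acceptAt = 0.975, reviewAt = 0.95) {
  const accepted = results.filter(r => r.confidence >= acceptAt);
  const review = results.filter(r => r.confidence >= reviewAt && r.confidence < acceptAt);
  const discarded = results.filter(r => r.confidence < reviewAt);
  return { accepted, review, discarded };
}

const buckets = routeByConfidence([
  { text: 'Total: $50.00', confidence: 0.99 },
  { text: 'Inv0ice', confidence: 0.96 },
  { text: '~#%', confidence: 0.6 },
]);
console.log(buckets.accepted.length, buckets.review.length, buckets.discarded.length); // 1 1 1
```

Downstream, each bucket can feed a different branch: accepted results continue automatically, review items go to a human, discards trigger re-scanning.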
Choosing an OCR Approach
| Criteria | Cloud OCR APIs | Self-Hosted (Tesseract) |
|---|---|---|
| Accuracy | Higher (AI-enhanced) | Good (improving with updates) |
| Setup | Quick (API key) | More complex (node install) |
| Cost | Per-page pricing | Infrastructure only |
| Privacy | Data leaves your server | Stays on-premise |
| Languages | Extensive support | Requires language packs |
| Features | Table extraction, forms | Basic text extraction |
Decision guidance:
- Choose cloud APIs when accuracy is critical, you need advanced features, or document volume is moderate
- Choose self-hosted when privacy is non-negotiable, volume is high, or you need offline capability
Method 3: AI Vision for Complex Documents
When PDFs contain tables, forms, or complex layouts where structure matters, AI vision models provide the most capable extraction. These models "see" the document like a human would and can understand spatial relationships.
When to Use AI Vision
Best suited for:
- Tables that need to become structured data
- Forms with labeled fields
- Invoices with line items
- Contracts with specific clauses
- Any document where layout conveys meaning
Advantages:
- Understands document structure, not just text
- Can follow instructions like "extract all line items as an array"
- Handles mixed content (text, tables, images)
- Provides structured JSON output
Converting PDFs to Images (Required Step)
Most AI vision APIs accept images, not PDFs directly. You need an intermediate conversion step.
Why this limitation exists:
PDF is a complex format that vision APIs don't natively parse. They're optimized for image analysis. Converting PDF pages to images (PNG, JPEG) lets the vision model "see" exactly what a human sees.
Option 1: PDF Conversion API Services
Several APIs accept PDFs and return images. The pattern uses HTTP Request:
// HTTP Request to a PDF-to-image API
// Method: POST
// URL: Your conversion service endpoint
// Body: multipart/form-data with the PDF file
// The response typically returns:
// - Array of base64 images (one per page)
// - Or URLs to download each page image
Common services include ConvertAPI, PDF.co, CloudConvert, and similar. Each has slightly different request formats but the concept is identical: send PDF, receive images.
Option 2: Self-Hosted Conversion with Docker
For privacy or cost control, run your own conversion service. Tools like Stirling-PDF or Gotenberg provide PDF-to-image conversion via HTTP API:
# Add to your docker-compose.yml alongside n8n
stirling-pdf:
image: frooodle/s-pdf:latest
ports:
- "8080:8080"
environment:
- DOCKER_ENABLE_SECURITY=false
Then call it from n8n:
// HTTP Request to your self-hosted service
// URL: http://stirling-pdf:8080/api/v1/convert/pdf/img
// Method: POST
// Body: multipart/form-data with PDF file
// Returns: ZIP containing PNG images of each page
Option 3: Direct Vision API Support (Newer Feature)
Some AI providers now accept PDFs directly in their vision APIs, eliminating the conversion step. Check your provider's current documentation, as this capability is expanding. When available, you can skip conversion entirely and send the PDF as a base64 payload.
Workflow pattern:
Get PDF → Convert to Images → Loop Over Pages → AI Vision API → Combine Results
For single-page documents, skip the loop. For multi-page, use Split In Batches to process each page image sequentially.
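When you do loop, each iteration returns one page's result, so a final Code node has to stitch the pages back together. A minimal combiner, assuming each item carries illustrative `page` and `text` fields from the vision step:

```javascript
// Merge per-page extraction results back into one document, sorting by
// page number so out-of-order batch items line up. The `page` and `text`
// field names are illustrative; match them to your loop's actual output.
function combinePages(pages) {
  return pages
    .slice() // avoid mutating the input while sorting
    .sort((a, b) => a.page - b.page)
    .map(p => p.text)
    .join('\n\n--- page break ---\n\n');
}

const merged = combinePages([
  { page: 2, text: 'Terms and conditions...' },
  { page: 1, text: 'Invoice #123' },
]);
console.log(merged.startsWith('Invoice #123')); // true
```

The explicit page-break marker also helps later steps (or a human reviewer) see where page boundaries fell.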
For a ready-to-use implementation of AI-powered document extraction, see our AI Document Extraction workflow template.
Vision AI Extraction Pattern
The core pattern uses HTTP Request to call AI vision APIs (OpenAI, Anthropic Claude, Google Gemini, or others):
1. Prepare the image
Convert PDF page to base64-encoded image:
// After PDF-to-image conversion
const imageBuffer = await this.helpers.getBinaryDataBuffer(0, 'data'); // item index first, then property name
const base64Image = imageBuffer.toString('base64');
const mimeType = 'image/png'; // or jpeg
return {
json: {
image: base64Image,
mimeType
}
};
2. Configure the API request
Structure your prompt to get structured output:
const prompt = `Analyze this invoice image and extract the following as JSON:
- vendor_name: The company that issued this invoice
- invoice_number: The unique invoice identifier
- invoice_date: Date in YYYY-MM-DD format
- due_date: Payment due date in YYYY-MM-DD format
- line_items: Array of objects with {description, quantity, unit_price, total}
- subtotal: Amount before tax
- tax: Tax amount
- total: Final total amount
Return ONLY valid JSON, no explanation.`;
3. Send to vision API
HTTP Request configuration varies by provider but follows a similar pattern:
- Method: POST
- Authentication: API key in header
- Body: JSON with image data and prompt
4. Parse the response
Vision APIs return JSON (usually inside a text response). Parse it for downstream use:
// Parse AI response
const aiResponse = $json.choices[0].message.content;
// The response may have markdown code blocks, strip them
const jsonString = aiResponse
.replace(/```json\n?/g, '')
.replace(/```\n?/g, '')
.trim();
try {
return JSON.parse(jsonString);
} catch (e) {
// If parsing fails, return raw for debugging
return {
error: 'JSON parse failed',
raw: aiResponse
};
}
If you encounter JSON formatting issues, our JSON Fixer tool can help diagnose problems.
Choosing a Vision AI Provider
| Provider | Strengths | Considerations |
|---|---|---|
| OpenAI | Widely used, good documentation | Token-based pricing |
| Anthropic (Claude) | Strong reasoning, handles complex layouts | Message-based pricing |
| Google (Gemini) | Competitive pricing, multimodal native | Newer, evolving features |
Decision factors:
- Accuracy needs: All major providers handle standard documents well
- Existing relationships: Use what you already have credentials for
- Cost structure: Compare based on your document volume
- Output format: Some handle JSON instructions better than others
For comprehensive vision API documentation, see OpenAI Vision or Anthropic's documentation.
Real-World Use Case: Invoice Processing
Invoice extraction is the most common PDF automation use case. Here's a production-ready approach that handles real-world complexity.
The Challenge
Invoices arrive from multiple vendors, each with different layouts. A single extraction prompt fails because:
- Field positions vary (invoice number top-right vs top-left)
- Terminology differs ("Invoice #" vs "Bill Number" vs "Reference")
- Table structures change (some have quantity columns, others don't)
- Date formats vary by region
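The region-dependent date problem in particular is worth solving in code before data reaches downstream systems. A sketch that normalizes common numeric formats to ISO YYYY-MM-DD (it assumes US month-first ordering for ambiguous dates, a simplification; override per vendor where you know the convention):

```javascript
// Normalize common numeric date formats to YYYY-MM-DD.
// Assumes US month-first ordering unless the first number exceeds 12,
// which forces day-first. Genuinely ambiguous dates (e.g. 03/04/2024)
// keep US ordering, so apply a per-vendor rule where you know better.
function normalizeDate(raw) {
  const m = raw.match(/(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{2,4})/);
  if (!m) return null;
  const a = parseInt(m[1], 10);
  const b = parseInt(m[2], 10);
  const year = m[3].length === 2 ? 2000 + parseInt(m[3], 10) : parseInt(m[3], 10);
  const [month, day] = a > 12 ? [b, a] : [a, b]; // 15/03/2024 must be day-first
  const pad = n => String(n).padStart(2, '0');
  return `${year}-${pad(month)}-${pad(day)}`;
}

console.log(normalizeDate('3/15/2024'));  // 2024-03-15
console.log(normalizeDate('15/03/2024')); // 2024-03-15
```

Running extracted dates through a normalizer like this keeps your output schema consistent no matter which vendor format came in.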
Provider Detection Pattern
Instead of one-size-fits-all extraction, detect the vendor first and apply vendor-specific logic:
Receive Invoice → Extract Vendor Name → Route by Vendor → Apply Specific Template → Output Structured Data
Step 1: Initial AI pass to identify vendor
const prompt = `Look at this invoice and identify:
1. The vendor/company name that issued this invoice
2. Return as JSON: {"vendor": "Company Name"}`;
Step 2: Route based on vendor
Use a Switch node to route to vendor-specific extraction prompts:
// Switch conditions
Vendor contains "Acme Corp" → Acme extraction prompt
Vendor contains "Global Supply" → Global Supply extraction prompt
Default → Generic extraction prompt
Step 3: Vendor-specific extraction
Each vendor path uses a tailored prompt that matches their invoice format. This dramatically improves accuracy.
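The same routing can live in a Code node instead of a Switch node if you prefer keeping all prompt templates in one place. The vendor names and prompt text below are placeholders for your own templates:

```javascript
// Map detected vendor names to tailored extraction prompts, with a
// generic fallback. Vendor names and prompt text are placeholders.
const VENDOR_PROMPTS = {
  'acme corp': 'Extract Acme fields: the invoice number is top-left, labeled "Invoice #"...',
  'global supply': 'Extract Global Supply fields: the reference is labeled "Ref"...',
};
const GENERIC_PROMPT =
  'Extract vendor, invoice_number, date, line_items, and total as JSON.';

function selectPrompt(vendorName) {
  const name = (vendorName || '').toLowerCase();
  const key = Object.keys(VENDOR_PROMPTS).find(v => name.includes(v));
  return key ? VENDOR_PROMPTS[key] : GENERIC_PROMPT;
}

console.log(selectPrompt('Acme Corp Ltd.') !== GENERIC_PROMPT); // true
console.log(selectPrompt('Unknown Vendor') === GENERIC_PROMPT); // true
```

The case-insensitive substring match tolerates minor variations like "Acme Corp Ltd." in the detected vendor name.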
Few-Shot Learning Approach
For vendors without predefined templates, include examples in your prompt:
const prompt = `Extract invoice data from this image.
Here's an example of the expected output format:
{
"vendor": "Example Corp",
"invoice_number": "INV-12345",
"date": "2024-03-15",
"line_items": [
{"description": "Widget A", "quantity": 10, "price": 5.00, "total": 50.00}
],
"total": 50.00
}
Now extract from the provided invoice, matching this structure exactly.`;
The example shows the AI your expected output format, improving consistency.
Handling Extraction Failures
Not every extraction succeeds. Build in error handling:
// After AI extraction
const extracted = $json.extractedData;
// Validate required fields
const required = ['vendor', 'invoice_number', 'total'];
const missing = required.filter(f => !extracted[f]);
if (missing.length > 0) {
// Route to manual review queue
return {
status: 'needs_review',
missing_fields: missing,
raw_extraction: extracted
};
}
return {
status: 'success',
data: extracted
};
For complete invoice processing workflows, check our Invoice Processing Automation template.
Processing Multiple PDFs (Batch Workflows)
Single-document workflows are straightforward. Batch processing hundreds of PDFs introduces new challenges.
Loop Patterns for Multiple Documents
Use the Split In Batches or Loop Over Items pattern:
Get File List → Loop Over Items → Extract Each → Aggregate Results
Key considerations:
- Error isolation - One failed document shouldn't stop the entire batch
- Rate limiting - External APIs have limits; pace your requests
- Memory management - Large PDFs consume memory; process sequentially for heavy files
- Progress tracking - Know which documents succeeded or failed
Error Handling Per Document
Wrap extraction in error handling that continues the batch:
// In a Code node, process with try-catch
const results = [];
for (const item of $input.all()) {
try {
// Your extraction logic; extractPdf is a placeholder for your
// OCR/AI call or an Execute Workflow sub-workflow
const extracted = await extractPdf(item);
results.push({
success: true,
filename: item.json.filename,
data: extracted
});
} catch (error) {
results.push({
success: false,
filename: item.json.filename,
error: error.message
});
}
}
return results;
Performance Optimization
For high-volume processing:
- Process documents in parallel where APIs allow
- Use queue-based architecture for very large batches (see our queue mode guide)
- Consider batch endpoints if your OCR/AI provider offers them
- Cache vendor detection results to skip redundant AI calls
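For the parallel-processing point, a bounded worker pool raises throughput without blowing past API rate limits. A sketch (the default limit of 3 is arbitrary; match it to your provider's allowance):

```javascript
// Run an async extraction function over many items with bounded
// concurrency, so parallelism stays under your API's rate limit.
// Failures are captured per item instead of aborting the whole batch.
async function processPool(items, worker, limit = 3) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: JS is single-threaded)
      try {
        results[i] = { success: true, data: await worker(items[i]) };
      } catch (err) {
        results[i] = { success: false, error: err.message };
      }
    }
  }
  await Promise.all(Array.from({ length: limit }, lane));
  return results;
}

// Usage: worker would be your OCR/vision API call; a stand-in here
processPool([1, 2, 3, 4], async n => n * 2, 2)
  .then(r => console.log(r.map(x => x.data).join(','))); // 2,4,6,8
```

Results come back in input order regardless of which request finishes first, which keeps filenames and extracted data aligned.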
For large individual files:
- Increase timeout settings for extraction nodes
- Monitor memory usage on your n8n instance
- Consider splitting multi-page PDFs before processing
For timeout issues, see our timeout troubleshooting guide.
For patterns on processing large datasets, see our batch processing guide.
Troubleshooting PDF Extraction
These are the most common issues users face, with proven solutions.
Empty Extraction Results
Symptom: Extract from File returns empty text or {"text": ""}
Causes:
- Image-based PDF - Most common cause. Native extraction can't read image content.
  - Fix: Switch to OCR or AI vision method
- Password-protected PDF - Encrypted files fail silently
  - Fix: Remove password protection before processing, or use services that handle encrypted PDFs
- Corrupted file - Damaged during transfer
  - Fix: Re-download or request new copy
Diagnostic step: Open the PDF, try to select text. If you can't select individual words, it's image-based.
"Binary file 'data' not found" Error
Symptom: Node throws error about missing binary property
Causes:
- Property name mismatch - Binary named something other than "data"
  - Fix: Check source node output, match property name in Extract from File
- Binary data lost - Transform node dropped binary data
  - Fix: Ensure Edit Fields is configured to include other input fields instead of replacing them
- No binary data present - Source node didn't output file content
  - Fix: Verify source node configuration; check for API errors
For more on binary data handling, see the n8n binary data documentation.
Garbage Text or Wrong Characters
Symptom: Extraction returns random characters, symbols, or nonsensical text
Causes:
- Mixed PDF treated as text-only - Image portions extracted as garbage
  - Fix: Use AI vision for mixed documents
- Font encoding issues - PDF uses embedded fonts that don't map correctly
  - Fix: Try a different extraction method; AI vision often handles this better
- Scanned document at wrong angle - OCR struggles with rotated text
  - Fix: Pre-process to correct orientation
Timeout Errors on Large PDFs
Symptom: Node times out before completing extraction
Causes:
- Large file size - Multi-hundred-page documents take time
  - Fix: Increase node timeout, split document into chunks
- External service slow - OCR or AI API responding slowly
  - Fix: Check service status, increase timeout, implement retry logic
- Insufficient resources - n8n instance overloaded
  - Fix: Scale instance resources, use queue mode for distribution
Choosing the Right Approach: Decision Matrix
Use this matrix to select your extraction method:
| Document Characteristics | Recommended Method | Complexity | Cost |
|---|---|---|---|
| Text-based, simple layout | Native Extract from File | Low | Free |
| Text-based, need specific fields | Native + Code node regex | Low | Free |
| Scanned, text-only content | Cloud OCR API | Medium | Per-page |
| Scanned, high volume | Self-hosted Tesseract | Medium | Infrastructure |
| Tables that need structure | AI Vision | Medium-High | Per-request |
| Forms with labeled fields | AI Vision | Medium-High | Per-request |
| Mixed (text + images + tables) | AI Vision or combination | High | Per-request |
| Sensitive/regulated documents | Self-hosted OCR | Medium | Infrastructure |
Cost considerations:
- Native extraction - No additional cost, included in n8n
- Cloud OCR - Typically $0.001-0.01 per page
- AI Vision - Token-based pricing, varies by document complexity
- Self-hosted - Server costs only, no per-document fees
Accuracy vs simplicity tradeoff:
Start with the simplest method that meets your needs. Don't use AI vision for documents that native extraction handles fine. Scale up complexity only when simpler methods fail.
When to Get Expert Help
PDF extraction workflows range from simple to complex. Consider professional assistance when:
- Your documents span many formats requiring multiple extraction methods
- Accuracy requirements are high and errors are costly
- You need to process thousands of documents reliably
- Integration with downstream systems requires precise data structures
- Your team lacks bandwidth to build and maintain extraction workflows
Our workflow development service handles document processing automation. For strategic guidance on your extraction architecture, our consulting packages help design scalable solutions.
Frequently Asked Questions
Can n8n extract data from scanned PDFs?
Yes, but not with the basic Extract from File node. Scanned PDFs require OCR (Optical Character Recognition) to convert image content into text. You can integrate OCR services via HTTP Request nodes or use community OCR nodes. Cloud OCR APIs like those from Google, AWS, or Mistral provide high accuracy. For privacy-sensitive documents, self-hosted Tesseract offers on-premise processing.
What's the difference between OCR and AI vision extraction?
OCR converts images of text into character data. It reads the text but doesn't understand document structure. Tables come out as unstructured text lines.
AI vision models "see" the document holistically. They understand that data in the top-right is probably the invoice number, items in rows belong together, and the number at the bottom is the total. Vision AI returns structured data matching how humans read documents.
Use OCR when you just need the text. Use AI vision when structure and relationships matter.
How do I extract tables from PDFs in n8n?
Native PDF extraction flattens tables into lines of text, losing row/column structure. For actual tabular data:
- AI vision method - Send the PDF page as an image to a vision-capable AI. Prompt it to extract the table as a JSON array of objects. This preserves the row/column relationships.
- Specialized services - Some OCR providers (AWS Textract, Google Document AI) have dedicated table extraction features that return structured data.
- Post-processing - If tables have consistent formatting, you can sometimes reconstruct structure from text using pattern matching in a Code node, but this is fragile.
Why does my PDF extraction return empty results?
The most common cause is using native extraction on a scanned (image-based) PDF. The Extract from File node reads text character data. If the PDF is actually an image of text, there's no character data to read.
Quick diagnosis: Open the PDF and try to select text. If you can't highlight individual words, the PDF is image-based and requires OCR.
Other causes include password-protected files (extraction fails silently), corrupted documents, or binary property name mismatches in your workflow configuration.
How do I handle password-protected PDFs?
n8n's native PDF extraction doesn't handle password protection. Options include:
- Remove protection first - If you have the password, use PDF tools or services to decrypt before sending to n8n
- API services - Some OCR and document processing APIs accept password as a parameter
- Pre-processing - Batch-decrypt documents before they enter your workflow
Password protection often exists for security reasons. Ensure you have proper authorization before bypassing document protection in automated workflows.