Infrastructure
Cloudflare Workers for Web Scraping: A Developer Guide
Build scalable, serverless web scrapers using Cloudflare Workers. Learn how to leverage edge computing, browser rendering, and distributed data extraction.
Why Cloudflare Workers for Web Scraping?
Cloudflare Workers offers a unique platform for web scraping that combines the power of serverless computing with edge deployment. With data centers in over 300 locations worldwide, your scrapers run close to your targets, reducing latency and improving reliability.
Global Edge Network
Deploy scrapers to 300+ edge locations worldwide. Execute requests from locations close to your target servers for optimal performance.
Browser Rendering
Cloudflare's Browser Rendering Service (controlled via Puppeteer) handles JavaScript-heavy sites at the edge.
Serverless Architecture
No server management. Auto-scaling handles traffic spikes automatically. Pay only for what you use.
Integrated Storage
D1 database, R2 object storage, and KV storage are built-in. Store scraped data without external dependencies.
Architecture Overview
A typical Cloudflare Workers scraping architecture consists of several components working together:
HTTP Handler Worker
Receives scraping requests, validates authentication, and queues jobs for processing.
Browser Rendering Service
Renders JavaScript-heavy pages using headless Chrome, returning the final HTML.
Extraction Worker
Processes rendered HTML and uses AI to extract structured data.
Storage Layer
Stores extracted data in D1, R2, or KV. Maintains job status and audit trails.
Webhook Delivery
Notifies your application when extraction is complete via signed webhooks.
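The signed-webhook step can be built directly on the Web Crypto API that the Workers runtime exposes. Below is a minimal sketch using HMAC-SHA256; the helper and the header name are illustrative assumptions, not a fixed standard:
// Hypothetical helper: sign a webhook payload with HMAC-SHA256 via Web Crypto
async function sendSignedWebhook(url: string, payload: unknown, secret: string) {
  const body = JSON.stringify(payload);

  const key = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(secret),
    { name: 'HMAC', hash: 'SHA-256' },
    false,
    ['sign']
  );
  const sig = await crypto.subtle.sign('HMAC', key, new TextEncoder().encode(body));
  const hex = [...new Uint8Array(sig)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');

  // The receiver recomputes the HMAC over the raw body and compares signatures
  await fetch(url, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-webhook-signature': hex // assumed header name
    },
    body
  });
}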
Cloudflare Browser Rendering Service
The Browser Rendering Service (BRS) is Cloudflare's solution for running headless Chrome at the edge. It's essential for scraping modern websites that rely heavily on JavaScript.
Using Browser Rendering in a Worker
import puppeteer from '@cloudflare/puppeteer';

export async function onRequest(context) {
  const { env } = context;

  // Launch a browser session via the Browser Rendering binding
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();

  try {
    // Navigate to the target page and wait for the network to settle
    await page.goto('https://example.com/product', {
      waitUntil: 'networkidle0'
    });

    // Wait for specific content to load
    await page.waitForSelector('.product-details');

    // Extract data
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1')?.textContent,
        price: document.querySelector('.price')?.textContent,
        description: document.querySelector('.description')?.textContent
      };
    });

    return Response.json(data);
  } finally {
    await browser.close();
  }
}
Browser Rendering Limits
BRS has usage limits including concurrent connections and execution time. For high-volume scraping, implement a queue system and process jobs asynchronously.
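One common pattern is Cloudflare Queues: the HTTP-facing Worker enqueues scrape jobs and returns immediately, while a consumer Worker drains the queue at a controlled pace. A minimal producer sketch, assuming a queue bound as SCRAPE_QUEUE (the binding name and job shape are illustrative):
interface Env {
  SCRAPE_QUEUE: Queue; // Queues producer binding (assumed name)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const job = await request.json(); // e.g. { url, fields }

    // Enqueue instead of scraping inline, so the handler stays fast
    // and browser-rendering concurrency limits are respected
    await env.SCRAPE_QUEUE.send(job);

    return Response.json({ status: 'queued' }, { status: 202 });
  }
};
A matching consumer sketch appears in the best-practices section below.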
Building a Scraper with AI Extraction
Combining Browser Rendering with AI extraction creates a powerful scraping pipeline. Here's how to build it:
Worker Implementation
import puppeteer from '@cloudflare/puppeteer';

interface Env {
  BROWSER: Fetcher;
  AI: Ai;
  DB: D1Database;
  API_KEY: string;
}

interface ScrapingRequest {
  url: string;
  fields: string[];
  instructions?: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Verify API key
    const apiKey = request.headers.get('x-api-key');
    if (apiKey !== env.API_KEY) {
      return new Response('Unauthorized', { status: 401 });
    }

    // Parse request
    const { url, fields, instructions }: ScrapingRequest = await request.json();

    // Render page with browser
    const browser = await puppeteer.launch(env.BROWSER);
    const page = await browser.newPage();

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      await new Promise((r) => setTimeout(r, 2000)); // Allow JS to execute

      // Get page content
      const content = await page.evaluate(() => document.body.innerText);

      // Extract using Workers AI (any text-generation model works here)
      const prompt = `Extract the following fields from this page content: ${fields.join(', ')}.
${instructions ?? ''}
Return the result as JSON only.

Page content:
${content.slice(0, 10000)}`; // Stay under the model's context limit

      const aiResponse = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
        prompt
      });

      // Store result (assumes a scraping_jobs table already exists)
      await env.DB.prepare(
        'INSERT INTO scraping_jobs (url, result, created_at) VALUES (?, ?, ?)'
      ).bind(url, JSON.stringify(aiResponse), Date.now()).run();

      return Response.json(aiResponse);
    } finally {
      await browser.close();
    }
  }
};
Storage Options for Scraped Data
Cloudflare offers multiple storage options, each suited for different use cases:
D1 Database
SQLite-based database ideal for structured data. Perfect for storing scraped products, articles, or business listings.
// Store scraped data in D1
await env.DB.prepare(`
  INSERT INTO products (title, price, url, scraped_at)
  VALUES (?, ?, ?, ?)
`).bind(
  data.title,
  data.price,
  url,
  Date.now()
).run();

// Query data
const results = await env.DB.prepare(
  'SELECT * FROM products WHERE price < ? ORDER BY price DESC'
).bind(100).all();
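The snippet above assumes the products table already exists. A minimal sketch of creating it, for example from a one-off setup route or a D1 migration (the column types are assumptions):
// Hypothetical schema matching the INSERT above
await env.DB.exec(
  'CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, price REAL, url TEXT, scraped_at INTEGER)'
);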
R2 Object Storage
S3-compatible object storage. Best for storing raw HTML, screenshots, or large volumes of scraped content.
// Store raw HTML for later processing
const htmlKey = 'scrapes/' + date + '/' + encodeURIComponent(targetUrl) + '.html';
await env.R2.put(htmlKey, htmlContent, {
  customMetadata: {
    url: targetUrl,
    scrapedAt: new Date().toISOString(),
    status: 'success'
  }
});

// Retrieve later
const stored = await env.R2.get(htmlKey);
const content = await stored?.text(); // stored is null if the key is missing
KV Storage
Low-latency key-value store with global replication. Ideal for caching, rate limiting, and job status tracking.
// Track scraping job status
await env.JOBS.put(jobId, JSON.stringify({
  status: 'processing',
  startedAt: Date.now()
}), { expirationTtl: 3600 });

// Implement rate limiting (approximate: KV reads are eventually consistent)
const key = `ratelimit:${apiKey}`;
const count = await env.RATE_LIMIT.get(key);
if (count && parseInt(count) > 100) {
  return new Response('Rate limit exceeded', { status: 429 });
}
await env.RATE_LIMIT.put(key, (parseInt(count || '0') + 1).toString(), {
  expirationTtl: 60
});
Implementing Scheduled Scraping
Use Cloudflare Cron Triggers to run scrapers on a schedule. Perfect for periodic data collection like price monitoring or news aggregation.
// wrangler.toml
// [triggers]
// crons = ["0 * * * *"]  # Every hour

import puppeteer from '@cloudflare/puppeteer';

export interface Env {
  BROWSER: Fetcher;
  AI: Ai;
  R2: R2Bucket;
}

export default {
  async scheduled(event: ScheduledEvent, env: Env): Promise<void> {
    const urls = [
      'https://example.com/products/1',
      'https://example.com/products/2',
      'https://example.com/products/3'
    ];

    for (const url of urls) {
      try {
        const data = await scrapeUrl(url, env);

        // Store with timestamp
        await env.R2.put(
          `products/${Date.now()}/${encodeURIComponent(url)}.json`,
          JSON.stringify(data)
        );

        // Optional: Send webhook
        await fetch('https://your-app.com/webhook', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ url, data })
        });
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error);
      }
    }
  }
};

async function scrapeUrl(url: string, env: Env): Promise<unknown> {
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0' });
    const content = await page.evaluate(() => document.body.innerText);

    // Use Workers AI to extract structured data
    const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: `Extract product title, price, and availability as JSON from: ${content.slice(0, 5000)}`
    });
    return result;
  } finally {
    await browser.close();
  }
}
Best Practices for Production Scrapers
1. Implement Retry Logic
Network requests fail. Browser rendering times out. Implement exponential backoff for resilience.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s, ...
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error('Max retries exceeded');
}
2. Use Queues for High Volume
For large-scale scraping, use Cloudflare Queues to decouple request ingestion from processing.
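Continuing the producer sketch from the Browser Rendering section, a consumer Worker might look like this (it reuses the scrapeUrl helper from the scheduled example; the message shape is an assumption):
export default {
  async queue(batch: MessageBatch<{ url: string }>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        await scrapeUrl(message.body.url, env);
        message.ack();   // processed successfully: remove from the queue
      } catch {
        message.retry(); // redeliver later per the queue's retry policy
      }
    }
  }
};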
3. Monitor and Alert
Integrate with monitoring services to track success rates, response times, and error patterns.
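One lightweight option is Workers Analytics Engine, which records metrics without an external service. A minimal sketch, assuming a dataset bound as SCRAPE_METRICS:
// Record one data point per scrape attempt (SCRAPE_METRICS is an assumed
// AnalyticsEngineDataset binding configured in wrangler.toml)
function recordScrape(env: Env, url: string, ok: boolean, durationMs: number) {
  env.SCRAPE_METRICS.writeDataPoint({
    blobs: [url, ok ? 'success' : 'error'], // string dimensions
    doubles: [durationMs],                  // numeric metrics
    indexes: [new URL(url).hostname]        // sampling/partition key
  });
}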
4. Respect Rate Limits
Implement delays between requests and respect target servers. Use KV for distributed rate limiting.
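For the delay part, a small jittered helper keeps requests from arriving in regular bursts; a minimal sketch:
// Wait a base interval plus random jitter between requests
async function politeDelay(baseMs = 1000): Promise<void> {
  const jitter = Math.random() * baseMs;
  await new Promise((resolve) => setTimeout(resolve, baseMs + jitter));
}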
5. Handle Authentication
For sites requiring authentication, pass cookies and headers through the browser context.
await page.setCookie({
  name: 'session',
  value: sessionToken,
  domain: 'example.com'
});

await page.setExtraHTTPHeaders({
  'Authorization': `Bearer ${token}`
});
Using RobotScraping.com Instead
Building and maintaining your own Cloudflare Workers scraping infrastructure is complex. RobotScraping.com handles all the complexity for you:
- No infrastructure to manage: We handle browser rendering, AI extraction, and storage
- Built-in proxy rotation: Residential and datacenter proxies included
- Scheduled scraping: Built-in cron with webhook delivery
- Durable storage: Your data is stored with full audit trails
- Simple API: One endpoint for all your scraping needs
// Compare: RobotScraping.com API (much simpler)
const response = await fetch('https://api.robotscraping.com/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.API_KEY
  },
  body: JSON.stringify({
    url: 'https://example.com/product',
    fields: ['title', 'price', 'availability']
  })
});

const data = await response.json();
// Done! No Workers to deploy, no infrastructure to manage.