
Infrastructure

Cloudflare Workers for Web Scraping: A Developer Guide

Build scalable, serverless web scrapers using Cloudflare Workers. Learn how to leverage edge computing, browser rendering, and distributed data extraction.

18 min read · Updated January 15, 2025

Why Cloudflare Workers for Web Scraping?

Cloudflare Workers offers a unique platform for web scraping that combines the power of serverless computing with edge deployment. With data centers in over 300 locations worldwide, your scrapers run close to your targets, reducing latency and improving reliability.

Global Edge Network

Deploy scrapers to 300+ edge locations worldwide. Execute requests from locations close to your target servers for optimal performance.

Browser Rendering

Cloudflare Browser Rendering Service (based on Puppeteer) handles JavaScript-heavy sites at the edge.

Serverless Architecture

No server management. Auto-scaling handles traffic spikes automatically. Pay only for what you use.

Integrated Storage

D1 database, R2 object storage, and KV storage are built-in. Store scraped data without external dependencies.

Architecture Overview

A typical Cloudflare Workers scraping architecture consists of several components working together:

1. HTTP Handler Worker

Receives scraping requests, validates authentication, and queues jobs for processing.

2. Browser Rendering Service

Renders JavaScript-heavy pages using headless Chrome, returning the final HTML.

3. Extraction Worker

Processes rendered HTML and uses AI to extract structured data.

4. Storage Layer

Stores extracted data in D1, R2, or KV. Maintains job status and audit trails.

5. Webhook Delivery

Notifies your application when extraction is complete via signed webhooks.
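
Each component maps to a binding in the Worker's configuration. As a rough sketch, the wiring in wrangler.toml might look like the following (binding and resource names are illustrative, and the IDs are placeholders):

name = "scraper"
main = "src/index.ts"
compatibility_date = "2025-01-15"

# Browser Rendering binding, used with @cloudflare/puppeteer
browser = { binding = "BROWSER" }

# D1 for structured results
[[d1_databases]]
binding = "DB"
database_name = "scraping"
database_id = "<your-d1-database-id>"

# R2 for raw HTML and screenshots
[[r2_buckets]]
binding = "R2"
bucket_name = "scrapes"

# KV for job status and rate limiting
[[kv_namespaces]]
binding = "JOBS"
id = "<your-kv-namespace-id>"

# Queue connecting the HTTP handler to the extraction worker
[[queues.producers]]
binding = "SCRAPE_QUEUE"
queue = "scrape-jobs"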

Cloudflare Browser Rendering Service

The Browser Rendering Service (BRS) is Cloudflare's solution for running headless Chrome at the edge. It's essential for scraping modern websites that rely heavily on JavaScript.

Using Browser Rendering in a Worker

import puppeteer from '@cloudflare/puppeteer';

export async function onRequest(context) {
  const { env } = context;

  // Launch a browser session through the Browser Rendering binding
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();

  try {
    // Navigate to the target page and wait for the network to settle
    await page.goto('https://example.com/product', {
      waitUntil: 'networkidle0'
    });

    // Wait for specific content to load
    await page.waitForSelector('.product-details');

    // Extract data
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1')?.textContent,
        price: document.querySelector('.price')?.textContent,
        description: document.querySelector('.description')?.textContent
      };
    });

    return Response.json(data);
  } finally {
    await browser.close();
  }
}

Browser Rendering Limits

BRS enforces usage limits, including caps on concurrent browser sessions and per-session execution time. For high-volume scraping, reuse sessions where possible and push jobs through a queue so they are processed asynchronously.
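
One way to stay under the concurrency cap is to reuse idle browser sessions rather than launching a new one per request. A minimal sketch, assuming the session-listing API exposed by @cloudflare/puppeteer:

import puppeteer from '@cloudflare/puppeteer';

async function getBrowser(env) {
  // List sessions currently open on the Browser Rendering binding;
  // a session without a connectionId has no client attached to it.
  const sessions = await puppeteer.sessions(env.BROWSER);
  const free = sessions.find((s) => !s.connectionId);

  if (free) {
    // Reconnect to an idle session instead of starting a new browser
    return puppeteer.connect(env.BROWSER, free.sessionId);
  }
  return puppeteer.launch(env.BROWSER);
}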

Building a Scraper with AI Extraction

Combining Browser Rendering with AI extraction creates a powerful scraping pipeline. Here's how to build it:

Worker Implementation

import puppeteer from '@cloudflare/puppeteer';

interface Env {
  BROWSER: Fetcher;
  AI: Ai;
  DB: D1Database;
  API_KEY: string;
}

interface ScrapingRequest {
  url: string;
  fields: string[];
  instructions?: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Verify API key
    const apiKey = request.headers.get('x-api-key');
    if (apiKey !== env.API_KEY) {
      return new Response('Unauthorized', { status: 401 });
    }

    // Parse request
    const { url, fields, instructions }: ScrapingRequest = await request.json();

    // Render page with browser
    const browser = await puppeteer.launch(env.BROWSER);
    const page = await browser.newPage();

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      await new Promise((r) => setTimeout(r, 2000)); // Allow client-side JS to finish

      // Get page content
      const content = await page.evaluate(() => {
        return document.body.innerText;
      });

      // Extract using Workers AI (any text-generation model works here)
      const prompt = `Extract the following fields from this page content: ${fields.join(', ')}.
${instructions ? instructions : ''}
Return the result as JSON only, with no surrounding text.

Page content:
${content.slice(0, 10000)}`; // Stay within the model's context window

      const aiResponse = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
        prompt
      });

      // Text-generation models return { response: string }
      const extracted = JSON.parse(aiResponse.response);

      // Store result
      await env.DB.prepare(
        'INSERT INTO scraping_jobs (url, result, created_at) VALUES (?, ?, ?)'
      ).bind(url, JSON.stringify(extracted), Date.now()).run();

      return Response.json(extracted);
    } finally {
      await browser.close();
    }
  }
};
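
Calling the deployed Worker is then a single request (the workers.dev URL and API key below are placeholders):

// Call the deployed scraping Worker
const res = await fetch('https://scraper.example.workers.dev', {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'x-api-key': 'your-api-key'
  },
  body: JSON.stringify({
    url: 'https://example.com/product',
    fields: ['title', 'price', 'description'],
    instructions: 'Return price as a number without a currency symbol.'
  })
});

const extracted = await res.json();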

Storage Options for Scraped Data

Cloudflare offers multiple storage options, each suited for different use cases:

D1 Database

SQLite-based database ideal for structured data. Perfect for storing scraped products, articles, or business listings.

// Store scraped data in D1
await env.DB.prepare(`
  INSERT INTO products (title, price, url, scraped_at)
  VALUES (?, ?, ?, ?)
`).bind(
  data.title,
  data.price,
  url,
  Date.now()
).run();

// Query data
const results = await env.DB.prepare(
  'SELECT * FROM products WHERE price < ? ORDER BY price DESC'
).bind(100).all();
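
The queries above assume a products table already exists. A minimal schema, created once (for example through a D1 migration; column names mirror the insert above, and types are illustrative):

// One-time setup, e.g. in a migration or an admin-only route
await env.DB.prepare(`
  CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    price REAL,
    url TEXT NOT NULL,
    scraped_at INTEGER NOT NULL
  )
`).run();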

R2 Object Storage

S3-compatible object storage. Best for storing raw HTML, screenshots, or large volumes of scraped content.

// Store raw HTML for later processing
const htmlKey = `scrapes/${new Date().toISOString().slice(0, 10)}/${encodeURIComponent(targetUrl)}.html`;
await env.R2.put(htmlKey, htmlContent, {
  customMetadata: {
    url: targetUrl,
    scrapedAt: new Date().toISOString(),
    status: 'success'
  }
});

// Retrieve later
const stored = await env.R2.get(htmlKey);
const content = await stored.text();

KV Storage

Low-latency key-value store with global replication. Ideal for caching, rate limiting, and job status tracking.

// Track scraping job status (entry expires after an hour)
await env.JOBS.put(jobId, JSON.stringify({
  status: 'processing',
  startedAt: Date.now()
}), { expirationTtl: 3600 });

// Implement rate limiting (approximate: KV is eventually consistent,
// so concurrent requests may slightly exceed the limit)
const key = `ratelimit:${apiKey}`;
const count = await env.RATE_LIMIT.get(key);
if (count && parseInt(count) > 100) {
  return new Response('Rate limit exceeded', { status: 429 });
}
await env.RATE_LIMIT.put(key, (parseInt(count || '0') + 1).toString(), {
  expirationTtl: 60
});

Implementing Scheduled Scraping

Use Cloudflare Cron Triggers to run scrapers on a schedule. Perfect for periodic data collection like price monitoring or news aggregation.

// wrangler.toml
// [triggers]
// crons = ["0 * * * *"] // Every hour

import puppeteer from '@cloudflare/puppeteer';

export interface Env {
  BROWSER: Fetcher;
  AI: Ai;
  R2: R2Bucket;
}

export default {
  async scheduled(event: ScheduledEvent, env: Env): Promise<void> {
    const urls = [
      'https://example.com/products/1',
      'https://example.com/products/2',
      'https://example.com/products/3'
    ];

    for (const url of urls) {
      try {
        const data = await scrapeUrl(url, env);

        // Store with timestamp
        await env.R2.put(
          `products/${Date.now()}/${encodeURIComponent(url)}.json`,
          JSON.stringify(data)
        );

        // Optional: Send webhook
        await fetch('https://your-app.com/webhook', {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify({ url, data })
        });
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error);
      }
    }
  }
};

async function scrapeUrl(url: string, env: Env): Promise<any> {
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();

  try {
    await page.goto(url, { waitUntil: 'networkidle0' });
    const content = await page.evaluate(() => document.body.innerText);

    // Use Workers AI to extract structured data
    const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: `Extract product title, price, and availability as JSON from: ${content.slice(0, 5000)}`
    });

    return JSON.parse(result.response);
  } finally {
    await browser.close();
  }
}
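
The webhook above is sent unsigned, while the architecture section calls for signed webhooks so receivers can verify the payload. A sketch using the WebCrypto API available in Workers (the shared secret and header name are illustrative):

// Sign the payload with HMAC-SHA256 so the receiver can verify it
async function sendSignedWebhook(url: string, payload: unknown, secret: string) {
  const body = JSON.stringify(payload);

  // Import the shared secret as an HMAC key
  const key = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(secret),
    { name: 'HMAC', hash: 'SHA-256' },
    false,
    ['sign']
  );

  const sig = await crypto.subtle.sign('HMAC', key, new TextEncoder().encode(body));
  const hex = [...new Uint8Array(sig)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');

  return fetch(url, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-signature': `sha256=${hex}`
    },
    body
  });
}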

Best Practices for Production Scrapers

1. Implement Retry Logic

Network requests fail. Browser rendering times out. Implement exponential backoff for resilience.

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
  throw new Error('Max retries exceeded');
}
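
In practice you wrap the flaky step, typically navigation:

// Retry page navigation, the most common point of failure
const html = await retryWithBackoff(async () => {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content();
  } finally {
    await page.close();
  }
});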

2. Use Queues for High Volume

For large-scale scraping, use Cloudflare Queues to decouple request ingestion from processing.
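
A minimal producer/consumer sketch, reusing the Env and ScrapingRequest types from earlier and assuming a SCRAPE_QUEUE binding declared in wrangler.toml:

// Producer: enqueue jobs instead of scraping on the request path
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const job: ScrapingRequest = await request.json();
    await env.SCRAPE_QUEUE.send(job);
    return Response.json({ status: 'queued' });
  },

  // Consumer: process jobs in batches, off the request path
  async queue(batch: MessageBatch<ScrapingRequest>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        await scrapeUrl(message.body.url, env);
        message.ack();
      } catch {
        message.retry(); // Redelivered later
      }
    }
  }
};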

3. Monitor and Alert

Integrate with monitoring services to track success rates, response times, and error patterns.
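
Within the Worker itself, one built-in option is Workers Analytics Engine. A minimal sketch, assuming a dataset binding named SCRAPE_METRICS:

// Record one data point per scrape attempt
env.SCRAPE_METRICS.writeDataPoint({
  blobs: [url, success ? 'ok' : 'error'], // Dimensions to filter by
  doubles: [durationMs],                  // Numeric measurements
  indexes: [jobId]                        // Sampling key
});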

4. Respect Rate Limits

Implement delays between requests and respect target servers. Use KV for distributed rate limiting.

5. Handle Authentication

For sites requiring authentication, pass cookies and headers through the browser context.

await page.setCookie({
  name: 'session',
  value: sessionToken,
  domain: 'example.com'
});

await page.setExtraHTTPHeaders({
  'Authorization': `Bearer ${token}`
});

Using RobotScraping.com Instead

Building and maintaining your own Cloudflare Workers scraping infrastructure is complex. RobotScraping.com handles all the complexity for you:

  • No infrastructure to manage: We handle browser rendering, AI extraction, and storage
  • Built-in proxy rotation: Residential and datacenter proxies included
  • Scheduled scraping: Built-in cron with webhook delivery
  • Durable storage: Your data is stored with full audit trails
  • Simple API: One endpoint for all your scraping needs

// Compare: RobotScraping.com API (much simpler)
const response = await fetch('https://api.robotscraping.com/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.API_KEY
  },
  body: JSON.stringify({
    url: 'https://example.com/product',
    fields: ['title', 'price', 'availability']
  })
});

const data = await response.json();
// Done! No Workers to deploy, no infrastructure to manage.
