Infrastructure
Cloudflare Workers for Web Scraping: A Developer Guide
Build scalable, serverless web scrapers using Cloudflare Workers. Learn how to leverage edge computing, browser rendering, and distributed data extraction.
Why Cloudflare Workers for Web Scraping?
Cloudflare Workers offers a unique platform for web scraping that combines the power of serverless computing with edge deployment. With data centers in over 300 locations worldwide, your scrapers run close to your targets, reducing latency and improving reliability.
Global Edge Network
Deploy scrapers to 300+ edge locations worldwide. Execute requests from locations close to your target servers for optimal performance.
Browser Rendering
Cloudflare's Browser Rendering Service (controlled via Puppeteer) handles JavaScript-heavy sites at the edge.
Serverless Architecture
No server management. Auto-scaling handles traffic spikes automatically. Pay only for what you use.
Integrated Storage
D1 database, R2 object storage, and KV storage are built-in. Store scraped data without external dependencies.
Architecture Overview
A typical Cloudflare Workers scraping architecture consists of several components working together:
HTTP Handler Worker
Receives scraping requests, validates authentication, and queues jobs for processing.
Browser Rendering Service
Renders JavaScript-heavy pages using headless Chrome, returning the final HTML.
Extraction Worker
Processes rendered HTML and uses AI to extract structured data.
Storage Layer
Stores extracted data in D1, R2, or KV. Maintains job status and audit trails.
Webhook Delivery
Notifies your application when extraction is complete via signed webhooks.
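The signed-webhook step can be built directly on the Web Crypto API that the Workers runtime exposes. Below is a minimal sketch using HMAC-SHA256; the helper and the header name are illustrative assumptions, not a fixed standard:
// Hypothetical helper: sign a webhook payload with HMAC-SHA256 via Web Crypto
async function sendSignedWebhook(url: string, payload: unknown, secret: string) {
  const body = JSON.stringify(payload);

  const key = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(secret),
    { name: 'HMAC', hash: 'SHA-256' },
    false,
    ['sign']
  );
  const sig = await crypto.subtle.sign('HMAC', key, new TextEncoder().encode(body));
  const hex = [...new Uint8Array(sig)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');

  // The receiver recomputes the HMAC over the raw body and compares signatures
  await fetch(url, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-webhook-signature': hex // assumed header name
    },
    body
  });
}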
Cloudflare Browser Rendering Service
The Browser Rendering Service (BRS) is Cloudflare's solution for running headless Chrome at the edge. It's essential for scraping modern websites that rely heavily on JavaScript.
Using Browser Rendering in a Worker
import puppeteer from '@cloudflare/puppeteer';

export async function onRequest(context) {
  const { env } = context;

  // Launch a browser session via the Browser Rendering binding
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();

  try {
    // Navigate to the target page and wait for the network to settle
    await page.goto('https://example.com/product', {
      waitUntil: 'networkidle0'
    });

    // Wait for specific content to load
    await page.waitForSelector('.product-details');

    // Extract data
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1')?.textContent,
        price: document.querySelector('.price')?.textContent,
        description: document.querySelector('.description')?.textContent
      };
    });

    return Response.json(data);
  } finally {
    await browser.close();
  }
}
Browser Rendering Limits
BRS has usage limits including concurrent connections and execution time. For high-volume scraping, implement a queue system and process jobs asynchronously.
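One common pattern is Cloudflare Queues: the HTTP-facing Worker enqueues scrape jobs and returns immediately, while a consumer Worker drains the queue at a controlled pace. A minimal producer sketch, assuming a queue bound as SCRAPE_QUEUE (the binding name and job shape are illustrative):
interface Env {
  SCRAPE_QUEUE: Queue; // Queues producer binding (assumed name)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const job = await request.json(); // e.g. { url, fields }

    // Enqueue instead of scraping inline, so the handler stays fast
    // and browser-rendering concurrency limits are respected
    await env.SCRAPE_QUEUE.send(job);

    return Response.json({ status: 'queued' }, { status: 202 });
  }
};
A matching consumer sketch appears in the best-practices section below.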
Building a Scraper with AI Extraction
Combining Browser Rendering with AI extraction creates a powerful scraping pipeline. Here's how to build it:
Worker Implementation
import puppeteer from '@cloudflare/puppeteer';

interface Env {
  BROWSER: Fetcher;
  AI: Ai;
  DB: D1Database;
  API_KEY: string;
}

interface ScrapingRequest {
  url: string;
  fields: string[];
  instructions?: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Verify API key
    const apiKey = request.headers.get('x-api-key');
    if (apiKey !== env.API_KEY) {
      return new Response('Unauthorized', { status: 401 });
    }

    // Parse request
    const { url, fields, instructions }: ScrapingRequest = await request.json();

    // Render page with browser
    const browser = await puppeteer.launch(env.BROWSER);
    const page = await browser.newPage();

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      await new Promise((r) => setTimeout(r, 2000)); // Allow JS to execute

      // Get page content
      const content = await page.evaluate(() => document.body.innerText);

      // Extract using Workers AI (any text-generation model works here)
      const prompt = `Extract the following fields from this page content: ${fields.join(', ')}.
${instructions ?? ''}
Return the result as JSON only.

Page content:
${content.slice(0, 10000)}`; // Stay under the model's context limit

      const aiResponse = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
        prompt
      });

      // Store result (assumes a scraping_jobs table already exists)
      await env.DB.prepare(
        'INSERT INTO scraping_jobs (url, result, created_at) VALUES (?, ?, ?)'
      ).bind(url, JSON.stringify(aiResponse), Date.now()).run();

      return Response.json(aiResponse);
    } finally {
      await browser.close();
    }
  }
};
Storage Options for Scraped Data
Cloudflare offers multiple storage options, each suited for different use cases:
D1 Database
SQLite-based database ideal for structured data. Perfect for storing scraped products, articles, or business listings.
// Store scraped data in D1
await env.DB.prepare(`
  INSERT INTO products (title, price, url, scraped_at)
  VALUES (?, ?, ?, ?)
`).bind(
  data.title,
  data.price,
  url,
  Date.now()
).run();

// Query data
const results = await env.DB.prepare(
  'SELECT * FROM products WHERE price < ? ORDER BY price DESC'
).bind(100).all();
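The snippet above assumes the products table already exists. A minimal sketch of creating it, for example from a one-off setup route or a D1 migration (the column types are assumptions):
// Hypothetical schema matching the INSERT above
await env.DB.exec(
  'CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, price REAL, url TEXT, scraped_at INTEGER)'
);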
R2 Object Storage
S3-compatible object storage. Best for storing raw HTML, screenshots, or large volumes of scraped content.
// Store raw HTML for later processing
const htmlKey = 'scrapes/' + date + '/' + encodeURIComponent(targetUrl) + '.html';
await env.R2.put(htmlKey, htmlContent, {
  customMetadata: {
    url: targetUrl,
    scrapedAt: new Date().toISOString(),
    status: 'success'
  }
});

// Retrieve later
const stored = await env.R2.get(htmlKey);
const content = await stored?.text(); // stored is null if the key is missing
KV Storage
Low-latency key-value store with global replication. Ideal for caching, rate limiting, and job status tracking.
// Track scraping job status
await env.JOBS.put(jobId, JSON.stringify({
  status: 'processing',
  startedAt: Date.now()
}), { expirationTtl: 3600 });

// Implement rate limiting (approximate: KV reads are eventually consistent)
const key = `ratelimit:${apiKey}`;
const count = await env.RATE_LIMIT.get(key);
if (count && parseInt(count) > 100) {
  return new Response('Rate limit exceeded', { status: 429 });
}
await env.RATE_LIMIT.put(key, (parseInt(count || '0') + 1).toString(), {
  expirationTtl: 60
});
Implementing Scheduled Scraping
Use Cloudflare Cron Triggers to run scrapers on a schedule. Perfect for periodic data collection like price monitoring or news aggregation.
// wrangler.toml
// [triggers]
// crons = ["0 * * * *"]  # Every hour

import puppeteer from '@cloudflare/puppeteer';

export interface Env {
  BROWSER: Fetcher;
  AI: Ai;
  R2: R2Bucket;
}

export default {
  async scheduled(event: ScheduledEvent, env: Env): Promise<void> {
    const urls = [
      'https://example.com/products/1',
      'https://example.com/products/2',
      'https://example.com/products/3'
    ];

    for (const url of urls) {
      try {
        const data = await scrapeUrl(url, env);

        // Store with timestamp
        await env.R2.put(
          `products/${Date.now()}/${encodeURIComponent(url)}.json`,
          JSON.stringify(data)
        );

        // Optional: Send webhook
        await fetch('https://your-app.com/webhook', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ url, data })
        });
      } catch (error) {
        console.error(`Failed to scrape ${url}:`, error);
      }
    }
  }
};

async function scrapeUrl(url: string, env: Env): Promise<unknown> {
  const browser = await puppeteer.launch(env.BROWSER);
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'networkidle0' });
    const content = await page.evaluate(() => document.body.innerText);

    // Use Workers AI to extract structured data
    const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: `Extract product title, price, and availability as JSON from: ${content.slice(0, 5000)}`
    });
    return result;
  } finally {
    await browser.close();
  }
}
Best Practices for Production Scrapers
1. Implement Retry Logic
Network requests fail. Browser rendering times out. Implement exponential backoff for resilience.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s, ...
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw new Error('Max retries exceeded');
}
2. Use Queues for High Volume
For large-scale scraping, use Cloudflare Queues to decouple request ingestion from processing.
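Continuing the producer sketch from the Browser Rendering section, a consumer Worker might look like this (it reuses the scrapeUrl helper from the scheduled example; the message shape is an assumption):
export default {
  async queue(batch: MessageBatch<{ url: string }>, env: Env): Promise<void> {
    for (const message of batch.messages) {
      try {
        await scrapeUrl(message.body.url, env);
        message.ack();   // processed successfully: remove from the queue
      } catch {
        message.retry(); // redeliver later per the queue's retry policy
      }
    }
  }
};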
3. Monitor and Alert
Integrate with monitoring services to track success rates, response times, and error patterns.
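One lightweight option is Workers Analytics Engine, which records metrics without an external service. A minimal sketch, assuming a dataset bound as SCRAPE_METRICS:
// Record one data point per scrape attempt (SCRAPE_METRICS is an assumed
// AnalyticsEngineDataset binding configured in wrangler.toml)
function recordScrape(env: Env, url: string, ok: boolean, durationMs: number) {
  env.SCRAPE_METRICS.writeDataPoint({
    blobs: [url, ok ? 'success' : 'error'], // string dimensions
    doubles: [durationMs],                  // numeric metrics
    indexes: [new URL(url).hostname]        // sampling/partition key
  });
}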
4. Respect Rate Limits
Implement delays between requests and respect target servers. Use KV for distributed rate limiting.
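For the delay part, a small jittered helper keeps requests from arriving in regular bursts; a minimal sketch:
// Wait a base interval plus random jitter between requests
async function politeDelay(baseMs = 1000): Promise<void> {
  const jitter = Math.random() * baseMs;
  await new Promise((resolve) => setTimeout(resolve, baseMs + jitter));
}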
5. Handle Authentication
For sites requiring authentication, pass cookies and headers through the browser context.
await page.setCookie({
  name: 'session',
  value: sessionToken,
  domain: 'example.com'
});

await page.setExtraHTTPHeaders({
  'Authorization': `Bearer ${token}`
});
Using RobotScraping.com Instead
Building and maintaining your own Cloudflare Workers scraping infrastructure is complex. RobotScraping.com handles all the complexity for you:
- No infrastructure to manage: We handle browser rendering, AI extraction, and storage
- Built-in proxy rotation: Residential and datacenter proxies included
- Scheduled scraping: Built-in cron with webhook delivery
- Durable storage: Your data is stored with full audit trails
- Simple API: One endpoint for all your scraping needs
// Compare: RobotScraping.com API (much simpler)
const response = await fetch('https://api.robotscraping.com/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.API_KEY
  },
  body: JSON.stringify({
    url: 'https://example.com/product',
    fields: ['title', 'price', 'availability']
  })
});

const data = await response.json();
// Done! No Workers to deploy, no infrastructure to manage.