

Building a News Aggregator: Web Scraping Tutorial

Create automated news aggregation systems that extract articles, headlines, and metadata from multiple sources. Learn to build a comprehensive media monitoring solution.

16 min read · Updated January 15, 2025

Why Build a News Aggregator?

News aggregators have become essential tools for staying informed in an age of information overload. By consolidating content from multiple sources, they provide comprehensive coverage of topics that matter to you or your organization.

Brand Monitoring

Track mentions of your brand across news outlets and blogs. Respond to stories quickly and manage reputation.

Competitive Intelligence

Monitor competitor news, product launches, and strategic moves. Stay ahead of industry developments.

Industry Research

Aggregate news from industry publications to identify trends, opportunities, and emerging technologies.

Content Curation

Build niche news platforms focused on specific topics, regions, or industries. Provide value through curated content.

News Aggregator Architecture

A production news aggregator consists of several key components:

1. Source Management: Maintain a database of news sources with their URLs, RSS feeds, and update frequencies.

2. Headline Extraction: Scrape homepage and category pages to extract headlines, URLs, summaries, and metadata.

3. Full Article Fetching: Fetch complete article content from linked pages, extracting body text, author, images, and publication date.

4. Content Processing: Clean HTML, extract main content, categorize articles, and perform sentiment analysis.

5. Deduplication: Identify duplicate stories across sources using similarity matching and URL clustering.

6. Delivery & Alerts: Deliver relevant content via email digests, web feeds, or real-time alerts for breaking news.
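The six components can be wired into a single crawl cycle. Here is a minimal sketch of that loop; the functions passed in through `deps` are placeholders for the concrete implementations developed in the rest of this guide:

```javascript
// Minimal sketch of one crawl cycle wiring the six components together.
// Every dependency is injected so each piece can be developed and tested
// in isolation.
async function runAggregationCycle(deps) {
  const { getSourcesForCrawling, extractHeadlines, fetchArticle,
          isDuplicate, store, markCrawled } = deps;

  const sources = await getSourcesForCrawling();        // 1. source management
  for (const source of sources) {
    const headlines = await extractHeadlines(source);   // 2. headline extraction
    for (const headline of headlines) {
      if (await isDuplicate(headline.url)) continue;    // 5. deduplication
      const article = await fetchArticle(headline.url); // 3-4. fetch + process
      await store(source, article);                     // feeds 6. delivery
    }
    await markCrawled(source);
  }
}
```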

Extracting Headlines from News Sites

News sites vary widely in layout, but AI extraction handles this variability. Start by extracting headlines from the homepage or category pages.

Basic Headline Extraction

const response = await fetch('https://api.robotscraping.com/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.API_KEY
  },
  body: JSON.stringify({
    url: 'https://news.example.com/technology',
    fields: ['headlines'],
    instructions: 'Extract all news article headlines from this page. For each headline, return the title, the article URL, the publication date if shown, a brief summary or excerpt, and the category or section name. Return as an array of objects.'
  })
});

const data = await response.json();
// {
//   headlines: [
//     {
//       title: "Tech Giant Announces New AI Features",
//       url: "https://news.example.com/article1",
//       date: "January 15, 2025",
//       summary: "The company revealed...",
//       category: "Technology"
//     },
//     {
//       title: "Startup Raises $50M Series B",
//       url: "https://news.example.com/article2",
//       date: "January 15, 2025",
//       summary: "The funding will be used...",
//       category: "Business"
//     },
//     // ... more headlines
//   ]
// }

Handling Infinite Scroll

Many news sites use infinite scroll. Configure the scraper to wait for content to load or use pagination parameters.

body: JSON.stringify({
  url: 'https://news.example.com/technology',
  fields: ['headlines'],
  options: {
    waitUntil: 'networkidle',
    timeoutMs: 30000,
    waitForSelector: '.article-list'
  }
})
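When a site exposes numbered archive pages instead of (or alongside) infinite scroll, walking the pages is often simpler. A hedged sketch, assuming a `page` query parameter (the real parameter name varies by site), where `extract` is whatever per-page headline-extraction call you use:

```javascript
// Walk numbered category pages until the archive runs dry or a page
// budget is hit. The naive `?page=` concatenation assumes baseUrl has
// no query string of its own.
async function scrapeCategoryPages(baseUrl, maxPages, extract) {
  const all = [];
  for (let page = 1; page <= maxPages; page++) {
    const headlines = await extract(`${baseUrl}?page=${page}`);
    if (headlines.length === 0) break; // ran off the end of the archive
    all.push(...headlines);
  }
  return all;
}
```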

Extracting Full Article Content

Once you have headlines, fetch full articles for complete content. AI extraction intelligently identifies the main article content, excluding sidebars, ads, and navigation.

Article Extraction

const response = await fetch('https://api.robotscraping.com/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.API_KEY
  },
  body: JSON.stringify({
    url: 'https://news.example.com/article/tech-giant-ai',
    fields: [
      'article_title',
      'author',
      'publication_date',
      'last_updated',
      'article_body',
      'summary',
      'tags',
      'category',
      'image_url',
      'image_caption'
    ],
    instructions: 'Extract the main article content. Get the headline, byline with author name, publication and update dates, the full article text body (not including ads or sidebars), a brief summary, any topic tags or categories, and the main image with its caption.'
  })
});

const article = await response.json();
// {
//   article_title: "Tech Giant Announces New AI Features",
//   author: "Jane Smith",
//   publication_date: "2025-01-15",
//   last_updated: "2025-01-15 14:30",
//   article_body: "In a groundbreaking announcement...",
//   summary: "The company revealed new AI capabilities...",
//   tags: ["AI", "Technology", "Product Launch"],
//   category: "Technology",
//   image_url: "https://news.example.com/images/article1.jpg",
//   image_caption: "The new AI interface in action."
// }
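Note that dates come back in whatever format the source displays ("January 15, 2025" here, ISO-style elsewhere). Normalizing to ISO 8601 before storage keeps cross-source sorting sane. This sketch leans on JavaScript's built-in date parsing, which covers common formats but by no means all of them:

```javascript
// Normalize a scraped date string to ISO 8601, or null if unparseable
// (leave nulls for manual review rather than guessing).
function normalizeDate(raw) {
  if (!raw) return null;
  const parsed = new Date(raw);
  return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}
```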

Batch Article Processing

Use the batch endpoint to process multiple article URLs simultaneously.

const articleUrls = [
  'https://news.example.com/article1',
  'https://news.example.com/article2',
  'https://news.example.com/article3'
];

const response = await fetch('https://api.robotscraping.com/batch', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': process.env.API_KEY
  },
  body: JSON.stringify({
    urls: articleUrls,
    fields: ['article_title', 'article_body', 'author', 'publication_date'],
    async: true,
    webhook_url: 'https://your-app.com/webhook/articles'
  })
});

Managing Multiple News Sources

A robust aggregator monitors multiple sources simultaneously. Here's how to structure your source management:

Source Database Schema

-- News sources configuration
CREATE TABLE news_sources (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  base_url TEXT NOT NULL,
  category_pages TEXT, -- JSON array of category URLs
  rss_feeds TEXT, -- JSON array of RSS feed URLs
  update_frequency INTEGER, -- Minutes between checks
  is_active BOOLEAN,
  last_crawled_at TIMESTAMP
);

-- Extracted articles
CREATE TABLE articles (
  id INTEGER PRIMARY KEY,
  source_id INTEGER,
  url TEXT UNIQUE,
  title TEXT,
  author TEXT,
  publication_date TIMESTAMP,
  article_body TEXT,
  summary TEXT,
  category TEXT,
  tags TEXT, -- JSON array
  image_url TEXT,
  scraped_at TIMESTAMP,
  FOREIGN KEY (source_id) REFERENCES news_sources(id)
);

-- Article deduplication tracking
CREATE TABLE article_clusters (
  id INTEGER PRIMARY KEY,
  canonical_article_id INTEGER,
  cluster_signature TEXT,
  article_ids TEXT, -- JSON array of related article IDs
  created_at TIMESTAMP
);

Source Management Code

// Get sources that need to be crawled
async function getSourcesForCrawling(db) {
  const sources = await db.prepare(`
    SELECT * FROM news_sources
    WHERE is_active = true
      AND (
        last_crawled_at IS NULL
        OR datetime(last_crawled_at, '+' || update_frequency || ' minutes') < datetime('now')
      )
  `).all();
  return sources;
}

// Mark source as crawled
async function markSourceCrawled(db, sourceId) {
  await db.prepare(`
    UPDATE news_sources
    SET last_crawled_at = datetime('now')
    WHERE id = ?
  `).bind(sourceId).run();
}

// Store extracted articles
async function storeArticles(db, sourceId, articles) {
  for (const article of articles) {
    await db.prepare(`
      INSERT OR REPLACE INTO articles
      (source_id, url, title, author, publication_date,
       article_body, summary, category, tags, image_url, scraped_at)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    `).bind(
      sourceId,
      article.url,
      article.title,
      article.author,
      article.publication_date,
      article.article_body,
      article.summary,
      article.category,
      JSON.stringify(article.tags || []),
      article.image_url,
      new Date().toISOString() // ISO timestamp for the TIMESTAMP column
    ).run();
  }
}

Article Deduplication

The same story often appears across multiple outlets. Deduplication prevents showing duplicate content to users.

URL-Based Deduplication

// Simple URL normalization
function normalizeUrl(url) {
  return url
    .toLowerCase()
    .replace(/^https?:\/\//, '')
    .replace(/^www\./, '')
    .replace(/#.*$/, '')
    .replace(/\/$/, '');
}

// Check for duplicates before inserting. SQLite has no normalize_url()
// function, so compare normalized URLs in application code. In
// production, store the normalized URL in its own indexed column
// instead of scanning the table.
async function findDuplicateByUrl(db, url) {
  const normalized = normalizeUrl(url);
  const rows = await db.prepare(`SELECT id, url FROM articles`).all();
  return rows.find(row => normalizeUrl(row.url) === normalized) || null;
}

Content-Based Similarity

For stories with different URLs, use content similarity. This is more complex but catches syndicated content.

const crypto = require('node:crypto');

// Generate a simple content signature
function generateContentSignature(title, summary) {
  const words = (title + ' ' + summary)
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 3)
    .sort();

  // Create hash from sorted unique words
  return crypto.createHash('md5')
    .update([...new Set(words)].join(' '))
    .digest('hex');
}

// Find articles that share a cluster signature
async function findSimilarArticles(db, title, summary) {
  const signature = generateContentSignature(title, summary);
  return await db.prepare(`
    SELECT a.*, c.cluster_signature
    FROM articles a
    JOIN article_clusters c ON a.id = c.canonical_article_id
    WHERE c.cluster_signature = ?
  `).bind(signature).all();
}
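An exact-match signature only catches articles whose filtered word sets are identical. For lightly edited syndicated copy, a fuzzier comparison such as Jaccard similarity over title/summary word sets works better; a threshold around 0.7 is a reasonable starting point to tune against your own data:

```javascript
// Word-set tokenizer matching the signature function above: lowercase,
// strip punctuation, drop short words.
function tokenize(text) {
  return new Set(
    text.toLowerCase()
      .replace(/[^\w\s]/g, '')
      .split(/\s+/)
      .filter(w => w.length > 3)
  );
}

// Jaccard similarity: |intersection| / |union| of the two word sets,
// ranging from 0 (disjoint) to 1 (identical).
function jaccardSimilarity(a, b) {
  const setA = tokenize(a);
  const setB = tokenize(b);
  let intersection = 0;
  for (const w of setA) if (setB.has(w)) intersection++;
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
}
```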

Setting Up Scheduled Monitoring

News moves fast. Set up schedules to monitor sources frequently and receive content via webhooks.

// Create a monitoring schedule for each source
const sources = [
  { name: 'TechCrunch', url: 'https://techcrunch.com', categoryUrl: 'https://techcrunch.com/category/artificial-intelligence/' },
  { name: 'The Verge', url: 'https://theverge.com', categoryUrl: 'https://theverge.com/tech' },
  { name: 'Ars Technica', url: 'https://arstechnica.com', categoryUrl: 'https://arstechnica.com/gadgets/' }
];

for (const source of sources) {
  await fetch('https://api.robotscraping.com/schedules', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.API_KEY
    },
    body: JSON.stringify({
      name: 'news-' + source.name.toLowerCase(),
      cron: '0 */2 * * *', // Every 2 hours
      urls: [source.categoryUrl],
      config: {
        fields: ['headlines'],
        instructions: 'Extract headlines from ' + source.name + '. Return title, URL, date, and summary.',
        options: {
          waitUntil: 'domcontentloaded'
        }
      },
      webhook_url: 'https://your-app.com/webhook/news/' + source.name
    })
  });
}

Webhook Handler

app.post('/webhook/news/:source', async (req, res) => {
  const { source } = req.params;
  const { result } = req.body;

  // Verify webhook signature
  const signature = req.headers['x-webhook-signature'];
  if (!verifySignature(signature, req.body, WEBHOOK_SECRET)) {
    return res.status(401).send('Invalid signature');
  }

  // Get source ID
  const sourceRecord = await getSourceByName(source);
  if (!sourceRecord) {
    return res.status(404).send('Source not found');
  }

  // Process headlines
  const { headlines } = result;
  for (const headline of headlines) {
    // Check for duplicates
    const existing = await findDuplicateByUrl(db, headline.url);
    if (existing) continue;

    // Queue full article fetch
    await queueArticleFetch(sourceRecord.id, headline.url);
  }

  res.sendStatus(200);
});

Legal and Ethical Considerations

News aggregation has specific legal considerations that differ from other types of scraping:

  • Copyright: Full article content may be protected. Consider using headlines, summaries, and linking to original sources rather than reproducing entire articles.
  • Terms of Service: Many news sites explicitly prohibit scraping in their ToS. Review and respect these terms.
  • Attribution: Always clearly attribute content to the original source. Link back to the original article.
  • Paywalls: Respect paywall restrictions. Don't circumvent access controls for gated content.
  • Robots.txt: Check and respect robots.txt directives for each source.
  • Hotlinking: Don't hotlink images from news sites. Download and host images yourself or use thumbnails that link to the original.

Best Practice: Fair Use

Aggregate headlines and brief excerpts (typically 1-2 sentences) with clear attribution and links to the original source. This approach is more likely to qualify as fair use and provides value by organizing and filtering content rather than reproducing it.
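The excerpt rule is easy to enforce mechanically. A sketch with deliberately naive sentence splitting (periods, exclamation and question marks only, so abbreviations like "U.S." will over-split) plus a character cap as a backstop:

```javascript
// Trim article text to a brief excerpt: at most `maxSentences`
// sentences, hard-capped at `maxChars` characters with an ellipsis.
function makeExcerpt(text, maxSentences = 2, maxChars = 300) {
  const sentences = (text.match(/[^.!?]+[.!?]+/g) || [text])
    .map(s => s.trim());
  let excerpt = sentences.slice(0, maxSentences).join(' ');
  if (excerpt.length > maxChars) {
    excerpt = excerpt.slice(0, maxChars).trimEnd() + '…';
  }
  return excerpt;
}
```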

Using RobotScraping.com for News Aggregation

RobotScraping.com simplifies news aggregation with AI-powered extraction that handles different site layouts automatically.

// Complete news aggregation workflow
const newsSources = [
  'https://techcrunch.com/feed/',
  'https://www.theverge.com/rss/index.xml',
  'https://arstechnica.com/feed/'
];

// Create schedules for each source
for (const source of newsSources) {
  await fetch('https://api.robotscraping.com/schedules', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': process.env.API_KEY
    },
    body: JSON.stringify({
      name: 'news-' + new URL(source).hostname,
      cron: '0 */1 * * *', // Every hour
      urls: [source],
      config: {
        fields: ['title', 'link', 'description', 'pubDate', 'creator', 'categories'],
        options: {
          timeoutMs: 15000
        }
      },
      webhook_url: 'https://your-app.com/webhook/news'
    })
  });
}
