Use Cases
Building a News Aggregator: Web Scraping Tutorial
Create automated news aggregation systems that extract articles, headlines, and metadata from multiple sources. Learn to build a comprehensive media monitoring solution.
Why Build a News Aggregator?
News aggregators have become essential tools for staying informed in an age of information overload. By consolidating content from multiple sources, they provide comprehensive coverage of topics that matter to you or your organization.
Brand Monitoring
Track mentions of your brand across news outlets and blogs. Respond to stories quickly and manage reputation.
Competitive Intelligence
Monitor competitor news, product launches, and strategic moves. Stay ahead of industry developments.
Industry Research
Aggregate news from industry publications to identify trends, opportunities, and emerging technologies.
Content Curation
Build niche news platforms focused on specific topics, regions, or industries. Provide value through curated content.
News Aggregator Architecture
A production news aggregator consists of several key components:
Source Management
Maintain a database of news sources with their URLs, RSS feeds, and update frequencies.
Headline Extraction
Scrape homepage and category pages to extract headlines, URLs, summaries, and metadata.
Full Article Fetching
Fetch complete article content from linked pages, extracting body text, author, images, and publication date.
Content Processing
Clean HTML, extract main content, categorize articles, and perform sentiment analysis.
Deduplication
Identify duplicate stories across sources using similarity matching and URL clustering.
Delivery & Alerts
Deliver relevant content via email digests, web feeds, or real-time alerts for breaking news.
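Tied together, these components form a simple polling loop. A minimal sketch of that flow (all helper names here are illustrative placeholders, not part of any API):

```javascript
// Simplified aggregation cycle: poll each source, extract headlines,
// skip duplicates, store the rest, and alert on breaking stories.
// extractHeadlines, isDuplicate, store, and notify are placeholder hooks.
async function runAggregationCycle(sources, { extractHeadlines, isDuplicate, store, notify }) {
  for (const source of sources) {
    const headlines = await extractHeadlines(source);   // Headline Extraction
    for (const h of headlines) {
      if (await isDuplicate(h)) continue;               // Deduplication
      await store(source, h);                           // Content Processing & storage
      if (h.breaking) await notify(h);                  // Delivery & Alerts
    }
  }
}
```

Each hook maps to one of the components above, so individual stages can be swapped out (for example, replacing the dedup strategy) without touching the loop.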
Extracting Headlines from News Sites
News sites have different layouts, but AI extraction handles this variability. Start by extracting headlines from homepage or category pages.
Basic Headline Extraction
const response = await fetch('https://api.robotscraping.com/extract', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': process.env.API_KEY
},
body: JSON.stringify({
url: 'https://news.example.com/technology',
fields: ['headlines'],
instructions: 'Extract all news article headlines from this page. For each headline, return the title, the article URL, the publication date if shown, a brief summary or excerpt, and the category or section name. Return as an array of objects.'
})
});
const data = await response.json();
// {
// headlines: [
// {
// title: "Tech Giant Announces New AI Features",
// url: "https://news.example.com/article1",
// date: "January 15, 2025",
// summary: "The company revealed...",
// category: "Technology"
// },
// {
// title: "Startup Raises $50M Series B",
// url: "https://news.example.com/article2",
// date: "January 15, 2025",
// summary: "The funding will be used...",
// category: "Business"
// },
// // ... more headlines
// ]
// }
Handling Infinite Scroll
Many news sites use infinite scroll. Configure the scraper to wait for content to load or use pagination parameters.
body: JSON.stringify({
url: 'https://news.example.com/technology',
fields: ['headlines'],
options: {
waitUntil: 'networkidle',
timeoutMs: 30000,
waitForSelector: '.article-list'
}
})
Extracting Full Article Content
Once you have headlines, fetch full articles for complete content. AI extraction intelligently identifies the main article content, excluding sidebars, ads, and navigation.
Article Extraction
const response = await fetch('https://api.robotscraping.com/extract', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': process.env.API_KEY
},
body: JSON.stringify({
url: 'https://news.example.com/article/tech-giant-ai',
fields: [
'article_title',
'author',
'publication_date',
'last_updated',
'article_body',
'summary',
'tags',
'category',
'image_url',
'image_caption'
],
instructions: 'Extract the main article content. Get the headline, byline with author name, publication and update dates, the full article text body (not including ads or sidebars), a brief summary, any topic tags or categories, and the main image with its caption.'
})
});
const article = await response.json();
// {
// article_title: "Tech Giant Announces New AI Features",
// author: "Jane Smith",
// publication_date: "2025-01-15",
// last_updated: "2025-01-15 14:30",
// article_body: "In a groundbreaking announcement...",
// summary: "The company revealed new AI capabilities...",
// tags: ["AI", "Technology", "Product Launch"],
// category: "Technology",
// image_url: "https://news.example.com/images/article1.jpg",
// image_caption: "The new AI interface in action."
// }
Batch Article Processing
Use the batch endpoint to process multiple article URLs simultaneously.
const articleUrls = [
'https://news.example.com/article1',
'https://news.example.com/article2',
'https://news.example.com/article3'
];
const response = await fetch('https://api.robotscraping.com/batch', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': process.env.API_KEY
},
body: JSON.stringify({
urls: articleUrls,
fields: ['article_title', 'article_body', 'author', 'publication_date'],
async: true,
webhook_url: 'https://your-app.com/webhook/articles'
})
});
Managing Multiple News Sources
A robust aggregator monitors multiple sources simultaneously. Here's how to structure your source management:
Source Database Schema
-- News sources configuration
CREATE TABLE news_sources (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  base_url TEXT NOT NULL,
  category_pages TEXT,       -- JSON array of category URLs
  rss_feeds TEXT,            -- JSON array of RSS feed URLs
  update_frequency INTEGER,  -- Minutes between checks
  is_active BOOLEAN,
  last_crawled_at TIMESTAMP
);

-- Extracted articles
CREATE TABLE articles (
  id INTEGER PRIMARY KEY,
  source_id INTEGER,
  url TEXT UNIQUE,
  normalized_url TEXT,       -- normalizeUrl(url), used for deduplication
  title TEXT,
  author TEXT,
  publication_date TIMESTAMP,
  article_body TEXT,
  summary TEXT,
  category TEXT,
  tags TEXT,                 -- JSON array
  image_url TEXT,
  scraped_at TIMESTAMP,
  FOREIGN KEY (source_id) REFERENCES news_sources(id)
);

-- Article deduplication tracking
CREATE TABLE article_clusters (
  id INTEGER PRIMARY KEY,
  canonical_article_id INTEGER,
  cluster_signature TEXT,
  article_ids TEXT,          -- JSON array of related article IDs
  created_at TIMESTAMP
);
Source Management Code
// Get sources that need to be crawled
async function getSourcesForCrawling(db) {
const sources = await db.prepare(`
SELECT * FROM news_sources
WHERE is_active = true
AND (
last_crawled_at IS NULL
OR datetime(last_crawled_at, '+' || update_frequency || ' minutes') < datetime('now')
)
`).all();
return sources;
}
// Mark source as crawled
async function markSourceCrawled(db, sourceId) {
await db.prepare(`
UPDATE news_sources
SET last_crawled_at = datetime('now')
WHERE id = ?
`).bind(sourceId).run();
}
// Store extracted articles
async function storeArticles(db, sourceId, articles) {
for (const article of articles) {
await db.prepare(`
INSERT OR REPLACE INTO articles
(source_id, url, title, author, publication_date,
article_body, summary, category, tags, image_url, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`).bind(
sourceId,
article.url,
article.title,
article.author,
article.publication_date,
article.article_body,
article.summary,
article.category,
JSON.stringify(article.tags || []),
article.image_url,
new Date().toISOString() // scraped_at as an ISO timestamp (Date.now() would store epoch milliseconds)
).run();
}
}
Article Deduplication
The same story often appears across multiple outlets. Deduplication prevents showing duplicate content to users.
URL-Based Deduplication
// Simple URL normalization
function normalizeUrl(url) {
  return url
    .toLowerCase()
    .replace(/^https?:\/\//, '')
    .replace(/#.*$/, '')
    .replace(/\/$/, '')
    .replace(/^www\./, '');
}
// Check for duplicates before inserting
// (assumes a normalized_url column is populated via normalizeUrl()
// whenever articles are stored)
async function findDuplicateByUrl(db, url) {
  const normalized = normalizeUrl(url);
  return await db.prepare(`
    SELECT id FROM articles WHERE normalized_url = ?
  `).bind(normalized).first();
}
Content-Based Similarity
For stories with different URLs, use content similarity. This is more complex but catches syndicated content.
const crypto = require('crypto');
// Generate a simple content signature
function generateContentSignature(title, summary) {
  const words = (title + ' ' + summary)
    .toLowerCase()
    .replace(/[^\w\s]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 3)
    .sort();
  // Create hash from sorted unique words
  const unique = [...new Set(words)];
  return crypto.createHash('md5')
    .update(unique.join(' '))
    .digest('hex');
}
// Find similar articles
async function findSimilarArticles(db, title, summary) {
  const signature = generateContentSignature(title, summary);
  return await db.prepare(`
    SELECT a.*, c.cluster_signature
    FROM articles a
    LEFT JOIN article_clusters c ON a.id = c.canonical_article_id
    WHERE c.cluster_signature = ?
  `).bind(signature).all();
}
Setting Up Scheduled Monitoring
News moves fast. Set up schedules to monitor sources frequently and receive content via webhooks.
// Create a monitoring schedule for each source
const sources = [
{ name: 'TechCrunch', url: 'https://techcrunch.com', categoryUrl: 'https://techcrunch.com/category/artificial-intelligence/' },
{ name: 'The Verge', url: 'https://theverge.com', categoryUrl: 'https://theverge.com/tech' },
{ name: 'Ars Technica', url: 'https://arstechnica.com', categoryUrl: 'https://arstechnica.com/gadgets/' }
];
for (const source of sources) {
await fetch('https://api.robotscraping.com/schedules', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': process.env.API_KEY
},
body: JSON.stringify({
name: 'news-' + source.name.toLowerCase(),
cron: '0 */2 * * *', // Every 2 hours
urls: [source.categoryUrl],
config: {
fields: ['headlines'],
instructions: 'Extract headlines from ' + source.name + '. Return title, URL, date, and summary.',
options: {
waitUntil: 'domcontentloaded'
}
},
webhook_url: 'https://your-app.com/webhook/news/' + source.name
})
});
}
Webhook Handler
app.post('/webhook/news/:source', async (req, res) => {
const { source } = req.params;
const { result } = req.body;
// Verify webhook signature
const signature = req.headers['x-webhook-signature'];
if (!verifySignature(signature, req.body, WEBHOOK_SECRET)) {
return res.status(401).send('Invalid signature');
}
// Get source ID
const sourceRecord = await getSourceByName(source);
if (!sourceRecord) {
return res.status(404).send('Source not found');
}
// Process headlines
const { headlines } = result;
for (const headline of headlines) {
// Check for duplicates
const existing = await findDuplicateByUrl(db, headline.url);
if (existing) continue;
// Queue full article fetch
await queueArticleFetch(sourceRecord.id, headline.url);
}
res.sendStatus(200);
});
Legal and Ethical Considerations
News aggregation has specific legal considerations that differ from other types of scraping:
- Copyright: Full article content may be protected. Consider using headlines, summaries, and links to the original source rather than reproducing entire articles.
- Terms of Service: Many news sites explicitly prohibit scraping in their ToS. Review and respect these terms.
- Attribution: Always clearly attribute content to the original source. Link back to the original article.
- Paywalls: Respect paywall restrictions. Don't circumvent access controls for gated content.
- Robots.txt: Check and respect robots.txt directives for each source.
- Hotlinking: Don't hotlink images from news sites. Download and host images yourself or use thumbnails that link to the original.
Best Practice: Fair Use
Aggregate headlines and brief excerpts (typically 1-2 sentences) with clear attribution and links to the original source. This approach is more likely to qualify as fair use and provides value by organizing and filtering content rather than reproducing it.
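A small helper can enforce this excerpt policy before anything is published (a sketch only; the sentence-splitting heuristic is simplistic, and the two-sentence limit is an editorial choice, not legal advice):

```javascript
// Build an attributed excerpt: at most `maxSentences` sentences
// from the summary, followed by a link back to the original source.
function buildExcerpt(article, maxSentences = 2) {
  // Naive sentence split on ., !, or ? followed by whitespace.
  const sentences = article.summary.split(/(?<=[.!?])\s+/);
  const excerpt = sentences.slice(0, maxSentences).join(' ');
  return `${excerpt} (Source: ${article.sourceName}, ${article.url})`;
}
```

Running every stored summary through a gate like this keeps the aggregator in "organize and link" territory rather than wholesale reproduction.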
Using RobotScraping.com for News Aggregation
RobotScraping.com simplifies news aggregation with AI-powered extraction that handles different site layouts automatically.
// Complete news aggregation workflow
const newsSources = [
'https://techcrunch.com/feed/',
'https://www.theverge.com/rss/index.xml',
'https://arstechnica.com/feed/'
];
// Create schedules for each source
for (const source of newsSources) {
await fetch('https://api.robotscraping.com/schedules', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-api-key': process.env.API_KEY
},
body: JSON.stringify({
name: 'news-' + new URL(source).hostname,
cron: '0 */1 * * *', // Every hour
urls: [source],
config: {
fields: ['title', 'link', 'description', 'pubDate', 'creator', 'categories'],
options: {
timeoutMs: 15000
}
},
webhook_url: 'https://your-app.com/webhook/news'
})
});
}
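To close the loop on the Delivery & Alerts component, stored articles can be rolled up into a periodic digest. A sketch against the articles table above (the 24-hour window and category grouping are illustrative choices, and the `db.prepare(...).all()` interface matches the source-management code earlier):

```javascript
// Group the last 24 hours of articles by category for an email digest.
async function buildDailyDigest(db) {
  const rows = await db.prepare(`
    SELECT category, title, url, summary
    FROM articles
    WHERE scraped_at >= datetime('now', '-1 day')
    ORDER BY category, publication_date DESC
  `).all();
  const digest = {};
  for (const row of rows) {
    (digest[row.category] ||= []).push(row);
  }
  return digest;
}
```

The resulting object maps each category to its recent articles, ready to render into an email template or feed.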