News Data Connectors¶
OpenML Crawler provides news data connectors that retrieve real-time and historical news articles from a range of news sources and APIs. These connectors support content analysis, sentiment extraction, and metadata processing for news data.
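For example, a few lines are enough to pull technology headlines through the NewsAPI connector documented below (a minimal sketch that assumes the connector returns article records with the fields listed under Article Metadata):
# Minimal sketch using the NewsAPI connector documented below; assumes
# returned articles expose the fields listed under "Article Metadata".
from openmlcrawler.connectors.news import NewsAPIConnector
connector = NewsAPIConnector(api_key="your_key")
headlines = connector.get_top_headlines(country="us", category="technology", page_size=10)
for article in headlines:
    print(article["published_at"], article["title"])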
Supported News Sources¶
NewsAPI¶
Global news aggregation service with access to thousands of news sources worldwide.
Features:
- Real-time news articles
- Historical news search
- Source categorization
- Geographic filtering
- Language support
- Article metadata
- Image and multimedia content
Configuration:
connectors:
  news:
    newsapi:
      api_key: "${NEWSAPI_API_KEY}"
      language: "en"
      country: "us"
      category: "technology"
      page_size: 100
Usage:
from openmlcrawler.connectors.news import NewsAPIConnector
connector = NewsAPIConnector(api_key="your_key")
# Get top headlines
headlines = connector.get_top_headlines(
    country="us",
    category="technology",
    page_size=50
)
# Search news articles
articles = connector.search_articles(
    query="artificial intelligence",
    from_date="2023-01-01",
    to_date="2023-12-31",
    sort_by="relevancy"
)
# Get news sources
sources = connector.get_sources(
    category="technology",
    language="en",
    country="us"
)
Google News¶
Google's news aggregation service with comprehensive global news coverage.
Features:
- Google News search
- Trending topics
- Regional news
- Topic clustering
- Article snippets
- Source credibility scores
- Real-time updates
Usage:
from openmlcrawler.connectors.news import GoogleNewsConnector
connector = GoogleNewsConnector()
# Search Google News
results = connector.search_news(
    query="machine learning",
    time_range="1d",  # 1h, 1d, 7d, 1y
    region="US",
    language="en"
)
# Get trending topics
trending = connector.get_trending_topics(
    region="US",
    category="technology"
)
# Get topic clusters
clusters = connector.get_topic_clusters(
    topic="artificial intelligence"
)
Bing News Search¶
Microsoft's news search service with comprehensive news aggregation and search capabilities.
Features:
- Advanced search operators
- News freshness filtering
- Source diversity
- Article categorization
- Entity extraction
- Sentiment analysis
- Multimedia content
Configuration:
connectors:
  news:
    bing_news:
      api_key: "${BING_NEWS_API_KEY}"
      market: "en-US"
      safe_search: "Moderate"
      count: 50
Usage:
from openmlcrawler.connectors.news import BingNewsConnector
connector = BingNewsConnector(api_key="your_key")
# Search news with advanced filters
results = connector.search_news(
    query="renewable energy",
    freshness="Week",  # Day, Week, Month
    count=100,
    market="en-US"
)
# Get news by category
category_news = connector.get_news_by_category(
    category="ScienceAndTechnology",
    count=50
)
# Search with entity extraction
entity_results = connector.search_with_entities(
    query="Tesla electric vehicles",
    extract_entities=True
)
NY Times API¶
Premium news content from The New York Times with in-depth articles and archives.
Features:
- New York Times articles
- Archive access (1851-present)
- Article search and filtering
- Bestseller lists
- Movie reviews
- Real estate data
- Geographic coverage
Usage:
from openmlcrawler.connectors.news import NYTimesConnector
connector = NYTimesConnector(api_key="your_key")
# Search articles
articles = connector.search_articles(
    query="artificial intelligence",
    begin_date="20230101",
    end_date="20231231",
    sort="newest"
)
# Get top stories
top_stories = connector.get_top_stories(section="technology")
# Access archives
archive = connector.get_archive(
    year=2023,
    month=6
)
# Get bestseller lists
bestsellers = connector.get_bestseller_list(
    list_name="hardcover-fiction",
    date="current"
)
Data Types and Parameters¶
Article Metadata¶
| Field | Description | Type | Sources |
|---|---|---|---|
| title | Article headline | string | All |
| description | Article summary | string | All |
| content | Full article text | string | NewsAPI, NY Times |
| url | Article URL | string | All |
| source | Publication source | object | All |
| author | Article author | string | Most |
| published_at | Publication date | datetime | All |
| category | Article category | string | Most |
| tags | Article tags/keywords | array | Most |
| image_url | Featured image | string | Most |
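Exact payloads vary by source; purely as an illustration, a normalized article record with the fields above might look like this (all values hypothetical):
# Hypothetical article record illustrating the metadata fields above.
article = {
    "title": "AI Chips Drive Record Data Center Spending",
    "description": "Cloud providers ramp up hardware budgets.",
    "content": "Full article text (NewsAPI and NY Times only)...",
    "url": "https://example.com/ai-chips-data-center",
    "source": {"id": "example-news", "name": "Example News"},
    "author": "Jane Doe",
    "published_at": "2023-06-15T08:30:00Z",
    "category": "technology",
    "tags": ["ai", "semiconductors", "cloud"],
    "image_url": "https://example.com/images/ai-chips.jpg",
}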
Content Analysis¶
- Sentiment Analysis: Positive/negative/neutral sentiment scores
- Entity Extraction: People, organizations, locations mentioned
- Topic Classification: Automatic categorization
- Language Detection: Article language identification
- Readability Metrics: Flesch-Kincaid scores, complexity analysis
- Keyword Extraction: Important terms and phrases
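OpenML Crawler's analyzers for these tasks are shown later on this page; as a standalone illustration, sentiment scoring and naive keyword extraction can be sketched with the third-party vaderSentiment package and plain Python:
# Standalone sketch of sentiment scoring and naive keyword extraction,
# using the third-party vaderSentiment package rather than OpenML
# Crawler's own analyzers (shown later on this page).
import re
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = (
    "Regulators approved the new battery plant, a major win for the "
    "region's clean energy industry."
)
# Compound score lies in [-1, 1]; values >= 0.05 are commonly read as positive.
scores = SentimentIntensityAnalyzer().polarity_scores(text)
print(scores["compound"])
# Naive keyword extraction: most frequent non-trivial tokens.
stopwords = {"the", "for", "and"}
tokens = [t for t in re.findall(r"[a-z']+", text.lower())
          if len(t) > 3 and t not in stopwords]
print(Counter(tokens).most_common(5))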
Multimedia Content¶
- Images: Article thumbnails and featured images
- Videos: Embedded video content
- Audio: Podcast and audio content
- Infographics: Charts and data visualizations
- Social Media: Related social media posts
Data Collection Strategies¶
Real-time News Streaming¶
from openmlcrawler.connectors.news import NewsStreamer
streamer = NewsStreamer()
# process_breaking_news, process_topic_news and process_local_news are
# user-defined handler functions invoked for incoming articles.
# Stream breaking news
streamer.stream_breaking_news(
    keywords=["breaking", "urgent", "alert"],
    sources=["newsapi", "google_news"],
    callback=process_breaking_news
)
# Stream topic-specific news
streamer.stream_topic_news(
    topics=["politics", "technology", "sports"],
    languages=["en", "es", "fr"],
    callback=process_topic_news
)
# Stream geographic news
streamer.stream_geographic_news(
    locations=["New York", "London", "Tokyo"],
    radius_km=100,
    callback=process_local_news
)
Historical News Collection¶
from openmlcrawler.connectors.news import HistoricalNewsCollector
collector = HistoricalNewsCollector()
# Collect news by date range
news_data = collector.collect_news_history(
    query="climate change",
    start_date="2020-01-01",
    end_date="2023-12-31",
    sources=["newsapi", "nytimes"],
    max_articles=10000
)
# Collect news by topic
topic_news = collector.collect_topic_history(
    topics=["artificial intelligence", "machine learning"],
    date_range=("2023-01-01", "2023-12-31"),
    include_sentiment=True
)
# Collect news by source
source_news = collector.collect_source_history(
    sources=["nytimes.com", "bbc.com", "cnn.com"],
    date_range=("2023-01-01", "2023-12-31")
)
Batch Processing¶
from openmlcrawler.connectors.news import BatchNewsProcessor
processor = BatchNewsProcessor()
# Process multiple queries across multiple news sources
results = processor.process_batch(
    queries=["AI", "machine learning", "data science"],
    sources=["newsapi", "google_news", "bing_news"],
    date_range=("2023-01-01", "2023-12-31"),
    output_format="parquet"
)
Content Analysis and Processing¶
Sentiment Analysis¶
from openmlcrawler.connectors.news import NewsSentimentAnalyzer
analyzer = NewsSentimentAnalyzer()
# Analyze article sentiment
sentiment_results = analyzer.analyze_sentiment(
    articles=news_articles,
    model="vader",  # vader, textblob, transformers
    include_confidence=True
)
# Get sentiment trends
trends = analyzer.get_sentiment_trends(
    sentiment_data=sentiment_results,
    time_window="daily",
    group_by="source"
)
# Detect sentiment shifts
shifts = analyzer.detect_sentiment_shifts(
    data=sentiment_results,
    threshold=0.2,
    window_days=7
)
Topic Modeling and Classification¶
from openmlcrawler.connectors.news import NewsTopicAnalyzer
topic_analyzer = NewsTopicAnalyzer()
# Classify articles by topic
classified_articles = topic_analyzer.classify_articles(
    articles=news_articles,
    categories=[
        "politics", "technology", "business", "sports",
        "health", "entertainment", "science", "world"
    ]
)
# Extract topics using LDA
topics = topic_analyzer.extract_topics_lda(
    articles=news_articles,
    num_topics=10,
    num_words=10
)
# Find related articles
related = topic_analyzer.find_related_articles(
    target_article=article,
    candidate_articles=news_articles,
    similarity_threshold=0.7
)
Entity Extraction¶
from openmlcrawler.connectors.news import NewsEntityExtractor
extractor = NewsEntityExtractor()
# Extract named entities
entities = extractor.extract_entities(
    articles=news_articles,
    entity_types=["PERSON", "ORG", "GPE", "MONEY", "DATE"],
    include_context=True
)
# Build entity networks
network = extractor.build_entity_network(
    entities=entities,
    relationship_types=["mentioned_with", "quoted_by"]
)
# Track entity mentions over time
timeline = extractor.track_entity_mentions(
    entity="artificial intelligence",
    articles=news_articles,
    time_window="monthly"
)
Data Quality and Validation¶
Quality Assurance¶
- Source Credibility: Verify news source reliability
- Content Freshness: Check article publication dates
- Duplicate Detection: Identify duplicate articles
- Spam Filtering: Remove low-quality or spam content
- Fact-Checking: Cross-reference with reliable sources
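The validation framework below bundles these checks; as a standalone sketch of one of them, duplicate detection can fingerprint a normalized title plus URL (plain Python, no OpenML Crawler APIs assumed):
# Minimal duplicate-detection sketch: hash a normalized title + URL and
# keep only the first article seen for each fingerprint. Illustrative only;
# the NewsDataValidator shown below bundles this kind of check.
import hashlib
import re

def fingerprint(article):
    title = re.sub(r"\W+", " ", article.get("title", "")).lower().strip()
    url = article.get("url", "").split("?")[0].rstrip("/")
    return hashlib.sha256(f"{title}|{url}".encode("utf-8")).hexdigest()

def deduplicate(articles):
    seen, unique = set(), []
    for article in articles:
        fp = fingerprint(article)
        if fp not in seen:
            seen.add(fp)
            unique.append(article)
    return unique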
Validation Framework¶
from openmlcrawler.connectors.news import NewsDataValidator
validator = NewsDataValidator()
# Validate news articles
validation_result = validator.validate_articles(
    articles=news_articles,
    checks=[
        "source_credibility",
        "content_freshness",
        "duplicate_detection",
        "spam_filtering",
        "fact_checking"
    ]
)
# Generate quality report
quality_report = validator.generate_quality_report(
    validation_results=validation_result,
    include_recommendations=True
)
Integration with ML Pipelines¶
News Classification Pipeline¶
from openmlcrawler.connectors.news import NewsClassificationPipeline
pipeline = NewsClassificationPipeline()
# Classify news articles
classified_news = pipeline.classify_news(
    articles=raw_articles,
    models=[
        "sentiment_classifier",
        "topic_classifier",
        "fake_news_detector"
    ]
)
# Generate insights
insights = pipeline.generate_insights(
    classified_news=classified_news,
    analysis_types=[
        "sentiment_trends",
        "topic_distribution",
        "source_reliability"
    ]
)
Trend Analysis¶
from openmlcrawler.connectors.news import NewsTrendAnalyzer
trend_analyzer = NewsTrendAnalyzer()
# Analyze news trends
trends = trend_analyzer.analyze_trends(
    news_data=news_articles,
    time_period="30d",
    metrics=[
        "article_volume",
        "sentiment_trends",
        "topic_popularity",
        "source_coverage"
    ]
)
# Detect breaking news
breaking_news = trend_analyzer.detect_breaking_news(
    recent_articles=recent_articles,
    baseline_articles=baseline_articles,
    threshold=2.0  # 2x normal volume
)
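Conceptually, the threshold is a volume ratio: a topic is flagged when its recent article count reaches that multiple of its baseline count. A simplified sketch, not the analyzer's actual implementation:
# Simplified volume-ratio check behind "2x normal volume"; the real
# NewsTrendAnalyzer logic may differ.
from collections import Counter

def breaking_topics(recent_articles, baseline_articles, threshold=2.0):
    recent = Counter(a["category"] for a in recent_articles)
    baseline = Counter(a["category"] for a in baseline_articles)
    return [topic for topic, count in recent.items()
            if count >= threshold * max(baseline.get(topic, 0), 1)]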
Configuration Options¶
Global Configuration¶
news_connectors:
  default_sources: ["newsapi", "google_news"]
  content_analysis:
    enable_sentiment: true
    enable_entity_extraction: true
    enable_topic_modeling: true
  data_quality:
    enable_validation: true
    credibility_threshold: 0.7
    freshness_hours: 24
  caching:
    enable_cache: true
    cache_ttl_minutes: 30
    max_cache_size_gb: 20
  rate_limiting:
    requests_per_minute: 10
    burst_limit: 20
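The ${...} placeholders are typically resolved from environment variables. If you need to read such a file yourself, a minimal sketch with PyYAML and os.path.expandvars looks like this (the helper and file name are illustrative, not part of the OpenML Crawler API):
# Illustrative loader for a connector config file with ${ENV_VAR}
# placeholders; not an OpenML Crawler API.
import os
import yaml

def load_config(path):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(os.path.expandvars(f.read()))

config = load_config("news_connectors.yaml")  # hypothetical file name
print(config["news_connectors"]["rate_limiting"]["requests_per_minute"])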
Source-Specific Settings¶
newsapi:
  api_key: "${NEWSAPI_API_KEY}"
  language: "en"
  country: "us"
  page_size: 100
google_news:
  user_agent: "OpenMLCrawler/1.0"
  region: "US"
  language: "en"
  timeout_seconds: 30
bing_news:
  api_key: "${BING_NEWS_API_KEY}"
  market: "en-US"
  safe_search: "Moderate"
  count: 50
nytimes:
  api_key: "${NYTIMES_API_KEY}"
  timeout_seconds: 30
  retry_attempts: 3
Best Practices¶
Performance Optimization¶
- Use Caching: News data can be cached for short periods
- Batch Requests: Combine multiple queries when possible
- Selective Filtering: Use specific search criteria to reduce data volume
- Rate Limiting: Respect API rate limits and implement backoff
- Content Filtering: Filter out irrelevant or low-quality content early
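For example, a client-side throttle with exponential backoff on HTTP 429 responses might look like this (a generic sketch using the requests library; OpenML Crawler's rate_limiting settings cover the same ground internally):
# Generic exponential-backoff sketch using the requests library;
# OpenML Crawler's own rate_limiting config handles this internally.
import time
import requests

def fetch_with_backoff(url, params=None, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Rate limited: wait 1s, 2s, 4s, ... before retrying.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Rate limit retries exhausted")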
Cost Management¶
- API Tier Selection: Choose appropriate API tiers for your needs
- Usage Monitoring: Track API usage and costs
- Query Optimization: Use specific queries to minimize API calls
- Data Sampling: Sample news data for analysis instead of full collection
- Caching Strategy: Implement intelligent caching for frequently accessed data
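A small in-memory cache keyed by query illustrates the caching strategy; the caching block in the global configuration provides the managed equivalent (the helper below is illustrative only):
# Illustrative in-memory TTL cache for query results; the caching section
# of the global configuration enables OpenML Crawler's managed cache.
import time

_cache = {}

def cached_search(search_fn, query, ttl_seconds=1800):
    now = time.monotonic()
    entry = _cache.get(query)
    if entry and now - entry[0] < ttl_seconds:
        return entry[1]  # still fresh: reuse cached result
    result = search_fn(query)
    _cache[query] = (now, result)
    return result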
Data Reliability¶
- Multiple Sources: Cross-validate news from multiple sources
- Source Credibility: Use credibility scoring for source validation
- Fact-Checking: Implement fact-checking mechanisms
- Freshness Monitoring: Monitor news freshness and timeliness
- Bias Detection: Detect potential bias in news coverage
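As a sketch of cross-validating a story against multiple sources, an article can be treated as corroborated when headlines from several distinct outlets share most of its significant terms (simplified; production systems would use proper similarity measures):
# Simplified corroboration check: count distinct outlets whose headlines
# share most significant words with the target headline. Illustrative only.
import re

def _terms(title):
    return {t for t in re.findall(r"[a-z]+", title.lower()) if len(t) > 3}

def corroborating_sources(target, articles, overlap=0.6):
    base = _terms(target["title"])
    outlets = set()
    for a in articles:
        if base and len(base & _terms(a["title"])) / len(base) >= overlap:
            outlets.add(a["source"]["name"])
    return outlets  # e.g. require at least 3 outlets before trusting a story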
Troubleshooting¶
Common Issues¶
API Rate Limiting¶
Error: API rate limit exceeded (HTTP 429)
Solution: Lower requests_per_minute in the rate_limiting settings, enable caching, and add exponential backoff between retries
Content Filtering¶
Error: Expected articles are missing from results
Solution: Relax safe_search or category filters, broaden the query, and filter locally after retrieval
Source Unavailable¶
Error: News source temporarily unavailable
Solution: Use alternative sources or implement fallback strategies
Content Parsing¶
Error: Article content is empty or truncated
Solution: Fall back to the description field or fetch the full text from the article URL; some sources return only snippets
See Also¶
- Connectors Overview - Overview of all data connectors
- Data Processing - Processing news data
- Quality & Privacy - News data quality controls
- API Reference - News connector API
- Tutorials - News data tutorials