News Data Connectors

OpenML Crawler provides news data connectors for accessing real-time and historical articles from a range of news sources and APIs. The connectors also support content analysis, sentiment extraction, and metadata processing.

Supported News Sources

NewsAPI

Global news aggregation service with access to thousands of news sources worldwide.

Features:

  • Real-time news articles
  • Historical news search
  • Source categorization
  • Geographic filtering
  • Language support
  • Article metadata
  • Image and multimedia content

Configuration:

connectors:
  news:
    newsapi:
      api_key: "${NEWSAPI_API_KEY}"
      language: "en"
      country: "us"
      category: "technology"
      page_size: 100

Usage:

from openmlcrawler.connectors.news import NewsAPIConnector

connector = NewsAPIConnector(api_key="your_key")

# Get top headlines
headlines = connector.get_top_headlines(
    country="us",
    category="technology",
    page_size=50
)

# Search news articles
articles = connector.search_articles(
    query="artificial intelligence",
    from_date="2023-01-01",
    to_date="2023-12-31",
    sort_by="relevancy"
)

# Get news sources
sources = connector.get_sources(
    category="technology",
    language="en",
    country="us"
)

Google News

Google's news aggregation service with comprehensive global news coverage.

Features:

  • Google News search
  • Trending topics
  • Regional news
  • Topic clustering
  • Article snippets
  • Source credibility scores
  • Real-time updates

Usage:

from openmlcrawler.connectors.news import GoogleNewsConnector

connector = GoogleNewsConnector()

# Search Google News
results = connector.search_news(
    query="machine learning",
    time_range="1d",  # 1h, 1d, 7d, 1y
    region="US",
    language="en"
)

# Get trending topics
trending = connector.get_trending_topics(
    region="US",
    category="technology"
)

# Get topic clusters
clusters = connector.get_topic_clusters(
    topic="artificial intelligence"
)

Bing News Search

Microsoft's news search service with comprehensive news aggregation and search capabilities.

Features:

  • Advanced search operators
  • News freshness filtering
  • Source diversity
  • Article categorization
  • Entity extraction
  • Sentiment analysis
  • Multimedia content

Configuration:

connectors:
  news:
    bing_news:
      api_key: "${BING_NEWS_API_KEY}"
      market: "en-US"
      safe_search: "Moderate"
      count: 50

Usage:

from openmlcrawler.connectors.news import BingNewsConnector

connector = BingNewsConnector(api_key="your_key")

# Search news with advanced filters
results = connector.search_news(
    query="renewable energy",
    freshness="Week",  # Day, Week, Month
    count=100,
    market="en-US"
)

# Get news by category
category_news = connector.get_news_by_category(
    category="ScienceAndTechnology",
    count=50
)

# Search with entity extraction
entity_results = connector.search_with_entities(
    query="Tesla electric vehicles",
    extract_entities=True
)

NY Times API

Premium news content from The New York Times with in-depth articles and archives.

Features:

  • New York Times articles
  • Archive access (1851-present)
  • Article search and filtering
  • Bestseller lists
  • Movie reviews
  • Real estate data
  • Geographic coverage

Usage:

from openmlcrawler.connectors.news import NYTimesConnector

connector = NYTimesConnector(api_key="your_key")

# Search articles
articles = connector.search_articles(
    query="artificial intelligence",
    begin_date="20230101",
    end_date="20231231",
    sort="newest"
)

# Get top stories
top_stories = connector.get_top_stories(section="technology")

# Access archives
archive = connector.get_archive(
    year=2023,
    month=6
)

# Get bestseller lists
bestsellers = connector.get_bestseller_list(
    list_name="hardcover-fiction",
    date="current"
)

Data Types and Parameters

Article Metadata

Field         Description            Type      Sources
title         Article headline       string    All
description   Article summary        string    All
content       Full article text      string    NewsAPI, NY Times
url           Article URL            string    All
source        Publication source     object    All
author        Article author         string    Most
published_at  Publication date       datetime  All
category      Article category       string    Most
tags          Article tags/keywords  array     Most
image_url     Featured image         string    Most
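
The fields above can be pictured as a simple normalized record. The sketch below is illustrative only: it assumes a plain Python dict, and the actual objects returned by each connector may differ.

from datetime import datetime, timezone

# Illustrative shape of a normalized article record (assumed, not the
# connectors' documented return type).
article = {
    "title": "Example headline",
    "description": "Short summary of the article",
    "content": "Full article text where the source provides it",
    "url": "https://example.com/article",
    "source": {"id": "example", "name": "Example News"},
    "author": "Jane Doe",
    "published_at": datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc),
    "category": "technology",
    "tags": ["ai", "research"],
    "image_url": "https://example.com/image.jpg",
}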

Content Analysis

  • Sentiment Analysis: Positive/negative/neutral sentiment scores
  • Entity Extraction: People, organizations, locations mentioned
  • Topic Classification: Automatic categorization
  • Language Detection: Article language identification
  • Readability Metrics: Flesch-Kincaid scores, complexity analysis (a rough sketch follows this list)
  • Keyword Extraction: Important terms and phrases
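
As a rough illustration of the readability metrics above, the Flesch-Kincaid grade level can be computed directly from word, sentence, and syllable counts. The syllable counter below is a crude vowel-group heuristic for demonstration, not necessarily what the connectors use.

import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

print(flesch_kincaid_grade("The quick brown fox jumps over the lazy dog."))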

Multimedia Content

  • Images: Article thumbnails and featured images
  • Videos: Embedded video content
  • Audio: Podcast and audio content
  • Infographics: Charts and data visualizations
  • Social Media: Related social media posts

Data Collection Strategies

Real-time News Streaming

from openmlcrawler.connectors.news import NewsStreamer

streamer = NewsStreamer()

# Stream breaking news
streamer.stream_breaking_news(
    keywords=["breaking", "urgent", "alert"],
    sources=["newsapi", "google_news"],
    callback=process_breaking_news
)

# Stream topic-specific news
streamer.stream_topic_news(
    topics=["politics", "technology", "sports"],
    languages=["en", "es", "fr"],
    callback=process_topic_news
)

# Stream geographic news
streamer.stream_geographic_news(
    locations=["New York", "London", "Tokyo"],
    radius_km=100,
    callback=process_local_news
)
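
The streaming calls above hand each matching article to a user-supplied callback. A minimal sketch of such a callback is shown below; it assumes each article arrives as a dict with the fields from the Article Metadata table, which may not match the connector's actual payload.

import json

def process_breaking_news(article):
    # Hypothetical callback: log the headline and append the raw record
    # to a JSON Lines file. Assumes `article` is a dict with the metadata
    # fields described above.
    print(f"[{article.get('published_at')}] {article.get('title')}")
    with open("breaking_news.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(article, default=str) + "\n")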

Historical News Collection

from openmlcrawler.connectors.news import HistoricalNewsCollector

collector = HistoricalNewsCollector()

# Collect news by date range
news_data = collector.collect_news_history(
    query="climate change",
    start_date="2020-01-01",
    end_date="2023-12-31",
    sources=["newsapi", "nytimes"],
    max_articles=10000
)

# Collect news by topic
topic_news = collector.collect_topic_history(
    topics=["artificial intelligence", "machine learning"],
    date_range=("2023-01-01", "2023-12-31"),
    include_sentiment=True
)

# Collect news by source
source_news = collector.collect_source_history(
    sources=["nytimes.com", "bbc.com", "cnn.com"],
    date_range=("2023-01-01", "2023-12-31")
)

Batch Processing

from openmlcrawler.connectors.news import BatchNewsProcessor

processor = BatchNewsProcessor()

# Process multiple news sources
results = processor.process_batch(
    queries=["AI", "machine learning", "data science"],
    sources=["newsapi", "google_news", "bing_news"],
    date_range=("2023-01-01", "2023-12-31"),
    output_format="parquet"
)

Content Analysis and Processing

Sentiment Analysis

from openmlcrawler.connectors.news import NewsSentimentAnalyzer

analyzer = NewsSentimentAnalyzer()

# Analyze article sentiment
sentiment_results = analyzer.analyze_sentiment(
    articles=news_articles,
    model="vader",  # vader, textblob, transformers
    include_confidence=True
)

# Get sentiment trends
trends = analyzer.get_sentiment_trends(
    sentiment_data=sentiment_results,
    time_window="daily",
    group_by="source"
)

# Detect sentiment shifts
shifts = analyzer.detect_sentiment_shifts(
    data=sentiment_results,
    threshold=0.2,
    window_days=7
)
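
For reference, the "vader" option above corresponds to the widely used VADER lexicon model. The standalone sketch below uses the vaderSentiment package directly; treating it as the backing library is an assumption, not a statement about how NewsSentimentAnalyzer is implemented.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos proportions plus a compound score
# in [-1, 1]; a common convention treats compound >= 0.05 as positive
# and <= -0.05 as negative.
scores = vader.polarity_scores("The new chip delivers impressive performance gains.")
print(scores)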

Topic Modeling and Classification

from openmlcrawler.connectors.news import NewsTopicAnalyzer

topic_analyzer = NewsTopicAnalyzer()

# Classify articles by topic
classified_articles = topic_analyzer.classify_articles(
    articles=news_articles,
    categories=[
        "politics", "technology", "business", "sports",
        "health", "entertainment", "science", "world"
    ]
)

# Extract topics using LDA
topics = topic_analyzer.extract_topics_lda(
    articles=news_articles,
    num_topics=10,
    num_words=10
)

# Find related articles
related = topic_analyzer.find_related_articles(
    target_article=article,
    candidate_articles=news_articles,
    similarity_threshold=0.7
)
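
Conceptually, extract_topics_lda fits a Latent Dirichlet Allocation model over a bag-of-words representation of the article texts. The scikit-learn sketch below shows the same idea on a toy corpus; it is illustrative and not necessarily the connector's implementation.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; in practice, pass the article texts collected above.
texts = [
    "New AI model improves machine translation accuracy.",
    "Stock markets rally as tech earnings beat expectations.",
    "Researchers publish a breakthrough in battery chemistry.",
]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words for each learned topic.
vocab = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top_words = [vocab[j] for j in weights.argsort()[::-1][:10]]
    print(f"Topic {i}: {', '.join(top_words)}")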

Entity Extraction

from openmlcrawler.connectors.news import NewsEntityExtractor

extractor = NewsEntityExtractor()

# Extract named entities
entities = extractor.extract_entities(
    articles=news_articles,
    entity_types=["PERSON", "ORG", "GPE", "MONEY", "DATE"],
    include_context=True
)

# Build entity networks
network = extractor.build_entity_network(
    entities=entities,
    relationship_types=["mentioned_with", "quoted_by"]
)

# Track entity mentions over time
timeline = extractor.track_entity_mentions(
    entity="artificial intelligence",
    articles=news_articles,
    time_window="monthly"
)
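
The entity types above (PERSON, ORG, GPE, MONEY, DATE) follow the label scheme used by spaCy's English models. The standalone sketch below uses spaCy directly as an example of comparable tooling; it is an assumption, not a description of NewsEntityExtractor's internals, and requires downloading the model first (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Tesla said on Monday it will invest $1 billion in a new plant in Texas.")

# Each entity carries its surface text and a label such as ORG, GPE, MONEY, or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)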

Data Quality and Validation

Quality Assurance

  1. Source Credibility: Verify news source reliability
  2. Content Freshness: Check article publication dates
  3. Duplicate Detection: Identify duplicate articles
  4. Spam Filtering: Remove low-quality or spam content
  5. Fact-Checking: Cross-reference with reliable sources

Validation Framework

from openmlcrawler.connectors.news import NewsDataValidator

validator = NewsDataValidator()

# Validate news articles
validation_result = validator.validate_articles(
    articles=news_articles,
    checks=[
        "source_credibility",
        "content_freshness",
        "duplicate_detection",
        "spam_filtering",
        "fact_checking"
    ]
)

# Generate quality report
quality_report = validator.generate_quality_report(
    validation_results=validation_result,
    include_recommendations=True
)

Integration with ML Pipelines

News Classification Pipeline

from openmlcrawler.connectors.news import NewsClassificationPipeline

pipeline = NewsClassificationPipeline()

# Classify news articles
classified_news = pipeline.classify_news(
    articles=raw_articles,
    models=[
        "sentiment_classifier",
        "topic_classifier",
        "fake_news_detector"
    ]
)

# Generate insights
insights = pipeline.generate_insights(
    classified_news=classified_news,
    analysis_types=[
        "sentiment_trends",
        "topic_distribution",
        "source_reliability"
    ]
)

Trend Analysis

from openmlcrawler.connectors.news import NewsTrendAnalyzer

trend_analyzer = NewsTrendAnalyzer()

# Analyze news trends
trends = trend_analyzer.analyze_trends(
    news_data=news_articles,
    time_period="30d",
    metrics=[
        "article_volume",
        "sentiment_trends",
        "topic_popularity",
        "source_coverage"
    ]
)

# Detect breaking news
breaking_news = trend_analyzer.detect_breaking_news(
    recent_articles=recent_articles,
    baseline_articles=baseline_articles,
    threshold=2.0  # 2x normal volume
)
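
One way to read the threshold=2.0 setting is as a ratio of recent article volume to a baseline average. The pandas sketch below illustrates that idea, assuming articles are dicts with a published_at field; the connector's actual detection logic may be more involved.

import pandas as pd

def volume_spike(recent_articles, baseline_articles, threshold=2.0):
    # Hourly article counts for each set (assumes dicts with 'published_at').
    recent = pd.to_datetime([a["published_at"] for a in recent_articles])
    baseline = pd.to_datetime([a["published_at"] for a in baseline_articles])

    recent_rate = pd.Series(1, index=recent).resample("1h").sum().mean()
    baseline_rate = pd.Series(1, index=baseline).resample("1h").sum().mean()

    # Flag a potential breaking-news spike when recent volume exceeds
    # `threshold` times the baseline hourly average.
    return recent_rate >= threshold * baseline_rate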

Configuration Options

Global Configuration

news_connectors:
  default_sources: ["newsapi", "google_news"]
  content_analysis:
    enable_sentiment: true
    enable_entity_extraction: true
    enable_topic_modeling: true
  data_quality:
    enable_validation: true
    credibility_threshold: 0.7
    freshness_hours: 24
  caching:
    enable_cache: true
    cache_ttl_minutes: 30
    max_cache_size_gb: 20
  rate_limiting:
    requests_per_minute: 10
    burst_limit: 20

Source-Specific Settings

newsapi:
  api_key: "${NEWSAPI_API_KEY}"
  language: "en"
  country: "us"
  page_size: 100

google_news:
  user_agent: "OpenMLCrawler/1.0"
  region: "US"
  language: "en"
  timeout_seconds: 30

bing_news:
  api_key: "${BING_NEWS_API_KEY}"
  market: "en-US"
  safe_search: "Moderate"
  count: 50

nytimes:
  api_key: "${NYTIMES_API_KEY}"
  timeout_seconds: 30
  retry_attempts: 3

Best Practices

Performance Optimization

  1. Use Caching: News data can be cached for short periods
  2. Batch Requests: Combine multiple queries when possible
  3. Selective Filtering: Use specific search criteria to reduce data volume
  4. Rate Limiting: Respect API rate limits and implement backoff
  5. Content Filtering: Filter out irrelevant or low-quality content early

Cost Management

  1. API Tier Selection: Choose appropriate API tiers for your needs
  2. Usage Monitoring: Track API usage and costs
  3. Query Optimization: Use specific queries to minimize API calls
  4. Data Sampling: Sample news data for analysis instead of full collection
  5. Caching Strategy: Implement intelligent caching for frequently accessed data

Data Reliability

  1. Multiple Sources: Cross-validate news from multiple sources
  2. Source Credibility: Use credibility scoring for source validation
  3. Fact-Checking: Implement fact-checking mechanisms
  4. Freshness Monitoring: Monitor news freshness and timeliness
  5. Bias Detection: Detect potential bias in news coverage

Troubleshooting

Common Issues

API Rate Limiting

Error: API rate limit exceeded
Solution: Implement exponential backoff and reduce request frequency (see the sketch below)
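
A minimal retry-with-backoff sketch is shown below. It assumes a generic zero-argument fetch callable that raises when the API returns a rate-limit error (e.g. HTTP 429) and is not tied to any specific connector.

import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    # Hypothetical helper: retries `fetch` with exponential backoff plus jitter.
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Backoff schedule: ~1s, 2s, 4s, ... plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)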

Content Filtering

Error: No articles found
Solution: Broaden search criteria or check content filters

Source Unavailable

Error: News source temporarily unavailable
Solution: Use alternative sources or implement fallback strategies

Content Parsing

Error: Article content parsing failed
Solution: Check article format and update parsing logic

See Also