Data Connectors¶
OpenML Crawler provides a comprehensive set of data connectors for accessing external data sources and APIs. These connectors handle many different kinds of sources through a consistent interface with robust error handling.
Overview¶
Data connectors are the foundation of OpenML Crawler's data ingestion capabilities. Each connector is specifically designed to handle the unique characteristics of its data source, including authentication, rate limiting, data format conversion, and error recovery.
Available Connectors¶
Weather Data Connectors¶
- Weather APIs: Access real-time and historical weather data from multiple providers
  - Open-Meteo: Free weather API with global coverage
  - OpenWeather: Comprehensive weather data with forecasts
  - NOAA: Official US weather data
  - Weather Underground: Community-driven weather stations
Social Media Connectors¶
- Social Media APIs: Collect data from social platforms
  - Twitter API: Real-time tweets, user data, and trends
  - Reddit API: Subreddit data, posts, and comments
  - Facebook Graph API: Social media analytics and user data
  - Instagram API: Media content and user engagement data
Government Data Connectors¶
- Government Data: Access official government datasets
  - data.gov: US government open data portal
  - data.europa.eu: European Union data portal
  - data.gov.uk: UK government data
  - data.gov.in: Indian government data
Financial Data Connectors¶
- Financial Data: Market data and financial indicators
  - Alpha Vantage: Stock market data and technical indicators
  - Yahoo Finance: Historical stock prices and market data
  - Federal Reserve: Economic indicators and monetary data
  - CoinMarketCap: Cryptocurrency data
News Data Connectors¶
- News APIs: Real-time news and media content
  - NewsAPI: Global news from thousands of sources
  - Google News: Search and trending news
  - Bing News: Microsoft news search
  - NY Times API: Premium news content
Health Data Connectors¶
- Health Data: Medical and health-related datasets
  - CDC Data: US Centers for Disease Control data
  - WHO Data: World Health Organization statistics
  - PubMed: Medical research papers and abstracts
  - ClinicalTrials.gov: Clinical trial data
Connector Architecture¶
Base Connector Class¶
All connectors inherit from a common base class that provides a consistent interface:
from openmlcrawler.connectors.base import BaseConnector

class CustomConnector(BaseConnector):
    def __init__(self, api_key=None, rate_limit=100):
        super().__init__(api_key=api_key, rate_limit=rate_limit)

    def connect(self):
        """Establish connection to data source"""
        pass

    def fetch_data(self, query_params):
        """Fetch data from source"""
        pass

    def validate_data(self, data):
        """Validate fetched data"""
        pass
Key Features¶
- Rate Limiting: Automatic rate limit handling and backoff strategies (see the sketch after this list)
- Error Recovery: Robust error handling with retry mechanisms
- Data Validation: Built-in data quality checks
- Caching: Intelligent caching to reduce API calls
- Authentication: Multiple authentication methods (API keys, OAuth, etc.)
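To make the rate-limiting idea concrete, here is a minimal client-side sketch. The RateLimiter class is hypothetical, not part of openmlcrawler, and it assumes rate_limit is measured in requests per minute, which is an assumption rather than documented behavior:

import time

class RateLimiter:
    """Hypothetical client-side limiter: at most `rate` calls per minute."""

    def __init__(self, rate=100):
        self.min_interval = 60.0 / rate  # seconds required between calls
        self._last_call = 0.0

    def wait(self):
        """Sleep just long enough to stay under the configured rate."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

limiter = RateLimiter(rate=100)
limiter.wait()  # call before each outgoing request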
Configuration¶
Environment Variables¶
# Weather API Keys
OPENWEATHER_API_KEY=your_openweather_key
WEATHER_UNDERGROUND_API_KEY=your_wunderground_key
# Social Media API Keys
TWITTER_BEARER_TOKEN=your_twitter_token
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_secret
# Financial API Keys
ALPHA_VANTAGE_API_KEY=your_alpha_vantage_key
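Once these variables are set, connectors can read keys at runtime instead of having them hard-coded. A minimal sketch, assuming WeatherConnector accepts the key via api_key as in the usage examples below:

import os

from openmlcrawler.connectors import WeatherConnector

# Read the key from the environment rather than embedding it in code
api_key = os.environ["OPENWEATHER_API_KEY"]
weather = WeatherConnector(api_key=api_key)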
Configuration File¶
connectors:
  weather:
    providers:
      - openweather
      - openmeteo
    cache_ttl: 3600
    rate_limit: 100
  social_media:
    platforms:
      - twitter
      - reddit
    max_retries: 3
    timeout: 30
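The configuration can also be inspected programmatically with PyYAML. How openmlcrawler itself discovers this file is not specified here, so the config.yaml path and loading code below are illustrative assumptions:

import yaml  # requires PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)

weather_cfg = config["connectors"]["weather"]
print(weather_cfg["providers"])   # ['openweather', 'openmeteo']
print(weather_cfg["cache_ttl"])   # 3600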
Usage Examples¶
Basic Usage¶
from openmlcrawler.connectors import WeatherConnector, SocialMediaConnector

# Weather data
weather = WeatherConnector(api_key="your_key")
data = weather.get_current_weather("New York")

# Social media data
social = SocialMediaConnector(api_key="your_key")
tweets = social.search_tweets("#machinelearning", limit=100)
Advanced Usage with Pipelines¶
from openmlcrawler import Pipeline
from openmlcrawler.connectors import WeatherConnector, DatabaseConnector

# Create pipeline
pipeline = Pipeline()

# Add connectors
weather_conn = WeatherConnector(api_key="your_key")
db_conn = DatabaseConnector(connection_string="postgresql://...")

# Configure pipeline
pipeline.add_source(weather_conn)
pipeline.add_destination(db_conn)
pipeline.add_transform(lambda x: x['temperature'] > 20)  # Filter warm days

# Execute
results = pipeline.run()
Best Practices¶
Performance Optimization¶
- Use Caching: Enable caching for frequently accessed data (see the sketch after this list)
- Batch Requests: Use batch operations when available
- Rate Limiting: Respect API rate limits to avoid throttling
- Connection Pooling: Reuse connections for multiple requests
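As a concrete illustration of the caching advice above, here is a tiny in-memory TTL cache. It is a hand-rolled sketch sized to the cache_ttl setting from the configuration file, not openmlcrawler's built-in caching layer:

import time

class TTLCache:
    """Minimal in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl=3600):  # 3600 matches cache_ttl in the config above
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; force a fresh fetch
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

# Reusing the `weather` connector from the Basic Usage example
cache = TTLCache(ttl=3600)
if cache.get("nyc_weather") is None:
    cache.set("nyc_weather", weather.get_current_weather("New York"))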
Error Handling¶
import time
import logging

from openmlcrawler.connectors.base import ConnectorError

logger = logging.getLogger(__name__)

try:
    data = connector.fetch_data(params)
except ConnectorError as e:
    if e.retryable:
        # Implement retry logic
        time.sleep(e.retry_after)
        data = connector.fetch_data(params)
    else:
        # Handle permanent errors
        logger.error(f"Failed to fetch data: {e}")
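The single retry above can be generalized into a loop that honors the max_retries setting from the configuration file. The helper below is a sketch, assuming the retryable and retry_after attributes behave as in the previous example:

import time
import logging

from openmlcrawler.connectors.base import ConnectorError

logger = logging.getLogger(__name__)

def fetch_with_retries(connector, params, max_retries=3):
    """Retry retryable ConnectorErrors up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return connector.fetch_data(params)
        except ConnectorError as e:
            if not e.retryable or attempt == max_retries:
                logger.error(f"Failed to fetch data: {e}")
                raise
            time.sleep(e.retry_after)  # wait as long as the API requests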
Security Considerations¶
- API Key Management: Store keys securely, never in code
- Data Encryption: Encrypt sensitive data in transit and at rest
- Access Control: Implement proper authentication and authorization
- Audit Logging: Log all connector activities for security monitoring (a minimal sketch follows)
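As one way to implement the audit-logging advice, the standard library's logging module is enough for a basic audit trail. The file name and logger name below are illustrative choices, not openmlcrawler conventions:

import logging

logging.basicConfig(
    filename="connector_audit.log",  # illustrative path
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
audit = logging.getLogger("openmlcrawler.audit")

# Record each fetch so activity can be reviewed later
audit.info("weather fetch: location=%s", "New York")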
Troubleshooting¶
Common Issues¶
Rate Limit Exceeded¶
If a provider starts throttling requests, lower the rate_limit setting, enable caching to cut down on repeat calls, and let the built-in backoff strategy pace retries instead of retrying immediately.
Authentication Failed¶
Check that the relevant environment variables (for example OPENWEATHER_API_KEY or TWITTER_BEARER_TOKEN) are set in the running environment and that the credentials have not expired or been revoked.
Data Format Errors¶
Run responses through the connector's validate_data() hook to catch malformed payloads early; upstream APIs occasionally change their response schemas.
Connection Timeouts¶
Increase the timeout setting in the configuration file, raise max_retries so transient failures are retried, and confirm the endpoint is reachable from your network.
Extending Connectors¶
Creating Custom Connectors¶
- Inherit from BaseConnector
- Implement required methods
- Add connector-specific configuration
- Register with the connector registry
from openmlcrawler.connectors.base import BaseConnector
from openmlcrawler.connectors.registry import register_connector

@register_connector('custom')
class CustomConnector(BaseConnector):
    def __init__(self, api_key, endpoint):
        super().__init__(api_key=api_key)
        self.endpoint = endpoint

    def connect(self):
        # Implementation
        pass

    def fetch_data(self, query):
        # Implementation
        pass
Testing Connectors¶
import pytest
from openmlcrawler.connectors import CustomConnector

def test_custom_connector():
    connector = CustomConnector(api_key="test_key", endpoint="http://test.com")
    assert connector.connect() is True

    data = connector.fetch_data({"query": "test"})
    assert len(data) > 0
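To exercise a connector without hitting a live endpoint, the response can be stubbed with the standard library's unittest.mock. A sketch, assuming the CustomConnector defined above:

from unittest.mock import patch

from openmlcrawler.connectors import CustomConnector

def test_fetch_data_without_network():
    connector = CustomConnector(api_key="test_key", endpoint="http://test.com")
    # Stub fetch_data so the test makes no real HTTP request
    with patch.object(CustomConnector, "fetch_data", return_value=[{"ok": True}]):
        data = connector.fetch_data({"query": "test"})
    assert data == [{"ok": True}]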
See Also¶
- Data Processing - Processing data from connectors
- Quality & Privacy - Data quality and privacy controls
- API Reference - Detailed API documentation
- Tutorials - Creating custom connectors