Data Connectors¶
OpenML Crawler provides a comprehensive set of data connectors for accessing external data sources and APIs. These connectors handle many different kinds of sources through a consistent interface with robust error handling.
Overview¶
Data connectors are the foundation of OpenML Crawler's data ingestion capabilities. Each connector is specifically designed to handle the unique characteristics of its data source, including authentication, rate limiting, data format conversion, and error recovery.
Available Connectors¶
Weather Data Connectors¶
- Weather APIs: Access real-time and historical weather data from multiple providers
  - Open-Meteo: Free weather API with global coverage
  - OpenWeather: Comprehensive weather data with forecasts
  - NOAA: Official US weather data
  - Weather Underground: Community-driven weather stations
Social Media Connectors¶
- Social Media APIs: Collect data from social platforms
  - Twitter API: Real-time tweets, user data, and trends
  - Reddit API: Subreddit data, posts, and comments
  - Facebook Graph API: Social media analytics and user data
  - Instagram API: Media content and user engagement data
Government Data Connectors¶
- Government Data: Access official government datasets
  - data.gov: US government open data portal
  - data.europa.eu: European Union data portal
  - data.gov.uk: UK government data
  - data.gov.in: Indian government data
Financial Data Connectors¶
- Financial Data: Market data and financial indicators
  - Alpha Vantage: Stock market data and technical indicators
  - Yahoo Finance: Historical stock prices and market data
  - Federal Reserve: Economic indicators and monetary data
  - CoinMarketCap: Cryptocurrency data
News Data Connectors¶
- News APIs: Real-time news and media content
  - NewsAPI: Global news from thousands of sources
  - Google News: Search and trending news
  - Bing News: Microsoft news search
  - NY Times API: Premium news content
Health Data Connectors¶
- Health Data: Medical and health-related datasets
  - CDC Data: US Centers for Disease Control data
  - WHO Data: World Health Organization statistics
  - PubMed: Medical research papers and abstracts
  - ClinicalTrials.gov: Clinical trial data
Connector Architecture¶
Base Connector Class¶
All connectors inherit from a common base class that provides a consistent interface:
from openmlcrawler.connectors.base import BaseConnector

class CustomConnector(BaseConnector):
    def __init__(self, api_key=None, rate_limit=100):
        super().__init__(api_key=api_key, rate_limit=rate_limit)

    def connect(self):
        """Establish connection to data source"""
        pass

    def fetch_data(self, query_params):
        """Fetch data from source"""
        pass

    def validate_data(self, data):
        """Validate fetched data"""
        pass
Key Features¶
- Rate Limiting: Automatic rate limit handling and backoff strategies (see the sketch after this list)
- Error Recovery: Robust error handling with retry mechanisms
- Data Validation: Built-in data quality checks
- Caching: Intelligent caching to reduce API calls
- Authentication: Multiple authentication methods (API keys, OAuth, etc.)
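To make the rate-limiting idea concrete, here is a minimal client-side sketch. The RateLimiter class is hypothetical, not part of openmlcrawler, and it assumes rate_limit is measured in requests per minute, which is an assumption rather than documented behavior:

import time

class RateLimiter:
    """Hypothetical client-side limiter: at most `rate` calls per minute."""

    def __init__(self, rate=100):
        self.min_interval = 60.0 / rate  # seconds required between calls
        self._last_call = 0.0

    def wait(self):
        """Sleep just long enough to stay under the configured rate."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

limiter = RateLimiter(rate=100)
limiter.wait()  # call before each outgoing request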
Configuration¶
Environment Variables¶
# Weather API Keys
OPENWEATHER_API_KEY=your_openweather_key
WEATHER_UNDERGROUND_API_KEY=your_wunderground_key
# Social Media API Keys
TWITTER_BEARER_TOKEN=your_twitter_token
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_secret
# Financial API Keys
ALPHA_VANTAGE_API_KEY=your_alpha_vantage_key
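Once these variables are set, connectors can read keys at runtime instead of having them hard-coded. A minimal sketch, assuming WeatherConnector accepts the key via api_key as in the usage examples below:

import os

from openmlcrawler.connectors import WeatherConnector

# Read the key from the environment rather than embedding it in code
api_key = os.environ["OPENWEATHER_API_KEY"]
weather = WeatherConnector(api_key=api_key)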
Configuration File¶
connectors:
  weather:
    providers:
      - openweather
      - openmeteo
    cache_ttl: 3600
    rate_limit: 100
  social_media:
    platforms:
      - twitter
      - reddit
    max_retries: 3
    timeout: 30
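The configuration can also be inspected programmatically with PyYAML. How openmlcrawler itself discovers this file is not specified here, so the config.yaml path and loading code below are illustrative assumptions:

import yaml  # requires PyYAML

with open("config.yaml") as f:
    config = yaml.safe_load(f)

weather_cfg = config["connectors"]["weather"]
print(weather_cfg["providers"])   # ['openweather', 'openmeteo']
print(weather_cfg["cache_ttl"])   # 3600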
Usage Examples¶
Basic Usage¶
from openmlcrawler.connectors import WeatherConnector, SocialMediaConnector

# Weather data
weather = WeatherConnector(api_key="your_key")
data = weather.get_current_weather("New York")

# Social media data
social = SocialMediaConnector(api_key="your_key")
tweets = social.search_tweets("#machinelearning", limit=100)
Advanced Usage with Pipelines¶
from openmlcrawler import Pipeline
from openmlcrawler.connectors import WeatherConnector, DatabaseConnector

# Create pipeline
pipeline = Pipeline()

# Add connectors
weather_conn = WeatherConnector(api_key="your_key")
db_conn = DatabaseConnector(connection_string="postgresql://...")

# Configure pipeline
pipeline.add_source(weather_conn)
pipeline.add_destination(db_conn)
pipeline.add_transform(lambda x: x['temperature'] > 20)  # Filter warm days

# Execute
results = pipeline.run()
Best Practices¶
Performance Optimization¶
- Use Caching: Enable caching for frequently accessed data (see the sketch after this list)
- Batch Requests: Use batch operations when available
- Rate Limiting: Respect API rate limits to avoid throttling
- Connection Pooling: Reuse connections for multiple requests
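As a concrete illustration of the caching advice above, here is a tiny in-memory TTL cache. It is a hand-rolled sketch sized to the cache_ttl setting from the configuration file, not openmlcrawler's built-in caching layer:

import time

class TTLCache:
    """Minimal in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl=3600):  # 3600 matches cache_ttl in the config above
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; force a fresh fetch
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

# Reusing the `weather` connector from the Basic Usage example
cache = TTLCache(ttl=3600)
if cache.get("nyc_weather") is None:
    cache.set("nyc_weather", weather.get_current_weather("New York"))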
Error Handling¶
import time
import logging

from openmlcrawler.connectors.base import ConnectorError

logger = logging.getLogger(__name__)

try:
    data = connector.fetch_data(params)
except ConnectorError as e:
    if e.retryable:
        # Implement retry logic
        time.sleep(e.retry_after)
        data = connector.fetch_data(params)
    else:
        # Handle permanent errors
        logger.error(f"Failed to fetch data: {e}")
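The single retry above can be generalized into a loop that honors the max_retries setting from the configuration file. The helper below is a sketch, assuming the retryable and retry_after attributes behave as in the previous example:

import time
import logging

from openmlcrawler.connectors.base import ConnectorError

logger = logging.getLogger(__name__)

def fetch_with_retries(connector, params, max_retries=3):
    """Retry retryable ConnectorErrors up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return connector.fetch_data(params)
        except ConnectorError as e:
            if not e.retryable or attempt == max_retries:
                logger.error(f"Failed to fetch data: {e}")
                raise
            time.sleep(e.retry_after)  # wait as long as the API requests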
Security Considerations¶
- API Key Management: Store keys securely, never in code
- Data Encryption: Encrypt sensitive data in transit and at rest
- Access Control: Implement proper authentication and authorization
- Audit Logging: Log all connector activities for security monitoring (a minimal sketch follows)
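As one way to implement the audit-logging advice, the standard library's logging module is enough for a basic audit trail. The file name and logger name below are illustrative choices, not openmlcrawler conventions:

import logging

logging.basicConfig(
    filename="connector_audit.log",  # illustrative path
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
audit = logging.getLogger("openmlcrawler.audit")

# Record each fetch so activity can be reviewed later
audit.info("weather fetch: location=%s", "New York")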
Troubleshooting¶
Common Issues¶
Rate Limit Exceeded¶
If a provider starts throttling requests, lower the rate_limit setting, enable caching to cut down on repeat calls, and let the built-in backoff strategy pace retries instead of retrying immediately.
Authentication Failed¶
Check that the relevant environment variables (for example OPENWEATHER_API_KEY or TWITTER_BEARER_TOKEN) are set in the running environment and that the credentials have not expired or been revoked.
Data Format Errors¶
Run responses through the connector's validate_data() hook to catch malformed payloads early; upstream APIs occasionally change their response schemas.
Connection Timeouts¶
Increase the timeout setting in the configuration file, raise max_retries so transient failures are retried, and confirm the endpoint is reachable from your network.
Extending Connectors¶
Creating Custom Connectors¶
- Inherit from BaseConnector
- Implement required methods
- Add connector-specific configuration
- Register with the connector registry
from openmlcrawler.connectors.base import BaseConnector
from openmlcrawler.connectors.registry import register_connector

@register_connector('custom')
class CustomConnector(BaseConnector):
    def __init__(self, api_key, endpoint):
        super().__init__(api_key=api_key)
        self.endpoint = endpoint

    def connect(self):
        # Implementation
        pass

    def fetch_data(self, query):
        # Implementation
        pass
Testing Connectors¶
import pytest
from openmlcrawler.connectors import CustomConnector

def test_custom_connector():
    connector = CustomConnector(api_key="test_key", endpoint="http://test.com")
    assert connector.connect() is True

    data = connector.fetch_data({"query": "test"})
    assert len(data) > 0
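To exercise a connector without hitting a live endpoint, the response can be stubbed with the standard library's unittest.mock. A sketch, assuming the CustomConnector defined above:

from unittest.mock import patch

from openmlcrawler.connectors import CustomConnector

def test_fetch_data_without_network():
    connector = CustomConnector(api_key="test_key", endpoint="http://test.com")
    # Stub fetch_data so the test makes no real HTTP request
    with patch.object(CustomConnector, "fetch_data", return_value=[{"ok": True}]):
        data = connector.fetch_data({"query": "test"})
    assert data == [{"ok": True}]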
See Also¶
- Data Processing - Processing data from connectors
- Quality & Privacy - Data quality and privacy controls
- API Reference - Detailed API documentation
- Tutorials - Creating custom connectors