# Configuration
OpenML Crawler supports various configuration methods to customize behavior, set API keys, and control processing parameters.
## Configuration Methods
### 1. Environment Variables

Set configuration through environment variables:

```bash
# Basic settings
export OPENMLCRAWLER_CACHE_DIR=/tmp/openmlcrawler
export OPENMLCRAWLER_LOG_LEVEL=INFO
export OPENMLCRAWLER_LOG_FILE=/var/log/openmlcrawler.log

# API Keys
export TWITTER_BEARER_TOKEN=your_twitter_token
export REDDIT_CLIENT_ID=your_reddit_client_id
export REDDIT_CLIENT_SECRET=your_reddit_client_secret
export FACEBOOK_ACCESS_TOKEN=your_facebook_token

# Cloud credentials
export AWS_ACCESS_KEY_ID=your_aws_key
export AWS_SECRET_ACCESS_KEY=your_aws_secret
export GOOGLE_CLOUD_PROJECT=your_gcp_project
```
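If you launch crawls from Python, the same variables can be set in-process before the crawler reads its configuration. A minimal sketch using only the standard library (the token values are placeholders):

```python
import os

# Environment variables must be in place before the crawler reads its
# configuration, so set them prior to importing/initializing openmlcrawler.
os.environ["OPENMLCRAWLER_CACHE_DIR"] = "/tmp/openmlcrawler"
os.environ["OPENMLCRAWLER_LOG_LEVEL"] = "INFO"
os.environ["TWITTER_BEARER_TOKEN"] = "your_twitter_token"  # placeholder
```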
### 2. Configuration File

Create a YAML configuration file:

```yaml
# config/openmlcrawler.yaml
version: "1.0"

# Cache settings
cache:
  directory: "~/.openmlcrawler/cache"
  max_size: "10GB"
  ttl: "7d"
  compression: "gzip"

# Logging configuration
logging:
  level: "INFO"
  file: "~/.openmlcrawler/logs/openmlcrawler.log"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  max_file_size: "100MB"
  backup_count: 5

# API credentials
credentials:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
  facebook:
    access_token: "${FACEBOOK_ACCESS_TOKEN}"
  us_gov:
    api_key: "${US_GOV_API_KEY}"
  eu_gov:
    api_key: "${EU_GOV_API_KEY}"

# Cloud storage configuration
cloud:
  aws:
    region: "us-east-1"
    profile: "default"
    bucket: "my-openmlcrawler-bucket"
  gcp:
    project: "my-project"
    bucket: "my-openmlcrawler-bucket"
  azure:
    account_name: "mystorageaccount"
    account_key: "${AZURE_ACCOUNT_KEY}"
    container: "openmlcrawler-data"

# Processing settings
processing:
  max_workers: 4
  chunk_size: 1000
  timeout: 30
  retry_attempts: 3
  retry_delay: 1.0

# Data quality settings
quality:
  min_completeness: 0.8
  max_missing_rate: 0.1
  max_duplicate_rate: 0.05
  enable_anomaly_detection: true
  anomaly_threshold: 0.95

# Self-healing settings
self_healing:
  enabled: true
  max_retries: 3
  base_delay: 1.0
  max_delay: 60.0
  backoff_factor: 2.0
  jitter: true
  adaptive_threshold: true

# Monitoring settings
monitoring:
  enabled: true
  alert_email: "admin@example.com"
  alert_webhook: "https://hooks.slack.com/services/..."
  metrics_interval: 60
  anomaly_detection_window: 1000

# Workflow settings
workflow:
  max_concurrent: 5
  timeout: 3600
  enable_progress_bars: true
  save_intermediate: false
```
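The credential entries use `${VAR}` placeholders that are resolved from the environment when the file is loaded. A minimal sketch of how such expansion can be implemented with PyYAML (illustrative; the library's actual loader may behave differently):

```python
import os

import yaml

def load_expanded(path: str) -> dict:
    """Load a YAML file, expanding ${VAR} references from the environment."""
    with open(path) as f:
        raw = f.read()
    # os.path.expandvars replaces $VAR and ${VAR} with environment values;
    # unset variables are left as-is.
    return yaml.safe_load(os.path.expandvars(raw))

config = load_expanded("config/openmlcrawler.yaml")
```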
### 3. Runtime Configuration

Configure settings programmatically:

```python
from openmlcrawler.core.config import OpenMLCrawlerConfig

# Create configuration
config = OpenMLCrawlerConfig()

# Set cache settings
config.cache.directory = "/tmp/cache"
config.cache.max_size = "5GB"

# Set API credentials
config.credentials.twitter.bearer_token = "your_token"
config.credentials.reddit.client_id = "your_client_id"

# Set processing options
config.processing.max_workers = 8
config.processing.chunk_size = 2000

# Apply configuration
config.apply()
```
## Configuration Sections
### Cache Configuration

Control data caching behavior:

```yaml
cache:
  directory: "~/.openmlcrawler/cache"  # Cache directory
  max_size: "10GB"                     # Maximum cache size
  ttl: "7d"                            # Time to live
  compression: "gzip"                  # Compression method
  cleanup_interval: "1h"               # Cleanup interval
```
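Sizes and durations are human-readable strings such as `"10GB"` and `"7d"`. A minimal sketch of how these strings can be parsed (illustrative; the library's own parser may accept additional units):

```python
import re

_SIZE_UNITS = {"KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}
_TIME_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_size(text: str) -> int:
    """Parse a size string like '10GB' into bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)", text.strip(), re.I)
    if not match:
        raise ValueError(f"Unrecognized size: {text!r}")
    return int(float(match.group(1)) * _SIZE_UNITS[match.group(2).upper()])

def parse_ttl(text: str) -> int:
    """Parse a duration string like '7d' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if not match:
        raise ValueError(f"Unrecognized duration: {text!r}")
    return int(match.group(1)) * _TIME_UNITS[match.group(2)]

assert parse_size("10GB") == 10 * 2**30
assert parse_ttl("7d") == 7 * 86400
```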
### Logging Configuration

Configure logging output:

```yaml
logging:
  level: "INFO"                        # Log level (DEBUG, INFO, WARNING, ERROR)
  file: "~/.openmlcrawler/logs/openmlcrawler.log"  # Log file path
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  max_file_size: "100MB"               # Max log file size
  backup_count: 5                      # Number of backup files
  console: true                        # Enable console logging
```
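The `format` string and rotation settings map directly onto Python's standard `logging` module. A sketch of the equivalent setup, assuming `max_file_size` and `backup_count` correspond to `RotatingFileHandler` arguments:

```python
import logging
import logging.handlers
import os

log_path = os.path.expanduser("~/.openmlcrawler/logs/openmlcrawler.log")
os.makedirs(os.path.dirname(log_path), exist_ok=True)

formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")

# Rotate at 100 MB, keeping 5 backups (max_file_size / backup_count).
file_handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=100 * 2**20, backupCount=5
)
file_handler.setFormatter(formatter)

console_handler = logging.StreamHandler()  # console: true
console_handler.setFormatter(formatter)

logger = logging.getLogger("openmlcrawler")
logger.setLevel(logging.INFO)  # level: "INFO"
logger.addHandler(file_handler)
logger.addHandler(console_handler)
```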
### Connector Configuration

Configure individual connectors:

```yaml
connectors:
  weather:
    default_provider: "open-meteo"     # Default weather provider
    cache_weather: true                # Cache weather data
    timeout: 10                        # Request timeout (seconds)
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
    max_results: 100                   # Default max results
    include_retweets: false            # Include retweets
    language_filter: ["en"]            # Language filter
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
    user_agent: "OpenMLCrawler/1.0"
    rate_limit: 60                     # Requests per minute
  facebook:
    access_token: "${FACEBOOK_ACCESS_TOKEN}"
    version: "v18.0"                   # Graph API version
    timeout: 15                        # Request timeout (seconds)
```
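The Reddit connector's `rate_limit` is expressed in requests per minute. A minimal sketch of how such a limit can be enforced client-side (illustrative; the library's internal limiter may differ):

```python
import time

class RateLimiter:
    """Simple fixed-interval limiter for N requests per minute."""

    def __init__(self, requests_per_minute: int):
        self.interval = 60.0 / requests_per_minute
        self.last_call = 0.0

    def wait(self) -> None:
        """Sleep just long enough to honor the configured rate."""
        now = time.monotonic()
        delay = self.interval - (now - self.last_call)
        if delay > 0:
            time.sleep(delay)
        self.last_call = time.monotonic()

# rate_limit: 60 -> at most one request per second
limiter = RateLimiter(60)
for _ in range(3):
    limiter.wait()
    # ... issue the API request here ...
```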
### Cloud Configuration

Configure cloud storage providers:

```yaml
cloud:
  aws:
    region: "us-east-1"
    profile: "default"
    bucket: "my-bucket"
    prefix: "data/"                    # Object prefix
    acl: "private"                     # Access control
  gcp:
    project: "my-project"
    bucket: "my-bucket"
    credentials_file: "~/.gcp/credentials.json"
  azure:
    account_name: "mystorageaccount"
    account_key: "${AZURE_ACCOUNT_KEY}"
    container: "data"
    timeout: 30
```
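For reference, the AWS settings correspond closely to `boto3` session and upload arguments. A sketch of that mapping, assuming `boto3` is installed (the crawler handles this internally):

```python
import boto3

# profile/region from the aws section
session = boto3.Session(profile_name="default", region_name="us-east-1")
s3 = session.client("s3")

# bucket/prefix/acl from the aws section
s3.upload_file(
    "dataset.parquet",
    "my-bucket",
    "data/dataset.parquet",          # prefix + object key
    ExtraArgs={"ACL": "private"},    # acl
)
```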
### Processing Configuration

Control data processing behavior:

```yaml
processing:
  max_workers: 4                       # Maximum worker threads
  chunk_size: 1000                     # Processing chunk size
  timeout: 30                          # Operation timeout (seconds)
  retry_attempts: 3                    # Retry attempts
  retry_delay: 1.0                     # Retry delay (seconds)
  memory_limit: "2GB"                  # Memory limit per worker
  enable_progress_bars: true           # Show progress bars
```
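`max_workers` and `chunk_size` interact: input is split into chunks of `chunk_size` records, and up to `max_workers` chunks are processed concurrently. A minimal sketch of that pattern with the standard library (illustrative; `process_chunk` is a placeholder for real per-chunk work):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk: list) -> list:
    """Placeholder for per-chunk work (parse, clean, validate, ...)."""
    return [item for item in chunk if item is not None]

def process_in_chunks(records: list, max_workers: int = 4,
                      chunk_size: int = 1000) -> list:
    """Split records into chunks and process them on a thread pool."""
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # timeout mirrors the operation timeout in the config
        for processed in pool.map(process_chunk, chunks, timeout=30):
            results.extend(processed)
    return results
```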
### Quality Configuration

Configure data quality checks:

```yaml
quality:
  min_completeness: 0.8                # Minimum completeness score
  max_missing_rate: 0.1                # Maximum missing rate
  max_duplicate_rate: 0.05             # Maximum duplicate rate
  enable_anomaly_detection: true       # Enable anomaly detection
  anomaly_threshold: 0.95              # Anomaly detection threshold
  outlier_method: "iqr"                # Outlier detection method
  validation_rules: []                 # Custom validation rules
```
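The `"iqr"` outlier method is the standard interquartile-range rule: values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` are flagged. A short sketch of the rule with pandas (illustrating the rule itself, not the library's internals):

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

values = pd.Series([10, 12, 11, 13, 12, 11, 250])  # 250 is an outlier
print(iqr_outliers(values))
```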
### Self-Healing Configuration

Configure automatic error recovery:

```yaml
self_healing:
  enabled: true                        # Enable self-healing
  max_retries: 3                       # Maximum retry attempts
  base_delay: 1.0                      # Base retry delay (seconds)
  max_delay: 60.0                      # Maximum retry delay (seconds)
  backoff_factor: 2.0                  # Exponential backoff factor
  jitter: true                         # Add randomness to delays
  adaptive_threshold: true             # Adaptive anomaly threshold
  fallback_sources: []                 # Fallback data sources
```
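These parameters describe capped exponential backoff: attempt `n` waits roughly `min(base_delay * backoff_factor**n, max_delay)`, randomized when `jitter` is enabled. A minimal sketch of that policy (illustrative; not the library's exact retry loop):

```python
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1.0,
                       max_delay=60.0, backoff_factor=2.0, jitter=True):
    """Call func(), retrying on failure with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error
            delay = min(base_delay * backoff_factor ** attempt, max_delay)
            if jitter:
                delay *= random.uniform(0.5, 1.5)  # spread out retry storms
            time.sleep(delay)
```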
### Monitoring Configuration

Configure real-time monitoring:

```yaml
monitoring:
  enabled: true                        # Enable monitoring
  metrics_interval: 60                 # Metrics collection interval (seconds)
  alert_email: "admin@example.com"     # Alert email address
  alert_webhook: "https://hooks.slack.com/..."  # Alert webhook
  anomaly_detection_window: 1000       # Anomaly detection window
  performance_threshold: 0.9           # Performance threshold
  enable_dashboard: true               # Enable monitoring dashboard
```
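`alert_webhook` points at a Slack-style incoming webhook, which expects a JSON payload with a `text` field. A sketch of posting an alert with only the standard library (the payload shape is an assumption based on Slack's webhook format):

```python
import json
import urllib.request

def send_alert(webhook_url: str, message: str) -> None:
    """POST a JSON alert to a Slack-style incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()

# send_alert("https://hooks.slack.com/services/...", "Crawl job failed twice")
```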
## Configuration Loading

### Automatic Configuration Loading

OpenML Crawler automatically loads configuration from the following sources, highest priority first (a sketch of the layered merge follows this list):

- Environment variables (highest priority)
- User configuration file: `~/.openmlcrawler/config.yaml`
- Project configuration: `./openmlcrawler.yaml`
- Default settings (lowest priority)
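One common way to implement this precedence is to merge the layers recursively from lowest to highest priority, letting later layers override earlier ones. A minimal sketch (illustrative; the library's actual merge logic may differ):

```python
def merge_configs(*layers: dict) -> dict:
    """Merge config dicts recursively; later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = value
    return merged

# Lowest priority first: defaults, then project file overrides.
defaults = {"processing": {"max_workers": 4, "timeout": 30}}
project = {"processing": {"max_workers": 8}}
print(merge_configs(defaults, project))
# {'processing': {'max_workers': 8, 'timeout': 30}}
```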
### Manual Configuration Loading

Load configuration programmatically:

```python
from openmlcrawler.core.config import load_config, save_config

# Load from file
config = load_config("my_config.yaml")

# Modify settings
config.processing.max_workers = 8
config.cache.max_size = "20GB"

# Save configuration
save_config(config, "my_config.yaml")
```
### Configuration Validation

Validate configuration settings:

```python
from openmlcrawler.core.config import validate_config

# Validate configuration
errors = validate_config(config)
if errors:
    for error in errors:
        print(f"Configuration error: {error}")
else:
    print("Configuration is valid")
```
## Environment-Specific Configuration

### Development Configuration

```yaml
# config/dev.yaml
logging:
  level: "DEBUG"
  console: true
cache:
  directory: "./cache"
  max_size: "1GB"
processing:
  max_workers: 2
  enable_progress_bars: true
```
### Production Configuration

```yaml
# config/prod.yaml
logging:
  level: "WARNING"
  file: "/var/log/openmlcrawler.log"
cache:
  directory: "/var/cache/openmlcrawler"
  max_size: "100GB"
processing:
  max_workers: 16
  memory_limit: "8GB"
monitoring:
  enabled: true
  alert_email: "ops@company.com"
```
### Testing Configuration

```yaml
# config/test.yaml
cache:
  directory: "/tmp/openmlcrawler_test"
  max_size: "100MB"
processing:
  max_workers: 1
  timeout: 5
logging:
  level: "ERROR"
  console: false
```
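A common pattern is to select the file by environment at startup. A sketch using the `load_config` helper shown earlier (the `OPENMLCRAWLER_ENV` variable name is hypothetical; adapt it to your deployment):

```python
import os

from openmlcrawler.core.config import load_config

# OPENMLCRAWLER_ENV is a hypothetical selector, e.g. "dev", "prod", "test".
env = os.environ.get("OPENMLCRAWLER_ENV", "dev")
config = load_config(f"config/{env}.yaml")
```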
## Configuration Examples

### Basic Setup

```yaml
# Minimal configuration
version: "1.0"
cache:
  directory: "~/.openmlcrawler/cache"
logging:
  level: "INFO"
credentials:
  twitter:
    bearer_token: "your_token_here"
```
### Advanced Setup

```yaml
# Advanced configuration with all features
version: "1.0"
cache:
  directory: "/data/cache"
  max_size: "50GB"
  compression: "lz4"
logging:
  level: "INFO"
  file: "/var/log/openmlcrawler/app.log"
  format: "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
credentials:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
cloud:
  aws:
    region: "us-west-2"
    bucket: "production-data"
processing:
  max_workers: 12
  chunk_size: 5000
  timeout: 60
quality:
  min_completeness: 0.9
  enable_anomaly_detection: true
self_healing:
  enabled: true
  max_retries: 5
  adaptive_threshold: true
monitoring:
  enabled: true
  alert_email: "alerts@company.com"
  metrics_interval: 30
```
## Configuration Best Practices

### Security

- Never commit secrets to version control
- Use environment variables for sensitive data
- Rotate API keys regularly
- Use IAM roles for cloud access when possible

### Performance

- Tune worker counts based on system resources
- Set appropriate cache sizes for your use case
- Configure timeouts to prevent hanging operations
- Enable compression for large datasets

### Monitoring

- Set up alerts for critical operations
- Monitor resource usage and performance
- Log errors and warnings appropriately
- Use appropriate log levels for different environments

### Maintenance

- Regularly clean cache directories
- Monitor log file sizes and rotate as needed
- Update configuration when adding new features
- Test configuration changes in staging first