Configuration

OpenML Crawler supports three configuration methods (environment variables, a YAML configuration file, and runtime configuration) for customizing behavior, setting API keys, and controlling processing parameters.

Configuration Methods

1. Environment Variables

Set configuration through environment variables:

# Basic settings
export OPENMLCRAWLER_CACHE_DIR=/tmp/openmlcrawler
export OPENMLCRAWLER_LOG_LEVEL=INFO
export OPENMLCRAWLER_LOG_FILE=/var/log/openmlcrawler.log

# API Keys
export TWITTER_BEARER_TOKEN=your_twitter_token
export REDDIT_CLIENT_ID=your_reddit_client_id
export REDDIT_CLIENT_SECRET=your_reddit_client_secret
export FACEBOOK_ACCESS_TOKEN=your_facebook_token

# Cloud credentials
export AWS_ACCESS_KEY_ID=your_aws_key
export AWS_SECRET_ACCESS_KEY=your_aws_secret
export GOOGLE_CLOUD_PROJECT=your_gcp_project

2. Configuration File

Create a YAML configuration file:

# config/openmlcrawler.yaml
version: "1.0"

# Cache settings
cache:
  directory: "~/.openmlcrawler/cache"
  max_size: "10GB"
  ttl: "7d"
  compression: "gzip"

# Logging configuration
logging:
  level: "INFO"
  file: "~/.openmlcrawler/logs/openmlcrawler.log"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  max_file_size: "100MB"
  backup_count: 5

# API credentials
credentials:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
  facebook:
    access_token: "${FACEBOOK_ACCESS_TOKEN}"
  us_gov:
    api_key: "${US_GOV_API_KEY}"
  eu_gov:
    api_key: "${EU_GOV_API_KEY}"

# Cloud storage configuration
cloud:
  aws:
    region: "us-east-1"
    profile: "default"
    bucket: "my-openmlcrawler-bucket"
  gcp:
    project: "my-project"
    bucket: "my-openmlcrawler-bucket"
  azure:
    account_name: "mystorageaccount"
    account_key: "${AZURE_ACCOUNT_KEY}"
    container: "openmlcrawler-data"

# Processing settings
processing:
  max_workers: 4
  chunk_size: 1000
  timeout: 30
  retry_attempts: 3
  retry_delay: 1.0

# Data quality settings
quality:
  min_completeness: 0.8
  max_missing_rate: 0.1
  max_duplicate_rate: 0.05
  enable_anomaly_detection: true
  anomaly_threshold: 0.95

# Self-healing settings
self_healing:
  enabled: true
  max_retries: 3
  base_delay: 1.0
  max_delay: 60.0
  backoff_factor: 2.0
  jitter: true
  adaptive_threshold: true

# Monitoring settings
monitoring:
  enabled: true
  alert_email: "admin@example.com"
  alert_webhook: "https://hooks.slack.com/services/..."
  metrics_interval: 60
  anomaly_detection_window: 1000

# Workflow settings
workflow:
  max_concurrent: 5
  timeout: 3600
  enable_progress_bars: true
  save_intermediate: false

3. Runtime Configuration

Configure settings programmatically:

from openmlcrawler.core.config import OpenMLCrawlerConfig

# Create configuration
config = OpenMLCrawlerConfig()

# Set cache settings
config.cache.directory = "/tmp/cache"
config.cache.max_size = "5GB"

# Set API credentials
config.credentials.twitter.bearer_token = "your_token"
config.credentials.reddit.client_id = "your_client_id"

# Set processing options
config.processing.max_workers = 8
config.processing.chunk_size = 2000

# Apply configuration
config.apply()

Configuration Sections

Cache Configuration

Control data caching behavior:

cache:
  directory: "~/.openmlcrawler/cache"  # Cache directory
  max_size: "10GB"                     # Maximum cache size
  ttl: "7d"                           # Time to live
  compression: "gzip"                 # Compression method
  cleanup_interval: "1h"              # Cleanup interval

Logging Configuration

Configure logging output:

logging:
  level: "INFO"                       # Log level (DEBUG, INFO, WARNING, ERROR)
  file: "~/.openmlcrawler/logs/openmlcrawler.log"  # Log file path
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  max_file_size: "100MB"              # Max log file size
  backup_count: 5                     # Number of backup files
  console: true                       # Enable console logging

Connector Configuration

Configure individual connectors:

connectors:
  weather:
    default_provider: "open-meteo"    # Default weather provider
    cache_weather: true               # Cache weather data
    timeout: 10                       # Request timeout

  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
    max_results: 100                  # Default max results
    include_retweets: false           # Include retweets
    language_filter: ["en"]           # Language filter

  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
    user_agent: "OpenMLCrawler/1.0"
    rate_limit: 60                    # Requests per minute

  facebook:
    access_token: "${FACEBOOK_ACCESS_TOKEN}"
    version: "v18.0"                  # Graph API version
    timeout: 15                       # Request timeout

Cloud Configuration

Configure cloud storage providers:

cloud:
  aws:
    region: "us-east-1"
    profile: "default"
    bucket: "my-bucket"
    prefix: "data/"                   # Object prefix
    acl: "private"                    # Access control

  gcp:
    project: "my-project"
    bucket: "my-bucket"
    credentials_file: "~/.gcp/credentials.json"

  azure:
    account_name: "mystorageaccount"
    account_key: "${AZURE_ACCOUNT_KEY}"
    container: "data"
    timeout: 30

Processing Configuration

Control data processing behavior:

processing:
  max_workers: 4                      # Maximum worker threads
  chunk_size: 1000                    # Processing chunk size
  timeout: 30                         # Operation timeout
  retry_attempts: 3                   # Retry attempts
  retry_delay: 1.0                    # Retry delay
  memory_limit: "2GB"                 # Memory limit per worker
  enable_progress_bars: true          # Show progress bars

Quality Configuration

Configure data quality checks:

quality:
  min_completeness: 0.8              # Minimum completeness score
  max_missing_rate: 0.1              # Maximum missing rate
  max_duplicate_rate: 0.05           # Maximum duplicate rate
  enable_anomaly_detection: true     # Enable anomaly detection
  anomaly_threshold: 0.95            # Anomaly detection threshold
  outlier_method: "iqr"              # Outlier detection method
  validation_rules: []               # Custom validation rules

Self-Healing Configuration

Configure automatic error recovery:

self_healing:
  enabled: true                      # Enable self-healing
  max_retries: 3                     # Maximum retry attempts
  base_delay: 1.0                    # Base retry delay
  max_delay: 60.0                    # Maximum retry delay
  backoff_factor: 2.0                # Exponential backoff factor
  jitter: true                       # Add randomness to delays
  adaptive_threshold: true           # Adaptive anomaly threshold
  fallback_sources: []               # Fallback data sources

Monitoring Configuration

Configure real-time monitoring:

monitoring:
  enabled: true                      # Enable monitoring
  metrics_interval: 60               # Metrics collection interval
  alert_email: "admin@example.com"   # Alert email address
  alert_webhook: "https://hooks.slack.com/..."  # Alert webhook
  anomaly_detection_window: 1000     # Anomaly detection window
  performance_threshold: 0.9         # Performance threshold
  enable_dashboard: true             # Enable monitoring dashboard

Configuration Loading

Automatic Configuration Loading

OpenML Crawler automatically loads configuration from the following sources (see the sketch after this list):

  1. Environment variables (highest priority)
  2. Configuration file: ~/.openmlcrawler/config.yaml
  3. Project configuration: ./openmlcrawler.yaml
  4. Default settings (lowest priority)
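
The library performs this resolution itself; the sketch below only illustrates the precedence. The resolve_config helper is hypothetical, and the attribute names are assumed to mirror the YAML keys documented above.

import os
from pathlib import Path

from openmlcrawler.core.config import OpenMLCrawlerConfig, load_config

def resolve_config() -> OpenMLCrawlerConfig:
    # 2./3. Use the highest-priority configuration file that exists.
    for candidate in (Path.home() / ".openmlcrawler/config.yaml",
                      Path("openmlcrawler.yaml")):
        if candidate.exists():
            config = load_config(str(candidate))
            break
    else:
        config = OpenMLCrawlerConfig()  # 4. Fall back to library defaults.

    # 1. Environment variables take precedence over any file.
    if "OPENMLCRAWLER_CACHE_DIR" in os.environ:
        config.cache.directory = os.environ["OPENMLCRAWLER_CACHE_DIR"]
    if "OPENMLCRAWLER_LOG_LEVEL" in os.environ:
        config.logging.level = os.environ["OPENMLCRAWLER_LOG_LEVEL"]
    return config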

Manual Configuration Loading

Load, modify, and save configuration programmatically:

from openmlcrawler.core.config import load_config, save_config

# Load from file
config = load_config("my_config.yaml")

# Modify settings
config.processing.max_workers = 8
config.cache.max_size = "20GB"

# Save configuration
save_config(config, "my_config.yaml")

Configuration Validation

Validate configuration settings:

from openmlcrawler.core.config import validate_config

# Validate configuration
errors = validate_config(config)

if errors:
    for error in errors:
        print(f"Configuration error: {error}")
else:
    print("Configuration is valid")

Environment-Specific Configuration

Development Configuration

# config/dev.yaml
logging:
  level: "DEBUG"
  console: true

cache:
  directory: "./cache"
  max_size: "1GB"

processing:
  max_workers: 2
  enable_progress_bars: true

Production Configuration

# config/prod.yaml
logging:
  level: "WARNING"
  file: "/var/log/openmlcrawler.log"

cache:
  directory: "/var/cache/openmlcrawler"
  max_size: "100GB"

processing:
  max_workers: 16
  memory_limit: "8GB"

monitoring:
  enabled: true
  alert_email: "ops@company.com"

Testing Configuration

# config/test.yaml
cache:
  directory: "/tmp/openmlcrawler_test"
  max_size: "100MB"

processing:
  max_workers: 1
  timeout: 5

logging:
  level: "ERROR"
  console: false
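
A common pattern, though not one the library prescribes, is to select one of these files at start-up based on an environment variable. In the sketch below the APP_ENV variable name is an assumption, and load_config is assumed to return the same configuration object used in the runtime example above.

import os

from openmlcrawler.core.config import load_config

# Select the configuration file for the current environment.
env = os.environ.get("APP_ENV", "dev")  # e.g. "dev", "prod", or "test"
config = load_config(f"config/{env}.yaml")
config.apply()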

Configuration Examples

Basic Setup

# Minimal configuration
version: "1.0"

cache:
  directory: "~/.openmlcrawler/cache"

logging:
  level: "INFO"

credentials:
  twitter:
    bearer_token: "your_token_here"

Advanced Setup

# Advanced configuration with all features
version: "1.0"

cache:
  directory: "/data/cache"
  max_size: "50GB"
  compression: "lz4"

logging:
  level: "INFO"
  file: "/var/log/openmlcrawler/app.log"
  format: "%(asctime)s [%(levelname)s] %(name)s: %(message)s"

credentials:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"

cloud:
  aws:
    region: "us-west-2"
    bucket: "production-data"

processing:
  max_workers: 12
  chunk_size: 5000
  timeout: 60

quality:
  min_completeness: 0.9
  enable_anomaly_detection: true

self_healing:
  enabled: true
  max_retries: 5
  adaptive_threshold: true

monitoring:
  enabled: true
  alert_email: "alerts@company.com"
  metrics_interval: 30

Configuration Best Practices

Security

  1. Never commit secrets to version control
  2. Use environment variables for sensitive data (see the example after this list)
  3. Rotate API keys regularly
  4. Use IAM roles for cloud access when possible
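
One way to follow points 1 and 2 is to keep secrets out of configuration files entirely and inject them at runtime. A minimal sketch using the runtime API shown earlier:

import os

from openmlcrawler.core.config import OpenMLCrawlerConfig

config = OpenMLCrawlerConfig()
# Read the token from the environment instead of hardcoding it in source
# or committing it in a YAML file.
config.credentials.twitter.bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
config.apply()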

Performance

  1. Tune worker counts based on system resources (illustrated below)
  2. Set appropriate cache sizes for your use case
  3. Configure timeouts to prevent hanging operations
  4. Enable compression for large datasets
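
For points 1 and 3, worker counts and timeouts can be derived from the host machine at start-up. A sketch, assuming the processing attributes mirror the YAML keys documented above:

import os

from openmlcrawler.core.config import OpenMLCrawlerConfig

config = OpenMLCrawlerConfig()
# Leave one core free for the rest of the system.
config.processing.max_workers = max(1, (os.cpu_count() or 1) - 1)
# Fail fast instead of letting a slow source hang the pipeline.
config.processing.timeout = 30
config.apply()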

Monitoring

  1. Set up alerts for critical operations (see the sketch after this list)
  2. Monitor resource usage and performance
  3. Log errors and warnings appropriately
  4. Use appropriate log levels for different environments
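
Alert routing can be set from code as well as from a file. A sketch, assuming the monitoring attributes mirror the YAML keys documented above and using placeholder values:

from openmlcrawler.core.config import OpenMLCrawlerConfig

config = OpenMLCrawlerConfig()
config.monitoring.enabled = True
config.monitoring.alert_email = "ops@example.com"  # placeholder address
config.monitoring.metrics_interval = 30            # collect metrics more often
config.apply()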

Maintenance

  1. Regularly clean cache directories (see the sketch below)
  2. Monitor log file sizes and rotate as needed
  3. Update configuration when adding new features
  4. Test configuration changes in staging first
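
For point 1, stale cache files can be pruned with a small standalone script if the built-in cleanup_interval setting is not sufficient. The sketch below assumes the default cache directory and the seven-day TTL shown earlier; it is not part of the library:

import time
from pathlib import Path

# Prune files older than the cache TTL (here seven days, matching ttl: "7d").
cache_dir = Path.home() / ".openmlcrawler/cache"
cutoff = time.time() - 7 * 24 * 3600

for path in cache_dir.rglob("*"):
    if path.is_file() and path.stat().st_mtime < cutoff:
        path.unlink()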