Installation

This guide will help you install OpenML Crawler and get started with data collection and processing.

Prerequisites

System Requirements

  • Python: 3.8 or higher
  • Operating System: Windows, macOS, or Linux
  • Memory: At least 4GB RAM (8GB recommended for large datasets)
  • Storage: At least 2GB free space
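The Python version requirement can be checked programmatically before installing. A minimal sketch using only the standard library:

```python
import sys

MIN_VERSION = (3, 8)  # minimum Python version for OpenML Crawler

# sys.version_info compares element-wise against the (major, minor) tuple
if sys.version_info < MIN_VERSION:
    raise RuntimeError(
        f"Python {'.'.join(map(str, MIN_VERSION))}+ required, "
        f"found {sys.version.split()[0]}"
    )
print("Python version OK:", sys.version.split()[0])
```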

Python Dependencies

The following packages are automatically installed with OpenML Crawler:

  • pandas - Data manipulation and analysis
  • requests - HTTP client for API calls
  • numpy - Numerical computing
  • scikit-learn - Machine learning algorithms
  • beautifulsoup4 - HTML parsing
  • lxml - XML parsing
  • aiohttp - Async HTTP client
  • click - Command-line interface
  • pyyaml - YAML configuration files
  • tqdm - Progress bars
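If you suspect one of these dependencies did not install cleanly, the standard-library `importlib.metadata` module can report which are present. A small sketch (the package list mirrors the one above):

```python
from importlib.metadata import version, PackageNotFoundError

def check_deps(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for pkg in packages:
        try:
            found[pkg] = version(pkg)
        except PackageNotFoundError:
            found[pkg] = None
    return found

deps = ["pandas", "requests", "numpy", "scikit-learn", "beautifulsoup4",
        "lxml", "aiohttp", "click", "pyyaml", "tqdm"]
for name, ver in check_deps(deps).items():
    print(f"{name}: {ver or 'MISSING'}")
```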

Installation Methods

Method 1: Install from PyPI (Recommended)

The easiest way to install OpenML Crawler is from the Python Package Index (PyPI):

pip install openmlcrawler

Virtual Environment

It's recommended to use a virtual environment to avoid conflicts with system packages:

    # Create virtual environment
    python -m venv openmlcrawler-env

    # Activate virtual environment
    # On Windows:
    openmlcrawler-env\Scripts\activate
    # On macOS/Linux:
    source openmlcrawler-env/bin/activate

    # Install OpenML Crawler
    pip install openmlcrawler

Method 2: Install from Source

If you want the latest development version or want to contribute to the project:

# Clone the repository
git clone https://github.com/krish567366/openmlcrawler.git
cd openmlcrawler

# Install in development mode
pip install -e .

Method 3: Install with Development Dependencies

For contributors or users who want to run tests and use development tools:

# Install with development dependencies
pip install -e ".[dev]"

# Or install specific extras
pip install -e ".[ml]"        # ML pipeline support
pip install -e ".[cloud]"     # Cloud storage support
pip install -e ".[nlp]"       # NLP processing support
pip install -e ".[all]"       # All optional dependencies

Optional Dependencies

OpenML Crawler supports various optional dependencies for extended functionality:

Machine Learning Support

pip install openmlcrawler[ml]
# Includes: scikit-learn, xgboost, lightgbm, catboost

Cloud Storage Support

pip install openmlcrawler[cloud]
# Includes: boto3, google-cloud-storage, azure-storage-blob

NLP Processing Support

pip install openmlcrawler[nlp]
# Includes: spacy, transformers, nltk

Web Scraping Support

pip install openmlcrawler[scraping]
# Includes: selenium, playwright, scrapy

Database Support

pip install openmlcrawler[database]
# Includes: sqlalchemy, psycopg2, pymongo

All Features

pip install openmlcrawler[all]
# Includes all optional dependencies

Verification

After installation, verify that OpenML Crawler is working correctly:

# Check version
python -c "import openmlcrawler; print(openmlcrawler.__version__)"

# Test basic functionality
python -c "from openmlcrawler import load_dataset; print('Installation successful!')"

# Test CLI
openmlcrawler --help

Configuration

Environment Variables

Set up environment variables for API keys and configuration:

# Create .env file in your project directory
echo "OPENMLCRAWLER_CACHE_DIR=/path/to/cache" > .env
echo "OPENMLCRAWLER_LOG_LEVEL=INFO" >> .env

# For API keys (optional)
echo "TWITTER_BEARER_TOKEN=your_token_here" >> .env
echo "REDDIT_CLIENT_ID=your_client_id" >> .env
echo "FACEBOOK_ACCESS_TOKEN=your_token" >> .env
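Note that a `.env` file is not read by the shell automatically; export the variables (or use a loader such as `python-dotenv`, not installed by default) before running your scripts. Once set, they can be read with the standard library — the variable names below match those written to `.env` above:

```python
import os

# Fall back to sensible defaults when a variable is unset
cache_dir = os.environ.get("OPENMLCRAWLER_CACHE_DIR", "~/.openmlcrawler/cache")
log_level = os.environ.get("OPENMLCRAWLER_LOG_LEVEL", "INFO")
twitter_token = os.environ.get("TWITTER_BEARER_TOKEN")  # None if unset

print("Cache dir:", os.path.expanduser(cache_dir))
print("Log level:", log_level)
print("Twitter connector:", "enabled" if twitter_token else "disabled")
```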

Configuration File

Create a configuration file for advanced settings:

# config/openmlcrawler.yaml
cache:
  directory: "~/.openmlcrawler/cache"
  max_size: "10GB"
  ttl: "7d"

logging:
  level: "INFO"
  file: "~/.openmlcrawler/logs/openmlcrawler.log"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

connectors:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
  facebook:
    access_token: "${FACEBOOK_ACCESS_TOKEN}"

cloud:
  aws:
    region: "us-east-1"
    profile: "default"
  gcp:
    project: "my-project"
  azure:
    account_name: "mystorageaccount"
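The `${VAR}` placeholders above are expanded from environment variables. Whether OpenML Crawler performs this expansion itself depends on your version; as a sketch of the mechanism, `string.Template` plus PyYAML (a core dependency) is enough. The `load_config` helper below is illustrative, not part of the library's API:

```python
import os
import string
import yaml  # PyYAML, installed with OpenML Crawler

def load_config(text, env=os.environ):
    """Parse YAML config text, substituting ${VAR} placeholders from env."""
    # safe_substitute leaves unknown ${VAR} placeholders intact
    expanded = string.Template(text).safe_substitute(env)
    return yaml.safe_load(expanded)

raw = """
cache:
  directory: "~/.openmlcrawler/cache"
connectors:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
"""
cfg = load_config(raw, {"TWITTER_BEARER_TOKEN": "abc123"})
print(cfg["connectors"]["twitter"]["bearer_token"])
```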

Troubleshooting

Common Installation Issues

1. Permission Errors

If you get permission errors during installation:

# Use --user flag
pip install --user openmlcrawler

# Or use sudo (not recommended)
sudo pip install openmlcrawler

2. SSL Certificate Issues

If you encounter SSL certificate verification errors:

# Explicitly trust the PyPI hosts (skips certificate verification for
# those hosts only; not recommended for production)
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org openmlcrawler

# Or update certificates
pip install --upgrade certifi

3. Dependency Conflicts

If you have dependency conflicts with existing packages:

# Use pip-tools for dependency management
pip install pip-tools
pip-compile requirements.in
pip-sync

# Or create a clean virtual environment
python -m venv clean-env
source clean-env/bin/activate  # On Windows: clean-env\Scripts\activate
pip install openmlcrawler

Platform-Specific Issues

Windows

# Install Microsoft Visual C++ Build Tools if needed
# Download from: https://visualstudio.microsoft.com/visual-cpp-build-tools/

# Use PowerShell instead of Command Prompt for better compatibility
# Activate virtual environment
.\venv\Scripts\Activate.ps1

macOS

# Install Xcode command line tools
xcode-select --install

# Use Homebrew for additional dependencies
brew install libxml2 libxslt

Linux

# Install system dependencies
sudo apt-get update
sudo apt-get install python3-dev libxml2-dev libxslt-dev

# Or for CentOS/RHEL
sudo yum install python3-devel libxml2-devel libxslt-devel

Next Steps

After successful installation:

  1. Read the Quick Start Guide: Learn basic usage patterns
  2. Explore Connectors: See available data sources
  3. Try Tutorials: Follow step-by-step examples
  4. Check API Reference: Understand advanced features

Support

If you encounter issues during installation: