Installation¶
This guide will help you install OpenML Crawler and get started with data collection and processing.
Prerequisites¶
System Requirements¶
- Python: 3.8 or higher
- Operating System: Windows, macOS, or Linux
- Memory: At least 4GB RAM (8GB recommended for large datasets)
- Storage: At least 2GB free space
Python Dependencies¶
The following packages are automatically installed with OpenML Crawler:
- pandas: Data manipulation and analysis
- requests: HTTP client for API calls
- numpy: Numerical computing
- scikit-learn: Machine learning algorithms
- beautifulsoup4: HTML parsing
- lxml: XML parsing
- aiohttp: Async HTTP client
- click: Command-line interface
- pyyaml: YAML configuration files
- tqdm: Progress bars
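As a quick sanity check, you can confirm that these packages are importable without running the full crawler. A minimal sketch using only the standard library (note that some packages install under a different import name, as commented below):

```python
from importlib.util import find_spec

# Import names for the dependencies listed above; beautifulsoup4
# imports as "bs4", pyyaml as "yaml", scikit-learn as "sklearn".
DEPENDENCIES = ["pandas", "requests", "numpy", "sklearn", "bs4",
                "lxml", "aiohttp", "click", "yaml", "tqdm"]

# find_spec returns None when a module cannot be located
missing = [name for name in DEPENDENCIES if find_spec(name) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies available.")
```

If anything is reported missing, reinstalling OpenML Crawler (or the named package) in the active environment should resolve it.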
Installation Methods¶
Method 1: Install from PyPI (Recommended)¶
The easiest way to install OpenML Crawler is from the Python Package Index:
pip install openmlcrawler
Virtual Environment
It's recommended to use a virtual environment to avoid conflicts with system packages:
# Create virtual environment
python -m venv openmlcrawler-env
# Activate virtual environment
# On Windows:
openmlcrawler-env\Scripts\activate
# On macOS/Linux:
source openmlcrawler-env/bin/activate
# Install OpenML Crawler
pip install openmlcrawler
Method 2: Install from Source¶
If you want the latest development version or want to contribute to the project:
# Clone the repository
git clone https://github.com/krish567366/openmlcrawler.git
cd openmlcrawler
# Install in development mode
pip install -e .
Method 3: Install with Development Dependencies¶
For contributors or users who want to run tests and use development tools:
# Install with development dependencies
pip install -e ".[dev]"
# Or install specific extras
pip install -e ".[ml]" # ML pipeline support
pip install -e ".[cloud]" # Cloud storage support
pip install -e ".[nlp]" # NLP processing support
pip install -e ".[all]" # All optional dependencies
Optional Dependencies¶
OpenML Crawler supports various optional dependencies for extended functionality:
Machine Learning Support¶
pip install "openmlcrawler[ml]"
Cloud Storage Support¶
pip install "openmlcrawler[cloud]"
NLP Processing Support¶
pip install "openmlcrawler[nlp]"
Web Scraping Support¶
Database Support¶
All Features¶
pip install "openmlcrawler[all]"
Verification¶
After installation, verify that OpenML Crawler is working correctly:
# Check version
python -c "import openmlcrawler; print(openmlcrawler.__version__)"
# Test basic functionality
python -c "from openmlcrawler import load_dataset; print('Installation successful!')"
# Test CLI
openmlcrawler --help
Configuration¶
Environment Variables¶
Set up environment variables for API keys and configuration:
# Create .env file in your project directory
echo "OPENMLCRAWLER_CACHE_DIR=/path/to/cache" > .env
echo "OPENMLCRAWLER_LOG_LEVEL=INFO" >> .env
# For API keys (optional)
echo "TWITTER_BEARER_TOKEN=your_token_here" >> .env
echo "REDDIT_CLIENT_ID=your_client_id" >> .env
echo "FACEBOOK_ACCESS_TOKEN=your_token" >> .env
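If your setup does not read .env files automatically, you can load one into the process environment yourself. A minimal standard-library sketch (production setups often use the python-dotenv package instead):

```python
import os

def load_dotenv(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Don't overwrite variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip())

load_dotenv()
log_level = os.environ.get("OPENMLCRAWLER_LOG_LEVEL", "INFO")
```

Using setdefault means a variable exported in your shell takes precedence over the .env file, which is the behavior most tooling expects.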
Configuration File¶
Create a configuration file for advanced settings:
# config/openmlcrawler.yaml
cache:
  directory: "~/.openmlcrawler/cache"
  max_size: "10GB"
  ttl: "7d"

logging:
  level: "INFO"
  file: "~/.openmlcrawler/logs/openmlcrawler.log"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

connectors:
  twitter:
    bearer_token: "${TWITTER_BEARER_TOKEN}"
  reddit:
    client_id: "${REDDIT_CLIENT_ID}"
    client_secret: "${REDDIT_CLIENT_SECRET}"
  facebook:
    access_token: "${FACEBOOK_ACCESS_TOKEN}"

cloud:
  aws:
    region: "us-east-1"
    profile: "default"
  gcp:
    project: "my-project"
  azure:
    account_name: "mystorageaccount"
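The ${VAR} placeholders above are resolved from environment variables. If you load the file yourself, note that PyYAML does not expand them automatically; a minimal sketch of the substitution logic (the keys shown are from the example config above):

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

def expand_env(value):
    """Recursively replace ${VAR} placeholders with environment values,
    leaving placeholders for unset variables intact."""
    if isinstance(value, str):
        return _PLACEHOLDER.sub(
            lambda m: os.environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    return value

# Example: expand a fragment of the connectors section
os.environ["TWITTER_BEARER_TOKEN"] = "example-token"
config = {"connectors": {"twitter": {"bearer_token": "${TWITTER_BEARER_TOKEN}"}}}
expanded = expand_env(config)
```

Leaving unset placeholders intact (rather than substituting an empty string) makes missing credentials easier to spot in logs.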
Troubleshooting¶
Common Installation Issues¶
1. Permission Errors¶
If you get permission errors during installation:
# Use --user flag
pip install --user openmlcrawler
# Or use sudo (not recommended)
sudo pip install openmlcrawler
2. SSL Certificate Issues¶
If you encounter SSL certificate verification errors:
# Trust the PyPI hosts without certificate verification (not recommended for production)
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org openmlcrawler
# Or update certificates
pip install --upgrade certifi
3. Dependency Conflicts¶
If you have dependency conflicts with existing packages:
# Use pip-tools for dependency management
pip install pip-tools
pip-compile requirements.in
pip-sync
# Or create a clean virtual environment
python -m venv clean-env
source clean-env/bin/activate # On Windows: clean-env\Scripts\activate
pip install openmlcrawler
Platform-Specific Issues¶
Windows¶
# Install Microsoft Visual C++ Build Tools if needed
# Download from: https://visualstudio.microsoft.com/visual-cpp-build-tools/
# Use PowerShell instead of Command Prompt for better compatibility
# Activate virtual environment
.\venv\Scripts\Activate.ps1
macOS¶
# Install Xcode command line tools
xcode-select --install
# Use Homebrew for additional dependencies
brew install libxml2 libxslt
Linux¶
# Install system dependencies
sudo apt-get update
sudo apt-get install python3-dev libxml2-dev libxslt-dev
# Or for CentOS/RHEL
sudo yum install python3-devel libxml2-devel libxslt-devel
Next Steps¶
After successful installation:
- Read the Quick Start Guide: Learn basic usage patterns
- Explore Connectors: See available data sources
- Try Tutorials: Follow step-by-step examples
- Check API Reference: Understand advanced features
Support¶
If you encounter issues during installation:
- Check the Troubleshooting Guide
- Search existing GitHub Issues
- Ask questions in GitHub Discussions
- Contact support at krishna@krishnabajpai.me