# OpenML Crawler
A unified framework for crawling and preparing ML-ready datasets from various sources including web APIs, open data portals, and custom data sources.
## Quick Start

```python
from openmlcrawler import load_dataset

# Load weather data
df = load_dataset("weather", location="Delhi", days=7)

# Load Twitter data
df = load_dataset("twitter", query="machine learning", max_results=50)

# Load government data
df = load_dataset("us_gov", query="climate change", limit=20)

print(f"Loaded {len(df)} records")
```
## Features

### Connectors (Free APIs + Curated Data Sources)
- Weather: Open-Meteo, OpenWeather, NOAA
- Social Media: Twitter/X API, Reddit API, Facebook Graph API
- Government Data: US data.gov, EU Open Data, UK data.gov.uk, Indian data.gov.in
- Finance: Yahoo Finance, Alpha Vantage, CoinGecko
- Knowledge: Wikipedia, Wikidata
- News: GNews, Reddit, HackerNews
- Health: WHO, Johns Hopkins, FDA Open Data
- Agriculture: FAO, USDA, Government open data portals
- Energy: EIA, IEA
### Generic Web Crawling
- Support for CSV, JSON, XML, HTML parsing
- PDF parsing with pdfplumber/PyPDF2
- Async crawling with aiohttp
- Headless browser mode with Playwright/Selenium
- Auto format detection (mimetype, file extension)
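As a rough illustration of how these pieces fit together (plain aiohttp and pandas, not openmlcrawler's internal API), the sketch below fetches several URLs concurrently and routes each response by its `Content-Type` header; the example URL is a placeholder.

```python
import asyncio
import io
import json

import aiohttp
import pandas as pd

async def fetch_table(session: aiohttp.ClientSession, url: str) -> pd.DataFrame:
    async with session.get(url) as resp:
        raw = await resp.read()
        ctype = resp.headers.get("Content-Type", "")
        if "json" in ctype or url.endswith(".json"):   # auto format detection
            return pd.json_normalize(json.loads(raw))
        if "csv" in ctype or url.endswith(".csv"):
            return pd.read_csv(io.BytesIO(raw))
        raise ValueError(f"Unhandled content type: {ctype!r}")

async def crawl(urls: list[str]) -> list[pd.DataFrame]:
    # One shared session, all fetches in flight at once
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_table(session, u) for u in urls))

# frames = asyncio.run(crawl(["https://example.com/data.csv"]))
```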
### Data Cleaning & Processing
- Deduplication and anomaly detection
- Missing value handling
- Auto type detection (int, float, datetime, category)
- Text cleaning (stopwords, stemming, lemmatization)
- NLP utilities: language detection, translation, NER
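A minimal sketch of what the core steps look like in plain pandas (openmlcrawler's own cleaners go further, e.g. anomaly detection and the NLP utilities above):

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                        # deduplication
    for col in df.select_dtypes("number").columns:
        df[col] = df[col].fillna(df[col].median())   # impute missing numerics
    for col in df.select_dtypes("object").columns:
        df[col] = df[col].str.strip()                # basic text cleaning
    return df.convert_dtypes()                       # auto type detection
```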
### ML-Ready Dataset Preparation
- Schema detection (features/labels)
- Feature/target separation (`X`, `y`)
- Train/validation/test split
- Normalization & encoding (optional)
- Export to CSV, JSON, Parquet
- Ready-made loaders for scikit-learn, PyTorch, TensorFlow
- Streaming mode for big data (generator-based)
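These steps map directly onto familiar scikit-learn calls; a condensed sketch, where the `target` column name is an assumption for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_ml_ready(df: pd.DataFrame, target: str = "target"):
    X, y = df.drop(columns=[target]), df[target]          # feature/target separation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42  # train/test split
    )
    scaler = StandardScaler().fit(X_train)                # optional normalization
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```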
### Advanced Data Quality & Privacy
- Data Quality Assessment: Missing data analysis, duplicate detection, outlier analysis, trust scoring
- PII Detection: Automatic detection of personal identifiable information
- Data Anonymization: Hash, mask, redact methods for privacy protection
- Compliance Checking: GDPR, HIPAA compliance validation
- Quality Scoring: Automated data quality metrics and reporting
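To make the hash/mask/redact methods concrete, here is a stand-alone sketch with hashlib and pandas; the PII column names are illustrative, not part of openmlcrawler's API.

```python
import hashlib
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # hash: replace each value with a stable pseudonym
    df["email"] = df["email"].map(
        lambda v: hashlib.sha256(str(v).encode()).hexdigest()[:16]
    )
    # mask: keep only the last four digits visible
    df["phone"] = df["phone"].astype(str).str.replace(r"\d(?=\d{4})", "*", regex=True)
    # redact: drop the column entirely
    return df.drop(columns=["ssn"], errors="ignore")
```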
### Smart Search & Discovery
- AI-Powered Search: Vector embeddings and semantic matching
- Dataset Indexing: Automatic indexing with metadata and quality metrics
- Multi-Platform Search: Kaggle, Google Dataset Search, Zenodo, DataCite integration
- Relevance Ranking: Similarity scoring and quality-based ranking
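The ranking idea in miniature: embed the query and each dataset description as vectors and sort by cosine similarity. TF-IDF stands in here for the learned embeddings, and the descriptions are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["hourly air quality measurements for Delhi",
        "global COVID-19 case counts by country",
        "daily stock prices for S&P 500 companies"]
vec = TfidfVectorizer().fit(docs)

# Score every description against the query, highest similarity first
scores = cosine_similarity(vec.transform(["air pollution india"]),
                           vec.transform(docs))[0]
for doc, s in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{s:.3f}  {doc}")
```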
### Cloud Integration
- Multi-Provider Support: AWS S3, Google Cloud Storage, Azure Blob Storage
- Unified API: Single interface for all cloud providers
- Auto-Detection: Automatic provider detection from URLs
- Batch Operations: Upload/download multiple files
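Provider auto-detection reduces to inspecting the URL scheme. A sketch with only the S3 branch filled in (boto3); the bucket and key are placeholders:

```python
from urllib.parse import urlparse

import boto3

def upload(local_path: str, url: str) -> None:
    parts = urlparse(url)
    if parts.scheme == "s3":
        # upload_file(Filename, Bucket, Key)
        boto3.client("s3").upload_file(local_path, parts.netloc,
                                       parts.path.lstrip("/"))
    elif parts.scheme == "gs":
        raise NotImplementedError("google-cloud-storage branch omitted")
    else:
        raise ValueError(f"Unknown provider scheme: {parts.scheme!r}")

# upload("data.parquet", "s3://my-bucket/datasets/data.parquet")
```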
### Workflow Orchestration
- YAML-Based Pipelines: Declarative workflow configuration
- Conditional Branching: Dynamic execution based on data conditions
- Error Handling: Robust error recovery and retry mechanisms
- Async Execution: Parallel workflow execution
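The pipeline below is a hypothetical illustration (the real configuration keys may differ); it shows the declarative shape of a workflow with a retry policy and a conditional branch, parsed with PyYAML.

```python
import yaml

# Hypothetical schema: step names and options are for illustration only
pipeline = yaml.safe_load("""
name: weather_etl
steps:
  - fetch: {source: weather, location: Delhi, days: 7}
    retry: {attempts: 3, backoff_seconds: 5}      # error handling
  - clean: {dedupe: true, fill_missing: median}
  - when: "row_count > 0"                         # conditional branching
    export: {format: parquet, path: out/weather.parquet}
""")
print(pipeline["name"], "->", [list(step)[0] for step in pipeline["steps"]])
```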
### Active Learning & Sampling
- Intelligent Sampling: Diversity, uncertainty, anomaly-based sampling
- Stratified Sampling: Maintain class/label distributions
- Quality-Based Sampling: Focus on data that improves quality
- Active Learning: Iterative model improvement through targeted sampling
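Uncertainty-based sampling in its simplest form (a generic sketch, not the library's sampler): score a pool of unlabeled rows by the model's prediction margin and label the least confident ones first. Any classifier exposing `predict_proba` works.

```python
import numpy as np

def most_uncertain(model, X_pool: np.ndarray, k: int = 10) -> np.ndarray:
    proba = np.sort(model.predict_proba(X_pool), axis=1)
    margin = proba[:, -1] - proba[:, -2]   # gap between the top two classes
    return np.argsort(margin)[:k]          # smallest margin = most uncertain

# idx = most_uncertain(fitted_classifier, X_unlabeled)
# -> send X_unlabeled[idx] to annotators, retrain, repeat
```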
### Distributed Processing
- Ray Integration: Distributed computing with Ray
- Dask Support: Large dataset processing with Dask
- Parallel Pipelines: Concurrent data processing
- Scalable Loading: Memory-efficient large file processing
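With Dask, the same dataframe idioms scale past memory; a sketch in which the file pattern and column names are placeholders:

```python
import dask.dataframe as dd

df = dd.read_csv("measurements-*.csv")        # many files, many partitions
daily = df.groupby("station")["pm25"].mean()  # builds a lazy task graph
print(daily.compute())                        # triggers parallel execution
```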
### ML Pipeline Integration
- AutoML: Automated model selection and training
- Feature Store: Centralized feature management
- ML Data Preparation: One-click ML-ready data preparation
- Model Evaluation: Automated model performance assessment
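AutoML reduced to its core loop, as a stand-in sketch rather than the library's own module: cross-validate a few candidate models and keep the winner.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pick_model(X, y):
    candidates = [LogisticRegression(max_iter=1000), RandomForestClassifier()]
    # Mean 5-fold CV accuracy for each candidate
    scored = [(cross_val_score(m, X, y, cv=5).mean(), m) for m in candidates]
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    return best_model.fit(X, y), best_score
```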
### Developer & User Tools
- CLI tool (`openmlcrawler fetch ...`)
- Config-driven pipelines (YAML/JSON configs)
- Local caching system
- Rate-limit + retry handling
- Logging + progress bars
- Dataset search: `search_open_data("air quality")`
## Installation

=== "From PyPI"

    ```bash
    pip install openmlcrawler
    ```

=== "From Source"

    ```bash
    git clone https://github.com/krish567366/openmlcrawler.git
    cd openmlcrawler
    pip install -e .
    ```
## Use Cases

### Data Science & ML Research
- Dataset Discovery: Find and load datasets from multiple sources
- Data Preparation: Clean and prepare data for ML pipelines
- Experiment Tracking: Monitor data quality and lineage
- Reproducible Research: Share data processing workflows
### Business Intelligence
- Market Research: Social media sentiment analysis
- Competitor Analysis: Web scraping and data aggregation
- Customer Insights: Survey data processing and analysis
- Trend Analysis: Time series data from various sources
### Academic Research
- Data Collection: Automated data gathering from APIs
- Data Integration: Combine data from multiple government sources
- Quality Assurance: Automated data validation and cleaning
- Publication Ready: Generate clean, well-documented datasets
## Architecture

```mermaid
graph TB
    A[Data Sources] --> B[Connectors]
    B --> C[Crawlers]
    C --> D[Parsers]
    D --> E[Cleaners]
    E --> F[Validators]
    F --> G[Exporters]
    H[CLI] --> B
    I[Web UI] --> B
    J[API] --> B
    K[Cloud Storage] --> G
    L[Databases] --> G
    M[ML Frameworks] --> G
```
## Contributing
We welcome contributions! Please see our Contributing Guide for details.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support
- Documentation
- Issue Tracker
- Discussions
- Email Support
Built with ❤️ by Krishna Bajpai & Vedanshi Gupta