
OpenML Crawler


A unified framework for crawling and preparing ML-ready datasets from web APIs, open data portals, and custom data sources.

🚀 Quick Start

```python
from openmlcrawler import load_dataset

# Load weather data
df = load_dataset("weather", location="Delhi", days=7)

# Load Twitter data
df = load_dataset("twitter", query="machine learning", max_results=50)

# Load government data
df = load_dataset("us_gov", query="climate change", limit=20)

print(f"Loaded {len(df)} records")
```

✨ Features

🔌 Connectors (Free APIs + Curated Data Sources)

  • Weather: Open-Meteo, OpenWeather, NOAA
  • Social Media: Twitter/X API, Reddit API, Facebook Graph API
  • Government Data: US data.gov, EU Open Data, UK data.gov.uk, Indian data.gov.in
  • Finance: Yahoo Finance, Alpha Vantage, CoinGecko
  • Knowledge: Wikipedia, Wikidata
  • News: GNews, Reddit, HackerNews
  • Health: WHO, Johns Hopkins, FDA Open Data
  • Agriculture: FAO, USDA, Government open data portals
  • Energy: EIA, IEA

๐Ÿ•ท๏ธ Generic Web Crawling

  • Support for CSV, JSON, XML, HTML parsing
  • PDF parsing with pdfplumber/PyPDF2
  • Async crawling with aiohttp
  • Headless browser mode with Playwright/Selenium
  • Auto format detection (mimetype, file extension)
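
To make the auto-detection idea concrete, here is a minimal standalone sketch (not the openmlcrawler implementation) that branches on the response's MIME type or file extension before choosing a parser:

```python
# Illustrative only: route a fetched payload to the right pandas parser
# based on Content-Type or file extension.
import io
import json
import mimetypes

import pandas as pd
import requests

def fetch_table(url: str) -> pd.DataFrame:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    fmt = resp.headers.get("Content-Type", "") or (mimetypes.guess_type(url)[0] or "")
    if "csv" in fmt or url.endswith(".csv"):
        return pd.read_csv(io.StringIO(resp.text))
    if "json" in fmt or url.endswith(".json"):
        return pd.json_normalize(json.loads(resp.text))
    if "xml" in fmt or url.endswith(".xml"):
        return pd.read_xml(io.StringIO(resp.text))
    return pd.read_html(io.StringIO(resp.text))[0]  # fall back to HTML tables
```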

🧹 Data Cleaning & Processing

  • Deduplication and anomaly detection
  • Missing value handling
  • Auto type detection (int, float, datetime, category)
  • Text cleaning (stopwords, stemming, lemmatization)
  • NLP utilities: language detection, translation, NER
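
The cleaning steps above can be approximated with plain pandas; the sketch below is illustrative and not the library's internal code:

```python
# Standalone sketch of deduplication, type detection, and missing-value
# handling using plain pandas.
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()          # deduplication
    df = df.convert_dtypes()           # auto type detection (int, float, string)
    # Promote string columns that are mostly parseable dates to datetime
    for col in df.columns:
        if df[col].dtype == "string":
            parsed = pd.to_datetime(df[col], errors="coerce")
            if parsed.notna().mean() > 0.9:
                df[col] = parsed
    # Simple missing-value handling: fill numeric gaps with the column median
    for col in df.select_dtypes("number").columns:
        df[col] = df[col].fillna(df[col].median())
    return df
```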

🤖 ML-Ready Dataset Preparation

  • Schema detection (features/labels)
  • Feature/target separation (X, y)
  • Train/validation/test split
  • Normalization & encoding (optional)
  • Export to CSV, JSON, Parquet
  • Ready-made loaders for scikit-learn, PyTorch, TensorFlow
  • Streaming mode for big data (generator-based)
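
A minimal sketch of the preparation flow with scikit-learn, assuming a cleaned DataFrame `df` with a hypothetical `target` label column:

```python
# Feature/target separation, split, and optional normalization with scikit-learn.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["target"])   # "target" is a hypothetical label column
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()         # optional normalization
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```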

🔒 Advanced Data Quality & Privacy

  • Data Quality Assessment: Missing data analysis, duplicate detection, outlier analysis, trust scoring
  • PII Detection: Automatic detection of personally identifiable information
  • Data Anonymization: Hash, mask, redact methods for privacy protection
  • Compliance Checking: GDPR, HIPAA compliance validation
  • Quality Scoring: Automated data quality metrics and reporting
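
The hash and mask strategies can be illustrated with a few lines of plain Python; the column names below are hypothetical and this is not the library's own anonymizer:

```python
# Illustrative hash/mask anonymization helpers.
import hashlib

import pandas as pd

def hash_column(series: pd.Series) -> pd.Series:
    """Replace each value with a stable SHA-256 digest."""
    return series.astype(str).map(
        lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()
    )

def mask_column(series: pd.Series, visible: int = 4) -> pd.Series:
    """Mask all but the last `visible` characters."""
    return series.astype(str).map(
        lambda v: "*" * max(len(v) - visible, 0) + v[-visible:]
    )

# Hypothetical columns, for illustration only
df["email"] = hash_column(df["email"])
df["phone"] = mask_column(df["phone"])
```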

๐Ÿ” Smart Search & Discovery

  • AI-Powered Search: Vector embeddings and semantic matching
  • Dataset Indexing: Automatic indexing with metadata and quality metrics
  • Multi-Platform Search: Kaggle, Google Dataset Search, Zenodo, DataCite integration
  • Relevance Ranking: Similarity scoring and quality-based ranking
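
As a rough sketch of relevance ranking, assuming dataset descriptions have already been embedded as vectors, ranking can blend cosine similarity with a quality score:

```python
# Illustrative ranking: cosine similarity over precomputed embeddings,
# blended with a per-dataset quality score.
import numpy as np

def rank_datasets(query_vec: np.ndarray, dataset_vecs: np.ndarray,
                  quality: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Return dataset indices sorted best-first."""
    sims = dataset_vecs @ query_vec / (
        np.linalg.norm(dataset_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    score = alpha * sims + (1 - alpha) * quality
    return np.argsort(score)[::-1]
```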

โ˜๏ธ Cloud Integration

  • Multi-Provider Support: AWS S3, Google Cloud Storage, Azure Blob Storage
  • Unified API: Single interface for all cloud providers
  • Auto-Detection: Automatic provider detection from URLs
  • Batch Operations: Upload/download multiple files
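
Provider auto-detection typically keys off the URL scheme; a minimal sketch (the real implementation may handle more cases):

```python
# Map a storage URL to a cloud provider from its scheme.
from urllib.parse import urlparse

def detect_provider(url: str) -> str:
    scheme = urlparse(url).scheme
    if scheme == "s3":
        return "aws"
    if scheme == "gs":
        return "gcp"
    if scheme in ("az", "abfs", "abfss") or "blob.core.windows.net" in url:
        return "azure"
    raise ValueError(f"Unknown storage provider for {url!r}")

detect_provider("s3://my-bucket/datasets/weather.parquet")  # -> "aws"
```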

โš™๏ธ Workflow Orchestration

  • YAML-Based Pipelines: Declarative workflow configuration
  • Conditional Branching: Dynamic execution based on data conditions
  • Error Handling: Robust error recovery and retry mechanisms
  • Async Execution: Parallel workflow execution
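
The YAML below shows a hypothetical pipeline shape, parsed with PyYAML for illustration; the actual schema accepted by openmlcrawler may differ:

```python
# Hypothetical YAML pipeline definition, parsed for illustration only.
import yaml

pipeline_yaml = """
name: weather_pipeline
steps:
  - fetch:
      source: weather
      location: Delhi
      days: 7
  - clean:
      drop_duplicates: true
      fill_missing: median
  - export:
      format: parquet
      path: out/weather.parquet
retry:
  max_attempts: 3
"""

config = yaml.safe_load(pipeline_yaml)
print(config["name"], "->", [list(step)[0] for step in config["steps"]])
```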

🎯 Active Learning & Sampling

  • Intelligent Sampling: Diversity, uncertainty, anomaly-based sampling
  • Stratified Sampling: Maintain class/label distributions
  • Quality-Based Sampling: Prioritize records that most improve overall dataset quality
  • Active Learning: Iterative model improvement through targeted sampling
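
A small sketch of uncertainty (margin) sampling, assuming class probabilities from any trained model:

```python
# Pick the unlabeled rows whose top two class probabilities are closest,
# i.e. where the model is least certain.
import numpy as np

def uncertainty_sample(proba: np.ndarray, k: int = 100) -> np.ndarray:
    """proba: (n_samples, n_classes) predicted probabilities."""
    sorted_p = np.sort(proba, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margin)[:k]   # smallest margin = most uncertain
```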

🚀 Distributed Processing

  • Ray Integration: Distributed computing with Ray
  • Dask Support: Large dataset processing with Dask
  • Parallel Pipelines: Concurrent data processing
  • Scalable Loading: Memory-efficient large file processing
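
For example, Dask can process partitioned files lazily so nothing has to fit in memory at once (generic sketch, not tied to openmlcrawler):

```python
# Lazy, partitioned processing of many CSV files with Dask.
import dask.dataframe as dd

ddf = dd.read_csv("data/part-*.csv")         # lazy, partitioned load
daily = ddf.groupby("date")["value"].mean()  # parallel aggregation
print(daily.compute().head())
```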

🧠 ML Pipeline Integration

  • AutoML: Automated model selection and training
  • Feature Store: Centralized feature management
  • ML Data Preparation: One-click ML-ready data preparation
  • Model Evaluation: Automated model performance assessment
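
As a stand-in for the AutoML idea (not the openmlcrawler implementation), the sketch below picks the better of two candidate models by cross-validation, reusing the hypothetical X_train/y_train split from the earlier example:

```python
# Simple automated model selection via cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200),
}
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(f"Best model: {best} ({scores[best]:.3f} CV accuracy)")
```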

๐Ÿ› ๏ธ Developer & User Tools

  • CLI tool (openmlcrawler fetch ...)
  • Config-driven pipelines (YAML/JSON configs)
  • Local caching system
  • Rate-limit + retry handling
  • Logging + progress bars
  • Dataset search: search_open_data("air quality")
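
A quick usage sketch for dataset search; the import path and return shape are assumptions, so adjust to whatever the installed version actually exposes:

```python
# Assumed import path; the return value is treated as an iterable of records.
from openmlcrawler import search_open_data

results = search_open_data("air quality")
for r in results:
    print(r)
```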

📦 Installation

=== "From PyPI"

    ```bash
    pip install openmlcrawler
    ```

=== "From Source"

    ```bash
    git clone https://github.com/krish567366/openmlcrawler.git
    cd openmlcrawler
    pip install -e .

    # With development extras
    pip install -e ".[dev]"
    ```

🎯 Use Cases

Data Science & ML Research

  • Dataset Discovery: Find and load datasets from multiple sources
  • Data Preparation: Clean and prepare data for ML pipelines
  • Experiment Tracking: Monitor data quality and lineage
  • Reproducible Research: Share data processing workflows

Business Intelligence

  • Market Research: Social media sentiment analysis
  • Competitor Analysis: Web scraping and data aggregation
  • Customer Insights: Survey data processing and analysis
  • Trend Analysis: Time series data from various sources

Academic Research

  • Data Collection: Automated data gathering from APIs
  • Data Integration: Combine data from multiple government sources
  • Quality Assurance: Automated data validation and cleaning
  • Publication Ready: Generate clean, well-documented datasets

๐Ÿ—๏ธ Architecture

```mermaid
graph TB
    A[Data Sources] --> B[Connectors]
    B --> C[Crawlers]
    C --> D[Parsers]
    D --> E[Cleaners]
    E --> F[Validators]
    F --> G[Exporters]

    H[CLI] --> B
    I[Web UI] --> B
    J[API] --> B

    K[Cloud Storage] --> G
    L[Databases] --> G
    M[ML Frameworks] --> G
```

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support


Built with ❤️ by Krishna Bajpai & Vedanshi Gupta