# OpenML Crawler
A unified framework for crawling and preparing ML-ready datasets from various sources including web APIs, open data portals, and custom data sources.
## Quick Start

```python
from openmlcrawler import load_dataset

# Load weather data
df = load_dataset("weather", location="Delhi", days=7)

# Load Twitter data
df = load_dataset("twitter", query="machine learning", max_results=50)

# Load government data
df = load_dataset("us_gov", query="climate change", limit=20)

print(f"Loaded {len(df)} records")
```
## Features

### Connectors (Free APIs + Curated Data Sources)
- Weather: Open-Meteo, OpenWeather, NOAA
- Social Media: Twitter/X API, Reddit API, Facebook Graph API
- Government Data: US data.gov, EU Open Data, UK data.gov.uk, Indian data.gov.in
- Finance: Yahoo Finance, Alpha Vantage, CoinGecko
- Knowledge: Wikipedia, Wikidata
- News: GNews, Reddit, HackerNews
- Health: WHO, Johns Hopkins, FDA Open Data
- Agriculture: FAO, USDA, Government open data portals
- Energy: EIA, IEA
### Generic Web Crawling
- Support for CSV, JSON, XML, HTML parsing
- PDF parsing with pdfplumber/PyPDF2
- Async crawling with aiohttp
- Headless browser mode with Playwright/Selenium
- Auto format detection (mimetype, file extension)
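As a rough illustration of how these pieces fit together (plain aiohttp and pandas, not openmlcrawler's internal API), the sketch below fetches several URLs concurrently and routes each response by its `Content-Type` header; the example URL is a placeholder.

```python
import asyncio
import io
import json

import aiohttp
import pandas as pd

async def fetch_table(session: aiohttp.ClientSession, url: str) -> pd.DataFrame:
    async with session.get(url) as resp:
        raw = await resp.read()
        ctype = resp.headers.get("Content-Type", "")
        if "json" in ctype or url.endswith(".json"):   # auto format detection
            return pd.json_normalize(json.loads(raw))
        if "csv" in ctype or url.endswith(".csv"):
            return pd.read_csv(io.BytesIO(raw))
        raise ValueError(f"Unhandled content type: {ctype!r}")

async def crawl(urls: list[str]) -> list[pd.DataFrame]:
    # One shared session, all fetches in flight at once
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_table(session, u) for u in urls))

# frames = asyncio.run(crawl(["https://example.com/data.csv"]))
```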
### Data Cleaning & Processing
- Deduplication and anomaly detection
- Missing value handling
- Auto type detection (int, float, datetime, category)
- Text cleaning (stopwords, stemming, lemmatization)
- NLP utilities: language detection, translation, NER
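A minimal sketch of what the core steps look like in plain pandas (openmlcrawler's own cleaners go further, e.g. anomaly detection and the NLP utilities above):

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                        # deduplication
    for col in df.select_dtypes("number").columns:
        df[col] = df[col].fillna(df[col].median())   # impute missing numerics
    for col in df.select_dtypes("object").columns:
        df[col] = df[col].str.strip()                # basic text cleaning
    return df.convert_dtypes()                       # auto type detection
```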
### ML-Ready Dataset Preparation
- Schema detection (features/labels)
- Feature/target separation (`X`, `y`)
- Train/validation/test split
- Normalization & encoding (optional)
- Export to CSV, JSON, Parquet
- Ready-made loaders for scikit-learn, PyTorch, TensorFlow
- Streaming mode for big data (generator-based)
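These steps map directly onto familiar scikit-learn calls; a condensed sketch, where the `target` column name is an assumption for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_ml_ready(df: pd.DataFrame, target: str = "target"):
    X, y = df.drop(columns=[target]), df[target]          # feature/target separation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42  # train/test split
    )
    scaler = StandardScaler().fit(X_train)                # optional normalization
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```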
### Advanced Data Quality & Privacy
- Data Quality Assessment: Missing data analysis, duplicate detection, outlier analysis, trust scoring
- PII Detection: Automatic detection of personal identifiable information
- Data Anonymization: Hash, mask, redact methods for privacy protection
- Compliance Checking: GDPR, HIPAA compliance validation
- Quality Scoring: Automated data quality metrics and reporting
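To make the hash/mask/redact methods concrete, here is a stand-alone sketch with hashlib and pandas; the PII column names are illustrative, not part of openmlcrawler's API.

```python
import hashlib
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # hash: replace each value with a stable pseudonym
    df["email"] = df["email"].map(
        lambda v: hashlib.sha256(str(v).encode()).hexdigest()[:16]
    )
    # mask: keep only the last four digits visible
    df["phone"] = df["phone"].astype(str).str.replace(r"\d(?=\d{4})", "*", regex=True)
    # redact: drop the column entirely
    return df.drop(columns=["ssn"], errors="ignore")
```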
### Smart Search & Discovery
- AI-Powered Search: Vector embeddings and semantic matching
- Dataset Indexing: Automatic indexing with metadata and quality metrics
- Multi-Platform Search: Kaggle, Google Dataset Search, Zenodo, DataCite integration
- Relevance Ranking: Similarity scoring and quality-based ranking
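The ranking idea in miniature: embed the query and each dataset description as vectors and sort by cosine similarity. TF-IDF stands in here for the learned embeddings, and the descriptions are made-up examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["hourly air quality measurements for Delhi",
        "global COVID-19 case counts by country",
        "daily stock prices for S&P 500 companies"]
vec = TfidfVectorizer().fit(docs)

# Score every description against the query, highest similarity first
scores = cosine_similarity(vec.transform(["air pollution india"]),
                           vec.transform(docs))[0]
for doc, s in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{s:.3f}  {doc}")
```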
### Cloud Integration
- Multi-Provider Support: AWS S3, Google Cloud Storage, Azure Blob Storage
- Unified API: Single interface for all cloud providers
- Auto-Detection: Automatic provider detection from URLs
- Batch Operations: Upload/download multiple files
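Provider auto-detection reduces to inspecting the URL scheme. A sketch with only the S3 branch filled in (boto3); the bucket and key are placeholders:

```python
from urllib.parse import urlparse

import boto3

def upload(local_path: str, url: str) -> None:
    parts = urlparse(url)
    if parts.scheme == "s3":
        # upload_file(Filename, Bucket, Key)
        boto3.client("s3").upload_file(local_path, parts.netloc,
                                       parts.path.lstrip("/"))
    elif parts.scheme == "gs":
        raise NotImplementedError("google-cloud-storage branch omitted")
    else:
        raise ValueError(f"Unknown provider scheme: {parts.scheme!r}")

# upload("data.parquet", "s3://my-bucket/datasets/data.parquet")
```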
### Workflow Orchestration
- YAML-Based Pipelines: Declarative workflow configuration
- Conditional Branching: Dynamic execution based on data conditions
- Error Handling: Robust error recovery and retry mechanisms
- Async Execution: Parallel workflow execution
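The pipeline below is a hypothetical illustration (the real configuration keys may differ); it shows the declarative shape of a workflow with a retry policy and a conditional branch, parsed with PyYAML.

```python
import yaml

# Hypothetical schema: step names and options are for illustration only
pipeline = yaml.safe_load("""
name: weather_etl
steps:
  - fetch: {source: weather, location: Delhi, days: 7}
    retry: {attempts: 3, backoff_seconds: 5}      # error handling
  - clean: {dedupe: true, fill_missing: median}
  - when: "row_count > 0"                         # conditional branching
    export: {format: parquet, path: out/weather.parquet}
""")
print(pipeline["name"], "->", [list(step)[0] for step in pipeline["steps"]])
```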
### Active Learning & Sampling
- Intelligent Sampling: Diversity, uncertainty, anomaly-based sampling
- Stratified Sampling: Maintain class/label distributions
- Quality-Based Sampling: Focus on data that improves quality
- Active Learning: Iterative model improvement through targeted sampling
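Uncertainty-based sampling in its simplest form (a generic sketch, not the library's sampler): score a pool of unlabeled rows by the model's prediction margin and label the least confident ones first. Any classifier exposing `predict_proba` works.

```python
import numpy as np

def most_uncertain(model, X_pool: np.ndarray, k: int = 10) -> np.ndarray:
    proba = np.sort(model.predict_proba(X_pool), axis=1)
    margin = proba[:, -1] - proba[:, -2]   # gap between the top two classes
    return np.argsort(margin)[:k]          # smallest margin = most uncertain

# idx = most_uncertain(fitted_classifier, X_unlabeled)
# -> send X_unlabeled[idx] to annotators, retrain, repeat
```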
### Distributed Processing
- Ray Integration: Distributed computing with Ray
- Dask Support: Large dataset processing with Dask
- Parallel Pipelines: Concurrent data processing
- Scalable Loading: Memory-efficient large file processing
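With Dask, the same dataframe idioms scale past memory; a sketch in which the file pattern and column names are placeholders:

```python
import dask.dataframe as dd

df = dd.read_csv("measurements-*.csv")        # many files, many partitions
daily = df.groupby("station")["pm25"].mean()  # builds a lazy task graph
print(daily.compute())                        # triggers parallel execution
```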
### ML Pipeline Integration
- AutoML: Automated model selection and training
- Feature Store: Centralized feature management
- ML Data Preparation: One-click ML-ready data preparation
- Model Evaluation: Automated model performance assessment
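AutoML reduced to its core loop, as a stand-in sketch rather than the library's own module: cross-validate a few candidate models and keep the winner.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pick_model(X, y):
    candidates = [LogisticRegression(max_iter=1000), RandomForestClassifier()]
    # Mean 5-fold CV accuracy for each candidate
    scored = [(cross_val_score(m, X, y, cv=5).mean(), m) for m in candidates]
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    return best_model.fit(X, y), best_score
```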
### Developer & User Tools
- CLI tool (`openmlcrawler fetch ...`)
- Config-driven pipelines (YAML/JSON configs)
- Local caching system
- Rate-limit + retry handling
- Logging + progress bars
- Dataset search: `search_open_data("air quality")`
## Installation

=== "From PyPI"

    ```bash
    pip install openmlcrawler
    ```

=== "From Source"

    ```bash
    git clone https://github.com/krish567366/openmlcrawler.git
    cd openmlcrawler
    pip install -e .
    ```
## Use Cases

### Data Science & ML Research
- Dataset Discovery: Find and load datasets from multiple sources
- Data Preparation: Clean and prepare data for ML pipelines
- Experiment Tracking: Monitor data quality and lineage
- Reproducible Research: Share data processing workflows
### Business Intelligence
- Market Research: Social media sentiment analysis
- Competitor Analysis: Web scraping and data aggregation
- Customer Insights: Survey data processing and analysis
- Trend Analysis: Time series data from various sources
### Academic Research
- Data Collection: Automated data gathering from APIs
- Data Integration: Combine data from multiple government sources
- Quality Assurance: Automated data validation and cleaning
- Publication Ready: Generate clean, well-documented datasets
## Architecture

```mermaid
graph TB
    A[Data Sources] --> B[Connectors]
    B --> C[Crawlers]
    C --> D[Parsers]
    D --> E[Cleaners]
    E --> F[Validators]
    F --> G[Exporters]
    H[CLI] --> B
    I[Web UI] --> B
    J[API] --> B
    K[Cloud Storage] --> G
    L[Databases] --> G
    M[ML Frameworks] --> G
```
## Contributing
We welcome contributions! Please see our Contributing Guide for details.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support
- Documentation
- Issue Tracker
- Discussions
- Email Support
Built with ❤️ by Krishna Bajpai & Vedanshi Gupta