Health Data Connectors¶
OpenML Crawler provides specialized health data connectors that access medical, clinical, and public health data from authoritative sources such as the CDC, WHO, PubMed, and ClinicalTrials.gov. Because some of this data is sensitive, the connectors include privacy controls such as de-identification, audit logging, and consent management.
Supported Health Data Sources¶
CDC (Centers for Disease Control and Prevention)¶
Official US public health data from the Centers for Disease Control and Prevention.
Features:
- Disease surveillance data
- Vaccination statistics
- Health outcome metrics
- Demographic health data
- Environmental health indicators
- Injury and violence data
- Chronic disease statistics
Usage:
from openmlcrawler.connectors.health import CDCConnector

connector = CDCConnector()

# Get disease surveillance data
surveillance_data = connector.get_disease_data(
    disease="COVID-19",
    geography="United States",
    date_range=("2020-01-01", "2023-12-31")
)

# Get vaccination statistics
vaccination_data = connector.get_vaccination_data(
    vaccine_type="COVID-19",
    geography="states",
    date_range=("2021-01-01", "2023-12-31")
)

# Get health outcome metrics
outcomes = connector.get_health_outcomes(
    indicators=["mortality", "morbidity"],
    demographics=["age", "gender", "race"]
)
WHO (World Health Organization)¶
Global health data and statistics from the World Health Organization.
Features:
- Global health indicators
- Disease outbreak data
- Health system performance
- Mortality and morbidity statistics
- Environmental health data
- Health workforce data
- Universal health coverage metrics
Configuration:
connectors:
  health:
    who:
      base_url: "https://ghoapi.who.int/api"
      format: "json"
      language: "en"
      cache_enabled: true
Usage:
from openmlcrawler.connectors.health import WHOConnector

connector = WHOConnector()

# Get global health indicators
indicators = connector.get_health_indicators(
    indicator_codes=["WHOSIS_000001", "WHOSIS_000002"],
    countries=["USA", "GBR", "DEU"],
    year_range=(2015, 2023)
)

# Get disease outbreak data
outbreaks = connector.get_outbreak_data(
    disease="Ebola",
    region="AFRICA",
    date_range=("2014-01-01", "2023-12-31")
)

# Get health system data
health_systems = connector.get_health_system_data(
    countries=["USA", "CAN", "GBR"],
    indicators=["health_expenditure", "physicians_per_1000"]
)
PubMed¶
Medical research literature from the National Library of Medicine's PubMed database.
Features:
- Medical research articles
- Clinical trial data
- Systematic reviews
- Meta-analyses
- Case studies
- Medical guidelines
- Drug information
Usage:
from openmlcrawler.connectors.health import PubMedConnector

connector = PubMedConnector(api_key="your_key", email="your_email")

# Search medical literature
articles = connector.search_articles(
    query="machine learning AND medical diagnosis",
    max_results=100,
    date_range=("2020/01/01", "2023/12/31")
)

# Get article details
article_details = connector.get_article_details(
    pmids=["12345678", "87654321"]
)

# Search clinical trials
trials = connector.search_clinical_trials(
    condition="diabetes",
    intervention="machine learning",
    status="completed"
)

# Get systematic reviews
reviews = connector.get_systematic_reviews(
    topic="AI in healthcare",
    max_results=50
)
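The source does not show how the connector talks to PubMed. As a rough illustration only, the sketch below shows how the search arguments above could map onto a request to the NCBI E-utilities ESearch endpoint. The helper name `build_esearch_url` is hypothetical and not part of OpenML Crawler; the sketch only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, max_results=100, date_range=None,
                      api_key=None, email=None, tool="OpenMLCrawler"):
    """Build a PubMed ESearch URL (hypothetical helper, not the library's API)."""
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
        "tool": tool,
    }
    if date_range is not None:
        # E-utilities takes YYYY/MM/DD bounds; mindate and maxdate go together.
        params["mindate"], params["maxdate"] = date_range
        params["datetype"] = "pdat"  # filter on publication date
    if email:
        params["email"] = email
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url(
    "machine learning AND medical diagnosis",
    max_results=100,
    date_range=("2020/01/01", "2023/12/31"),
)
print(url)
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect and test before any request counts against the API rate limit.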
ClinicalTrials.gov¶
Clinical trial registry maintained by the US National Library of Medicine.
Features:
- Clinical trial protocols
- Trial status and results
- Patient recruitment data
- Study designs and outcomes
- Investigator information
- Sponsor details
- Regulatory information
Usage:
from openmlcrawler.connectors.health import ClinicalTrialsConnector

connector = ClinicalTrialsConnector()

# Search clinical trials
trials = connector.search_trials(
    condition="cancer",
    intervention="immunotherapy",
    status="Recruiting",
    max_results=100
)

# Get trial details
trial_details = connector.get_trial_details(
    nct_ids=["NCT12345678", "NCT87654321"]
)

# Get recruitment data
recruitment = connector.get_recruitment_data(
    condition="COVID-19",
    location="United States"
)

# Search by sponsor
sponsor_trials = connector.search_by_sponsor(
    sponsor="Pfizer",
    condition="COVID-19"
)
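For readers who want to see what such a search might look like at the HTTP level, here is a minimal sketch that builds a query URL for the ClinicalTrials.gov v2 `studies` endpoint. It rests on assumptions: the source does not say which API version the connector uses, `build_studies_url` is a hypothetical helper, and the mapping of `"Recruiting"` to the upper-case `RECRUITING` status value is a guess at how the connector might normalize it.

```python
from urllib.parse import urlencode

STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_studies_url(condition=None, intervention=None, sponsor=None,
                      status=None, max_results=100):
    """Build a ClinicalTrials.gov v2 search URL (hypothetical helper)."""
    params = {}
    if condition:
        params["query.cond"] = condition
    if intervention:
        params["query.intr"] = intervention
    if sponsor:
        params["query.spons"] = sponsor
    if status:
        # Assumed normalization: "Not yet recruiting" -> "NOT_YET_RECRUITING".
        params["filter.overallStatus"] = status.upper().replace(" ", "_")
    # Cap the page size at 1000, matching max_results in the settings below.
    params["pageSize"] = min(max_results, 1000)
    return f"{STUDIES_URL}?{urlencode(params)}"


url = build_studies_url(
    condition="cancer",
    intervention="immunotherapy",
    status="Recruiting",
    max_results=100,
)
print(url)
```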
Data Types and Categories¶
Public Health Data¶
| Category | Description | Sources | Update Frequency |
|---|---|---|---|
| Disease Surveillance | Outbreak tracking and monitoring | CDC, WHO | Daily/Weekly |
| Vital Statistics | Births, deaths, life expectancy | CDC, WHO | Monthly/Yearly |
| Health Behaviors | Smoking, diet, exercise patterns | CDC | Annual |
| Health Outcomes | Disease incidence and prevalence | CDC, WHO | Quarterly |
| Healthcare Access | Insurance coverage, provider access | CDC | Annual |
| Environmental Health | Air/water quality health impacts | CDC, WHO | Monthly |
Clinical Research Data¶
- Clinical Trials: Study protocols, results, and outcomes
- Medical Literature: Research articles, reviews, and meta-analyses
- Drug Information: Medication data, interactions, and approvals
- Medical Devices: Device approvals, recalls, and safety data
- Healthcare Quality: Hospital performance and patient outcomes
- Health Economics: Cost-effectiveness and resource allocation
Genomic and Biomarker Data¶
- Genetic Data: Gene variants and associations
- Biomarker Data: Diagnostic and prognostic markers
- Pharmacogenomics: Genetic factors in drug response
- Rare Diseases: Orphan disease data and research
- Population Genomics: Large-scale genetic studies
Data Collection Strategies¶
Public Health Surveillance¶
from openmlcrawler.connectors.health import HealthSurveillanceCollector

collector = HealthSurveillanceCollector()

# Collect disease surveillance data
surveillance_data = collector.collect_surveillance_data(
    diseases=["COVID-19", "Influenza", "Ebola"],
    regions=["GLOBAL", "AMERICA", "EUROPE"],
    date_range=("2020-01-01", "2023-12-31")
)

# Monitor health indicators
indicators = collector.monitor_health_indicators(
    indicators=["mortality_rate", "vaccination_coverage"],
    countries=["USA", "CAN", "MEX"],
    frequency="weekly"
)

# Generate health reports
reports = collector.generate_health_reports(
    data=surveillance_data,
    report_types=["epidemiological", "trends", "forecasts"]
)
Collecting Clinical Research Data¶
from openmlcrawler.connectors.health import ClinicalResearchCollector

collector = ClinicalResearchCollector()

# Collect clinical trial data
trial_data = collector.collect_trial_data(
    conditions=["cancer", "diabetes", "cardiovascular"],
    phases=["Phase 2", "Phase 3", "Phase 4"],
    status=["Completed", "Recruiting"],
    max_results=1000
)

# Collect medical literature
literature = collector.collect_medical_literature(
    topics=["machine learning", "AI", "deep learning"],
    publication_types=["systematic review", "randomized trial"],
    date_range=("2020-01-01", "2023-12-31")
)

# Analyze research trends
trends = collector.analyze_research_trends(
    data=literature,
    analysis_types=["publication_trends", "collaboration_networks"]
)
Batch Processing¶
from openmlcrawler.connectors.health import BatchHealthProcessor

processor = BatchHealthProcessor()

# Process multiple health data sources
results = processor.process_batch(
    sources=["cdc", "who", "pubmed"],
    categories=["surveillance", "clinical_trials", "literature"],
    date_range=("2020-01-01", "2023-12-31"),
    output_format="parquet"
)
Privacy and Compliance¶
HIPAA Compliance¶
Health Insurance Portability and Accountability Act compliance for US health data.
from openmlcrawler.connectors.health import HIPAACompliantConnector

connector = HIPAACompliantConnector(
    enable_deidentification=True,
    audit_logging=True,
    access_controls=True
)

# Collect HIPAA-compliant data
compliant_data = connector.collect_hipaa_compliant(
    data_sources=["clinical_trials", "medical_records"],
    deidentification_level="safe_harbor",
    purpose="research"
)

# Audit data access
audit_log = connector.get_audit_log(
    date_range=("2023-01-01", "2023-12-31"),
    user_id="researcher_123"
)
GDPR Compliance¶
General Data Protection Regulation compliance for EU health data.
from openmlcrawler.connectors.health import GDPRCompliantConnector

connector = GDPRCompliantConnector(
    enable_consent_management=True,
    data_minimization=True,
    retention_limits=True
)

# Collect GDPR-compliant data
gdpr_data = connector.collect_gdpr_compliant(
    data_sources=["who", "eu_health_systems"],
    consent_verified=True,
    processing_purpose="public_health_research"
)

# Manage data subject rights
rights_response = connector.handle_data_subject_request(
    request_type="access",
    subject_id="patient_456",
    justification="medical_research"
)
Data De-identification¶
from openmlcrawler.connectors.health import HealthDataDeidentifier

deidentifier = HealthDataDeidentifier()

# health_data: a dataset collected earlier with one of the connectors above

# De-identify health data
deidentified_data = deidentifier.deidentify_data(
    data=health_data,
    methods=[
        "remove_direct_identifiers",
        "generalize_dates",
        "mask_sensitive_values"
    ],
    risk_threshold="very_low"
)

# Validate de-identification
validation = deidentifier.validate_deidentification(
    original_data=health_data,
    deidentified_data=deidentified_data,
    reidentification_risk_threshold=0.0001
)
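To make the three method names above more concrete, the following sketch shows one way such transformations could look on a single record, loosely following the HIPAA Safe Harbor rules (dates reduced to the year, ages over 89 grouped, ZIP codes truncated to three digits). It is an illustration under stated assumptions, not the library's implementation: the field names, the `DIRECT_IDENTIFIERS` set, and the sample record are all made up, and the Safe Harbor population check on three-digit ZIP areas is omitted.

```python
from datetime import date

# Hypothetical set of direct identifiers; a real list would be longer.
DIRECT_IDENTIFIERS = {"name", "ssn", "email", "phone", "address"}


def deidentify_record(record):
    """Return a de-identified copy of one record (illustrative sketch)."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # remove direct identifiers entirely
        if isinstance(value, date):
            out[field] = value.year  # generalize dates to the year
        elif field == "age" and value > 89:
            out[field] = "90+"  # group ages over 89 into one category
        elif field == "zip_code":
            out[field] = str(value)[:3] + "**"  # keep only the first 3 digits
        else:
            out[field] = value
    return out


# Made-up sample record for demonstration only.
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "admitted": date(2023, 3, 14),
    "age": 93,
    "zip_code": "94110",
    "diagnosis": "J10.1",
}
clean = deidentify_record(record)
print(clean)
```

Returning a new dictionary rather than mutating the input keeps the original record available for the kind of before/after validation shown above.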
Data Quality and Validation¶
Quality Assurance¶
- Source Verification: Validate data source authenticity and authority
- Data Completeness: Check for missing values and data gaps
- Temporal Consistency: Verify chronological data ordering
- Cross-Source Validation: Compare data across multiple authoritative sources
- Statistical Validation: Check for outliers and data anomalies
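Two of the checks listed above, data completeness and temporal consistency, are simple enough to sketch in plain Python. The function names, the `report_date` field, and the sample rows are hypothetical and only illustrate the idea; they are not taken from the library.

```python
from datetime import date


def check_completeness(rows, required_fields):
    """Fraction of rows with a non-missing value in every required field."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) is not None for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def check_temporal_order(rows, date_field="report_date"):
    """True if the rows are in non-decreasing date order."""
    dates = [row[date_field] for row in rows]
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))


# Made-up weekly surveillance rows; one has a missing case count.
rows = [
    {"report_date": date(2023, 1, 1), "cases": 10},
    {"report_date": date(2023, 1, 8), "cases": None},
    {"report_date": date(2023, 1, 15), "cases": 12},
    {"report_date": date(2023, 1, 22), "cases": 9},
]

completeness = check_completeness(rows, ["report_date", "cases"])
ordered = check_temporal_order(rows)
print(completeness, ordered)
```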
Validation Framework¶
from openmlcrawler.connectors.health import HealthDataValidator

validator = HealthDataValidator()

# health_data: a dataset collected earlier with one of the connectors above

# Validate health data
validation_result = validator.validate_health_data(
    data=health_data,
    checks=[
        "source_authority",
        "data_completeness",
        "temporal_consistency",
        "statistical_validity",
        "privacy_compliance"
    ]
)

# Generate quality report
quality_report = validator.generate_quality_report(
    validation_results=validation_result,
    include_recommendations=True,
    compliance_frameworks=["HIPAA", "GDPR"]
)
Integration with ML Pipelines¶
Disease Prediction Models¶
from openmlcrawler.connectors.health import DiseasePredictionPipeline

pipeline = DiseasePredictionPipeline()

# surveillance_data: historical data, e.g. from HealthSurveillanceCollector
# current_data, validation_data: recent and held-out datasets you supply

# Build prediction model
model = pipeline.build_prediction_model(
    training_data=surveillance_data,
    target_diseases=["COVID-19", "Influenza"],
    features=[
        "demographic_data",
        "environmental_factors",
        "healthcare_capacity",
        "vaccination_rates"
    ],
    model_type="xgboost"
)

# Generate predictions
predictions = pipeline.generate_predictions(
    model=model,
    input_data=current_data,
    prediction_horizon=30,  # days
    confidence_intervals=True
)

# Validate predictions
validation = pipeline.validate_predictions(
    predictions=predictions,
    actual_data=validation_data,
    metrics=["accuracy", "precision", "recall", "auc"]
)
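As a reminder of what three of the metrics named above measure, here is a small self-contained sketch that computes accuracy, precision, and recall for binary labels (for example, 1 = outbreak, 0 = no outbreak). The function and the sample labels are illustrative assumptions, not the pipeline's code. AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(actual, predicted):
    """Accuracy, precision, and recall for binary 0/1 labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for eight regions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

For outbreak detection, recall is often the metric to watch most closely, since a missed outbreak (false negative) usually costs more than a false alarm.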
Clinical Trial Analysis¶
from openmlcrawler.connectors.health import ClinicalTrialAnalyzer

analyzer = ClinicalTrialAnalyzer()

# clinical_trials: trial records collected earlier, e.g. via ClinicalTrialsConnector

# Analyze trial outcomes
outcomes = analyzer.analyze_trial_outcomes(
    trial_data=clinical_trials,
    outcome_measures=[
        "efficacy",
        "safety",
        "adverse_events",
        "patient_reported_outcomes"
    ]
)

# Identify successful interventions
successful_trials = analyzer.identify_successful_interventions(
    trials=outcomes,
    success_criteria={
        "efficacy_threshold": 0.7,
        "safety_threshold": 0.8,
        "sample_size_min": 100
    }
)

# Generate research insights
insights = analyzer.generate_research_insights(
    trial_data=outcomes,
    insight_types=[
        "treatment_effectiveness",
        "adverse_event_patterns",
        "patient_subgroup_analysis"
    ]
)
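The source does not define how `success_criteria` is applied. One plausible reading, sketched below, is that each value is an inclusive lower bound and a trial must satisfy all three. The record fields (`efficacy`, `safety`, `sample_size`), the 0 to 1 score scale, and the sample trials are assumptions made for illustration.

```python
def meets_success_criteria(trial, criteria):
    """True if the trial clears every threshold (assumed inclusive bounds)."""
    return (
        trial["efficacy"] >= criteria["efficacy_threshold"]
        and trial["safety"] >= criteria["safety_threshold"]
        and trial["sample_size"] >= criteria["sample_size_min"]
    )


criteria = {
    "efficacy_threshold": 0.7,
    "safety_threshold": 0.8,
    "sample_size_min": 100,
}

# Made-up trial summaries for demonstration only.
trials = [
    {"id": "trial-A", "efficacy": 0.82, "safety": 0.91, "sample_size": 240},
    {"id": "trial-B", "efficacy": 0.75, "safety": 0.60, "sample_size": 500},
    {"id": "trial-C", "efficacy": 0.90, "safety": 0.95, "sample_size": 40},
]

successful = [t["id"] for t in trials if meets_success_criteria(t, criteria)]
print(successful)
```

Here only `trial-A` passes: `trial-B` falls below the safety threshold and `trial-C` is too small.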
Configuration Options¶
Global Configuration¶
health_connectors:
  default_sources: ["cdc", "who"]
  privacy_compliance:
    enable_hipaa: true
    enable_gdpr: true
    deidentification_required: true
  data_quality:
    enable_validation: true
    strict_mode: true
    authority_check: true
  caching:
    enable_cache: true
    cache_ttl_hours: 6
    max_cache_size_gb: 50
  rate_limiting:
    requests_per_minute: 30
    burst_limit: 10
Source-Specific Settings¶
cdc:
  base_url: "https://data.cdc.gov"
  api_key: "${CDC_API_KEY}"
  format: "json"
  timeout_seconds: 30

who:
  base_url: "https://ghoapi.who.int/api"
  format: "json"
  language: "en"
  cache_enabled: true

pubmed:
  api_key: "${PUBMED_API_KEY}"
  email: "${PUBMED_EMAIL}"
  tool: "OpenMLCrawler"
  max_results_per_query: 100

clinicaltrials:
  base_url: "https://clinicaltrials.gov/api"
  format: "json"
  timeout_seconds: 30
  max_results: 1000
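The `${CDC_API_KEY}`-style values above look like environment variable placeholders, which keep credentials out of the configuration file. The source does not say whether OpenML Crawler expands them itself, so the sketch below shows one way to resolve them by hand once the settings are loaded into a dictionary. The helper `resolve_placeholders` is hypothetical, and the key set in the example is a dummy value for demonstration.

```python
import os


def resolve_placeholders(settings):
    """Recursively expand ${VAR} placeholders in string values."""
    resolved = {}
    for key, value in settings.items():
        if isinstance(value, dict):
            resolved[key] = resolve_placeholders(value)
        elif isinstance(value, str):
            # Unset variables are left as-is by os.path.expandvars.
            resolved[key] = os.path.expandvars(value)
        else:
            resolved[key] = value
    return resolved


# Demo only: in practice the variable is set outside the program.
os.environ["CDC_API_KEY"] = "demo-key"

settings = {
    "cdc": {
        "base_url": "https://data.cdc.gov",
        "api_key": "${CDC_API_KEY}",
        "timeout_seconds": 30,
    }
}
resolved = resolve_placeholders(settings)
print(resolved["cdc"]["api_key"])
```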
Best Practices¶
Data Privacy and Security¶
- De-identification: Always de-identify personal health information
- Access Controls: Implement strict access controls and audit logging
- Data Minimization: Collect only necessary health data fields
- Retention Limits: Implement data retention policies
- Secure Transmission: Use encrypted connections for data transfer
Data Quality Management¶
- Source Verification: Use only authoritative health data sources
- Cross-Validation: Validate data against multiple sources
- Regular Updates: Monitor data freshness and update frequencies
- Error Detection: Implement automated error detection and correction
- Documentation: Maintain comprehensive data lineage and metadata
Performance Optimization¶
- Intelligent Caching: Cache health data with a short TTL (the global configuration above uses 6 hours) so that frequently updated figures do not go stale
- Batch Processing: Use batch operations for large datasets
- Selective Queries: Query only required data fields and time ranges
- Rate Limiting: Respect API rate limits and implement backoff strategies
- Data Compression: Compress large health datasets for storage
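The backoff strategy mentioned under Rate Limiting can be sketched as a retry loop with exponentially growing, randomized waits. This is a generic pattern rather than the library's own mechanism: `call_with_backoff` is a hypothetical helper, and `RuntimeError` stands in for whatever exception a real client raises on a rate-limit response. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func, retrying on RuntimeError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))


# Demo: a fake request that is rejected twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


waits = []  # record the waits instead of sleeping
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

The random jitter spreads out retries from concurrent workers so they do not all hit the API again at the same moment.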
Troubleshooting¶
Common Issues¶
Privacy Compliance¶
Data Access Restrictions¶
Data Quality Issues¶
API Rate Limiting¶
See Also¶
- Connectors Overview - Overview of all data connectors
- Data Processing - Processing health data
- Quality & Privacy - Health data quality and privacy
- API Reference - Health connector API
- Tutorials - Health data tutorials