Social Media Data Connectors

OpenML Crawler provides social media data connectors that access real-time and historical data from the major platforms. The connectors cover posts, comments, user profiles, hashtags, and engagement metrics.

Supported Platforms

Twitter API

Real-time social media data from Twitter (now X) with comprehensive access to tweets, users, and trends.

Features:

  • Real-time tweet streaming
  • Historical tweet search
  • User profile data
  • Hashtag and keyword tracking
  • Engagement metrics (likes, retweets, replies)
  • Geolocation data
  • Media content (images, videos)

Configuration:

connectors:
  social_media:
    twitter:
      bearer_token: "${TWITTER_BEARER_TOKEN}"
      api_key: "${TWITTER_API_KEY}"
      api_secret: "${TWITTER_API_SECRET}"
      access_token: "${TWITTER_ACCESS_TOKEN}"
      access_secret: "${TWITTER_ACCESS_SECRET}"
      rate_limit_buffer: 0.8  # throttle at 80% of the platform limit to leave headroom

Usage:

from openmlcrawler.connectors.social_media import TwitterConnector

connector = TwitterConnector(bearer_token="your_token")

# Search tweets
tweets = connector.search_tweets(
    query="#machinelearning",
    max_results=100,
    start_time="2023-01-01T00:00:00Z"
)

# Get user timeline
timeline = connector.get_user_timeline(
    username="elonmusk",
    max_results=50
)

# Stream real-time tweets
stream = connector.stream_tweets(
    keywords=["AI", "machine learning"],
    languages=["en"]
)
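
The stream object can then be consumed as tweets arrive. A minimal sketch, assuming the stream is iterable and yields dict-like tweet records (check the connector's actual streaming interface):

# Consume the live stream (assumed to be an iterable of dict-like tweets)
for tweet in stream:
    print(tweet["id"], tweet["text"])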

Reddit API

Community-driven content from Reddit with access to posts, comments, and subreddit data.

Features:

  • Subreddit post data
  • Comment threads and replies
  • User profile information
  • Subreddit statistics
  • Real-time post streaming
  • Search functionality
  • Moderation data

Configuration:

connectors:
  social_media:
    reddit:
      client_id: "${REDDIT_CLIENT_ID}"
      client_secret: "${REDDIT_CLIENT_SECRET}"
      user_agent: "OpenMLCrawler/1.0"
      username: "${REDDIT_USERNAME}"  # Optional
      password: "${REDDIT_PASSWORD}"  # Optional

Usage:

from openmlcrawler.connectors.social_media import RedditConnector

connector = RedditConnector(
    client_id="your_client_id",
    client_secret="your_client_secret",
    user_agent="OpenMLCrawler/1.0"
)

# Get subreddit posts
posts = connector.get_subreddit_posts(
    subreddit="MachineLearning",
    sort="hot",
    limit=100
)

# Search Reddit
results = connector.search_reddit(
    query="artificial intelligence",
    subreddit="all",
    sort="relevance"
)

# Get post comments
comments = connector.get_post_comments(
    post_id="abc123",
    sort="top"
)

Facebook Graph API

Social media analytics from Facebook with access to pages, posts, and user data.

Features:

  • Facebook page data
  • Post engagement metrics
  • User profile information
  • Event data
  • Group content
  • Advertising insights
  • Video analytics

Configuration:

connectors:
  social_media:
    facebook:
      app_id: "${FACEBOOK_APP_ID}"
      app_secret: "${FACEBOOK_APP_SECRET}"
      access_token: "${FACEBOOK_ACCESS_TOKEN}"
      version: "v18.0"

Usage:

from openmlcrawler.connectors.social_media import FacebookConnector

connector = FacebookConnector(
    app_id="your_app_id",
    app_secret="your_app_secret",
    access_token="your_access_token"
)

# Get page posts
posts = connector.get_page_posts(
    page_id="20531316728",  # Facebook's page
    limit=50
)

# Get post insights
insights = connector.get_post_insights(
    post_id="123456789",
    metrics=["likes", "comments", "shares", "reach"]
)

Instagram API

Visual content data from Instagram with access to media, stories, and user profiles.

Features:

  • Media content (photos, videos, reels)
  • User profile data
  • Hashtag content
  • Story data
  • Comment and like data
  • Business account insights
  • Location-based content

Configuration:

connectors:
  social_media:
    instagram:
      access_token: "${INSTAGRAM_ACCESS_TOKEN}"
      app_id: "${INSTAGRAM_APP_ID}"
      app_secret: "${INSTAGRAM_APP_SECRET}"

Usage:

from openmlcrawler.connectors.social_media import InstagramConnector

connector = InstagramConnector(access_token="your_token")

# Get user media
media = connector.get_user_media(
    user_id="123456789",
    limit=20
)

# Search hashtags
hashtag_media = connector.get_hashtag_media(
    hashtag="machinelearning",
    limit=50
)

Data Types and Parameters

Common Data Fields

Field        Description                      Platforms
content      Text content of the post         All
author       User/profile information         All
timestamp    Creation timestamp               All
engagement   Likes, shares, comments count    All
media        Images, videos, links            All
location     Geographic location data         Twitter, Facebook
hashtags     Hashtag usage                    All
mentions     User mentions                    Twitter, Instagram
sentiment    Content sentiment analysis       All (processed)
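
Because each platform names these fields differently, it helps to normalize records into a common shape before analysis. A minimal sketch using a dataclass; the field names follow the table above, but the class itself is illustrative and not part of the connector API:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SocialPost:
    """Normalized post record using the common fields above."""
    content: str
    author: str
    timestamp: str
    platform: str
    engagement: dict = field(default_factory=dict)   # likes, shares, comments
    media: list = field(default_factory=list)        # image/video/link URLs
    hashtags: list = field(default_factory=list)
    location: Optional[str] = None                   # Twitter, Facebook only
    mentions: list = field(default_factory=list)     # Twitter, Instagram only
    sentiment: Optional[float] = None                # filled in during post-processing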

Platform-Specific Fields

Twitter:

  • Retweet count
  • Quote tweet data
  • Thread information
  • Tweet entities (URLs, mentions, hashtags)
  • Conversation ID

Reddit:

  • Subreddit name
  • Post score (upvotes - downvotes)
  • Comment nesting level
  • Post flair
  • Award information

Facebook:

  • Post type (status, photo, video, link)
  • Privacy settings
  • Target audience
  • Sponsored content flags

Instagram:

  • Media type (photo, video, carousel, reel)
  • Filter information
  • Tagged users
  • Story highlights

Data Collection Strategies

Real-time Streaming

from openmlcrawler.connectors.social_media import SocialMediaStreamer

streamer = SocialMediaStreamer()

# Configure streaming
config = {
    "twitter": {
        "keywords": ["AI", "machine learning"],
        "languages": ["en", "es", "fr"]
    },
    "reddit": {
        "subreddits": ["MachineLearning", "artificial"]
    }
}

# Start streaming
streamer.start_streaming(config, callback=process_social_data)
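
The process_social_data callback is user-supplied. A minimal sketch, assuming the streamer passes each record as a dict with platform and content keys (verify the actual callback signature):

def process_social_data(record):
    """Handle one streamed record; keep this fast so the stream is not blocked."""
    platform = record.get("platform", "unknown")
    text = record.get("content", "")
    if len(text) < 10:  # mirrors the min_content_length quality setting
        return
    print(f"[{platform}] {text[:80]}")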

Historical Data Collection

from openmlcrawler.connectors.social_media import HistoricalSocialCollector

collector = HistoricalSocialCollector()

# Collect Twitter history
twitter_data = collector.collect_twitter_history(
    query="#AI",
    start_date="2023-01-01",
    end_date="2023-12-31",
    max_results=10000
)

# Collect Reddit history
reddit_data = collector.collect_reddit_history(
    subreddit="MachineLearning",
    sort="top",
    time_filter="year",
    limit=1000
)

Batch Processing

from openmlcrawler.connectors.social_media import BatchSocialProcessor

processor = BatchSocialProcessor()

# Process multiple platforms
results = processor.process_batch(
    platforms=["twitter", "reddit", "facebook"],
    queries={
        "twitter": ["#AI", "#MachineLearning"],
        "reddit": ["artificial intelligence"],
        "facebook": ["AI technology"]
    },
    date_range=("2023-01-01", "2023-12-31")
)

Data Quality and Validation

Quality Checks

  1. Content Validation: Check for spam, bots, and low-quality content
  2. Duplicate Detection: Identify duplicate posts and users
  3. Language Detection: Verify content language matches specified filters
  4. Geographic Validation: Validate location data accuracy
  5. Engagement Authenticity: Detect artificial engagement patterns
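
A minimal sketch of the first three checks, using the standard library plus the third-party langdetect package; exact-match hashing stands in for duplicate detection, which production pipelines usually handle with near-duplicate methods:

import hashlib

from langdetect import detect  # third-party: pip install langdetect

seen_hashes = set()

def passes_quality_checks(post, expected_lang="en"):
    text = post.get("content", "").strip()
    # 1. Content validation: drop empty or very short posts
    if len(text) < 10:
        return False
    # 2. Duplicate detection: exact-match content hashing
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    # 3. Language detection: confirm content matches the requested filter
    return detect(text) == expected_lang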

Rate Limiting and Throttling

from openmlcrawler.connectors.social_media import RateLimitedConnector

connector = RateLimitedConnector(
    platform="twitter",
    requests_per_minute=300,
    burst_limit=50
)

# Automatic rate limiting
data = connector.get_data_with_rate_limit(query="AI")

Sentiment Analysis Integration

Built-in Sentiment Analysis

from openmlcrawler.connectors.social_media import SentimentAnalyzer

analyzer = SentimentAnalyzer()

# Analyze sentiment
results = analyzer.analyze_sentiment(
    social_data=data,
    model="vader",  # vader, textblob, transformers
    languages=["en", "es", "fr"]
)

# Get sentiment trends
trends = analyzer.get_sentiment_trends(
    data=results,
    time_window="daily",
    group_by="platform"
)

Custom Sentiment Models

from openmlcrawler.connectors.social_media import CustomSentimentAnalyzer

analyzer = CustomSentimentAnalyzer(
    model_path="path/to/custom/model",
    tokenizer_path="path/to/tokenizer"
)

# Analyze with custom model
sentiment_scores = analyzer.analyze_custom(
    texts=social_texts,
    batch_size=32
)

Privacy and Compliance

Data Privacy Considerations

  1. User Consent: Respect platform privacy policies
  2. Data Minimization: Collect only necessary data fields
  3. Anonymization: Remove personally identifiable information (see the sketch after this list)
  4. Retention Policies: Implement data retention limits
  5. Access Controls: Secure API credentials and data access
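
Pattern-based scrubbing covers the most common cases of point 3 before storage. A minimal sketch using regular expressions; these patterns catch @handles, email addresses, and phone numbers but are not exhaustive:

import re

# Order matters: scrub emails before handles so "@domain" fragments
# inside addresses are not mangled first.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<phone>"),
    (re.compile(r"@\w{1,30}"), "@user"),
]

def anonymize(text):
    """Replace common PII patterns; not a substitute for a full PII pipeline."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text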

GDPR Compliance

from openmlcrawler.connectors.social_media import GDPRCompliantConnector

connector = GDPRCompliantConnector(
    enable_anonymization=True,
    retention_days=90,
    consent_required=True
)

# GDPR-compliant data collection
data = connector.collect_gdpr_compliant(
    query="AI research",
    anonymize_pii=True,
    consent_verified=True
)

Configuration Options

Global Configuration

social_media_connectors:
  default_platforms: ["twitter", "reddit"]
  rate_limiting:
    global_requests_per_minute: 1000
    per_platform_limits:
      twitter: 300
      reddit: 600
      facebook: 200
      instagram: 200
  data_quality:
    enable_spam_filter: true
    enable_duplicate_detection: true
    min_content_length: 10
  privacy:
    enable_anonymization: true
    retention_days: 90
    gdpr_compliance: true

Platform-Specific Settings

twitter:
  enable_premium_api: false
  include_retweets: true
  include_replies: false
  tweet_mode: "extended"

reddit:
  enable_praw: true
  subreddit_limit: 100
  comment_depth: 5

facebook:
  enable_insights: true
  page_access_token_required: true
  include_video_insights: false

Best Practices

Performance Optimization

  1. Selective Data Collection: Only collect relevant fields
  2. Efficient Queries: Use platform-specific query optimization
  3. Caching Strategy: Cache frequently accessed data (see the sketch after this list)
  4. Batch Operations: Combine multiple requests when possible
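
For the caching strategy, a time-bounded in-memory cache keyed by query is often enough. A minimal sketch; the cached_search wrapper and its TTL are illustrative, not part of the connector API:

import time

_cache = {}  # query -> (fetch timestamp, results)

def cached_search(connector, query, ttl_seconds=300):
    """Return cached results for a query while they are newer than ttl_seconds."""
    now = time.time()
    if query in _cache:
        fetched_at, results = _cache[query]
        if now - fetched_at < ttl_seconds:
            return results
    results = connector.search_tweets(query=query, max_results=100)
    _cache[query] = (now, results)
    return results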

Cost Management

  1. API Tier Selection: Choose appropriate API tiers
  2. Usage Monitoring: Track API usage and costs
  3. Fallback Strategies: Use free tiers for development
  4. Data Sampling: Sample data for analysis instead of full collection (see the sketch below)
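
For data sampling, reservoir sampling keeps a fixed-size uniform sample without holding the full stream in memory. A minimal sketch:

import random

def reservoir_sample(stream, k=1000):
    """Keep a uniform random sample of k items from an arbitrarily long stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample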

Data Reliability

  1. Multiple Platforms: Use multiple social platforms for validation
  2. Error Recovery: Implement robust error handling and retries
  3. Data Validation: Comprehensive data quality checks
  4. Monitoring: Monitor collection health and data quality

Troubleshooting

Common Issues

Authentication Errors

Error: Invalid API credentials
Solution: Verify API keys, tokens, and permissions

Rate Limiting

Error: API rate limit exceeded
Solution: Implement exponential backoff and reduce request frequency
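
A minimal sketch of the suggested backoff; RateLimitError is a placeholder for whatever exception your platform client raises:

import random
import time

class RateLimitError(Exception):
    """Placeholder; substitute your client's rate-limit exception."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry fetch() with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    raise RuntimeError("rate limit persisted after retries")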

Permission Denied

Error: Insufficient permissions
Solution: Check API scopes and request necessary permissions

Data Unavailable

Error: Content not available
Solution: Skip and log the item; the content may have been deleted, made private, or restricted

See Also