# Social Media Data Connectors

OpenML Crawler provides social media data connectors that access real-time and historical data from major platforms. The connectors cover a range of data types, including posts, comments, user profiles, hashtags, and engagement metrics.
## Supported Platforms
### Twitter API

Real-time and historical social media data from Twitter (now X), with access to tweets, users, and trends.

**Features:**

- Real-time tweet streaming
- Historical tweet search
- User profile data
- Hashtag and keyword tracking
- Engagement metrics (likes, retweets, replies)
- Geolocation data
- Media content (images, videos)

**Configuration:**

```yaml
connectors:
  social_media:
    twitter:
      bearer_token: "${TWITTER_BEARER_TOKEN}"
      api_key: "${TWITTER_API_KEY}"
      api_secret: "${TWITTER_API_SECRET}"
      access_token: "${TWITTER_ACCESS_TOKEN}"
      access_secret: "${TWITTER_ACCESS_SECRET}"
      rate_limit_buffer: 0.8  # stay at 80% of the platform rate limit
```

**Usage:**

```python
from openmlcrawler.connectors.social_media import TwitterConnector

connector = TwitterConnector(bearer_token="your_token")

# Search tweets
tweets = connector.search_tweets(
    query="#machinelearning",
    max_results=100,
    start_time="2023-01-01T00:00:00Z",
)

# Get a user's timeline
timeline = connector.get_user_timeline(
    username="elonmusk",
    max_results=50,
)

# Stream real-time tweets
stream = connector.stream_tweets(
    keywords=["AI", "machine learning"],
    languages=["en"],
)
```
### Reddit API

Community-driven content from Reddit, with access to posts, comments, and subreddit data.

**Features:**

- Subreddit post data
- Comment threads and replies
- User profile information
- Subreddit statistics
- Real-time post streaming
- Search functionality
- Moderation data

**Configuration:**

```yaml
connectors:
  social_media:
    reddit:
      client_id: "${REDDIT_CLIENT_ID}"
      client_secret: "${REDDIT_CLIENT_SECRET}"
      user_agent: "OpenMLCrawler/1.0"
      username: "${REDDIT_USERNAME}"  # Optional
      password: "${REDDIT_PASSWORD}"  # Optional
```

**Usage:**

```python
from openmlcrawler.connectors.social_media import RedditConnector

connector = RedditConnector(
    client_id="your_client_id",
    client_secret="your_client_secret",
    user_agent="OpenMLCrawler/1.0",
)

# Get subreddit posts
posts = connector.get_subreddit_posts(
    subreddit="MachineLearning",
    sort="hot",
    limit=100,
)

# Search Reddit
results = connector.search_reddit(
    query="artificial intelligence",
    subreddit="all",
    sort="relevance",
)

# Get post comments
comments = connector.get_post_comments(
    post_id="abc123",
    sort="top",
)
```
### Facebook Graph API

Social media analytics from Facebook, with access to pages, posts, and user data.

**Features:**

- Facebook page data
- Post engagement metrics
- User profile information
- Event data
- Group content
- Advertising insights
- Video analytics

**Configuration:**

```yaml
connectors:
  social_media:
    facebook:
      app_id: "${FACEBOOK_APP_ID}"
      app_secret: "${FACEBOOK_APP_SECRET}"
      access_token: "${FACEBOOK_ACCESS_TOKEN}"
      version: "v18.0"
```

**Usage:**

```python
from openmlcrawler.connectors.social_media import FacebookConnector

connector = FacebookConnector(
    app_id="your_app_id",
    app_secret="your_app_secret",
    access_token="your_access_token",
)

# Get page posts
posts = connector.get_page_posts(
    page_id="20531316728",  # Facebook's own page
    limit=50,
)

# Get post insights
insights = connector.get_post_insights(
    post_id="123456789",
    metrics=["likes", "comments", "shares", "reach"],
)
```
### Instagram API

Visual content data from Instagram, with access to media, stories, and user profiles.

**Features:**

- Media content (photos, videos, reels)
- User profile data
- Hashtag content
- Story data
- Comment and like data
- Business account insights
- Location-based content

**Configuration:**

```yaml
connectors:
  social_media:
    instagram:
      access_token: "${INSTAGRAM_ACCESS_TOKEN}"
      app_id: "${INSTAGRAM_APP_ID}"
      app_secret: "${INSTAGRAM_APP_SECRET}"
```

**Usage:**

```python
from openmlcrawler.connectors.social_media import InstagramConnector

connector = InstagramConnector(access_token="your_token")

# Get a user's media
media = connector.get_user_media(
    user_id="123456789",
    limit=20,
)

# Search hashtags
hashtag_media = connector.get_hashtag_media(
    hashtag="machinelearning",
    limit=50,
)
```
## Data Types and Parameters

### Common Data Fields

| Field | Description | Platforms |
|---|---|---|
| `content` | Text content of the post | All |
| `author` | User/profile information | All |
| `timestamp` | Creation timestamp | All |
| `engagement` | Likes, shares, and comments counts | All |
| `media` | Images, videos, links | All |
| `location` | Geographic location data | Twitter, Facebook |
| `hashtags` | Hashtag usage | All |
| `mentions` | User mentions | Twitter, Instagram |
| `sentiment` | Content sentiment score | All (added during processing) |
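For illustration, a record normalized to the common fields above might look like the following. The exact layout is an assumption for this sketch; consult the connector output for the authoritative schema.

```python
import re

# Illustrative normalized record using the common fields above
record = {
    "content": "Excited about the new transformer paper! #machinelearning",
    "author": {"id": "12345", "handle": "ml_fan"},
    "timestamp": "2023-06-01T12:34:56Z",
    "engagement": {"likes": 42, "shares": 7, "comments": 3},
    "media": [],
    "location": None,                # Twitter/Facebook only
    "hashtags": ["machinelearning"],
    "mentions": [],                  # Twitter/Instagram only
    "sentiment": 0.8,                # added during processing
}

# Hashtags can also be recovered from the raw text as a fallback
hashtags = re.findall(r"#(\w+)", record["content"])
print(hashtags)  # ['machinelearning']
```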
### Platform-Specific Fields

**Twitter:**

- Retweet count
- Quote tweet data
- Thread information
- Tweet entities (URLs, mentions, hashtags)
- Conversation ID

**Reddit:**

- Subreddit name
- Post score (upvotes minus downvotes)
- Comment nesting level
- Post flair
- Award information

**Facebook:**

- Post type (status, photo, video, link)
- Privacy settings
- Target audience
- Sponsored content flags

**Instagram:**

- Media type (photo, video, carousel, reel)
- Filter information
- Tagged users
- Story highlights
## Data Collection Strategies

### Real-time Streaming

```python
from openmlcrawler.connectors.social_media import SocialMediaStreamer

streamer = SocialMediaStreamer()

# Configure streaming per platform
config = {
    "twitter": {
        "keywords": ["AI", "machine learning"],
        "languages": ["en", "es", "fr"],
    },
    "reddit": {
        "subreddits": ["MachineLearning", "artificial"],
    },
}

# Start streaming; process_social_data is a user-defined callback
# invoked for each incoming item
streamer.start_streaming(config, callback=process_social_data)
```

### Historical Data Collection

```python
from openmlcrawler.connectors.social_media import HistoricalSocialCollector

collector = HistoricalSocialCollector()

# Collect Twitter history
twitter_data = collector.collect_twitter_history(
    query="#AI",
    start_date="2023-01-01",
    end_date="2023-12-31",
    max_results=10000,
)

# Collect Reddit history
reddit_data = collector.collect_reddit_history(
    subreddit="MachineLearning",
    sort="top",
    time_filter="year",
    limit=1000,
)
```

### Batch Processing

```python
from openmlcrawler.connectors.social_media import BatchSocialProcessor

processor = BatchSocialProcessor()

# Process multiple platforms in one pass
results = processor.process_batch(
    platforms=["twitter", "reddit", "facebook"],
    queries={
        "twitter": ["#AI", "#MachineLearning"],
        "reddit": ["artificial intelligence"],
        "facebook": ["AI technology"],
    },
    date_range=("2023-01-01", "2023-12-31"),
)
```
## Data Quality and Validation

### Quality Checks

- **Content Validation**: flag spam, bots, and low-quality content
- **Duplicate Detection**: identify duplicate posts and users
- **Language Detection**: verify that content language matches the configured filters
- **Geographic Validation**: check the plausibility of location data
- **Engagement Authenticity**: detect artificial engagement patterns
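The duplicate-detection check above can be sketched with content hashing. This is a minimal, library-independent example; the `posts` structure and the normalization rules are assumptions for illustration.

```python
import hashlib

def dedupe_posts(posts):
    """Drop posts whose normalized text has been seen before."""
    seen, unique = set(), []
    for post in posts:
        # Normalize whitespace and case so trivial reposts hash identically
        normalized = " ".join(post["content"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(post)
    return unique

posts = [
    {"content": "AI is amazing"},
    {"content": "AI  is amazing "},   # whitespace/case variant of the first
    {"content": "Totally different"},
]
print(len(dedupe_posts(posts)))  # 2
```

Hashing exact normalized text only catches verbatim reposts; near-duplicate detection would need fuzzier techniques such as MinHash or SimHash.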
### Rate Limiting and Throttling

```python
from openmlcrawler.connectors.social_media import RateLimitedConnector

connector = RateLimitedConnector(
    platform="twitter",
    requests_per_minute=300,
    burst_limit=50,
)

# Requests are automatically throttled to stay within the limits above
data = connector.get_data_with_rate_limit(query="AI")
```
## Sentiment Analysis Integration

### Built-in Sentiment Analysis

```python
from openmlcrawler.connectors.social_media import SentimentAnalyzer

analyzer = SentimentAnalyzer()

# Analyze sentiment of collected social data
results = analyzer.analyze_sentiment(
    social_data=data,
    model="vader",  # one of: vader, textblob, transformers
    languages=["en", "es", "fr"],
)

# Get sentiment trends over time
trends = analyzer.get_sentiment_trends(
    data=results,
    time_window="daily",
    group_by="platform",
)
```

### Custom Sentiment Models

```python
from openmlcrawler.connectors.social_media import CustomSentimentAnalyzer

analyzer = CustomSentimentAnalyzer(
    model_path="path/to/custom/model",
    tokenizer_path="path/to/tokenizer",
)

# Analyze with the custom model
sentiment_scores = analyzer.analyze_custom(
    texts=social_texts,
    batch_size=32,
)
```
## Privacy and Compliance

### Data Privacy Considerations

- **User Consent**: respect platform privacy policies and terms of service
- **Data Minimization**: collect only the data fields you need
- **Anonymization**: remove personally identifiable information (PII)
- **Retention Policies**: enforce data retention limits
- **Access Controls**: secure API credentials and restrict data access
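The anonymization step can be sketched with simple pattern masking. `anonymize_text` is a hypothetical helper, not part of the library, and the regexes below only catch common surface patterns; real PII removal needs a more thorough approach.

```python
import re

def anonymize_text(text):
    """Mask common PII patterns (emails, handles, phone-like numbers)."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)  # email addresses
    text = re.sub(r"@\w+", "[USER]", text)                      # @handles
    text = re.sub(r"\+?\d[\d\s()-]{7,}\d", "[PHONE]", text)     # phone-like numbers
    return text

masked = anonymize_text("Contact @jane or jane@example.com, +1 555 123 4567")
print(masked)  # Contact [USER] or [EMAIL], [PHONE]
```

The email rule runs first so that the `@handle` rule does not clip the domain part out of an address.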
### GDPR Compliance

```python
from openmlcrawler.connectors.social_media import GDPRCompliantConnector

connector = GDPRCompliantConnector(
    enable_anonymization=True,
    retention_days=90,
    consent_required=True,
)

# GDPR-compliant data collection
data = connector.collect_gdpr_compliant(
    query="AI research",
    anonymize_pii=True,
    consent_verified=True,
)
```
## Configuration Options

### Global Configuration

```yaml
social_media_connectors:
  default_platforms: ["twitter", "reddit"]
  rate_limiting:
    global_requests_per_minute: 1000
    per_platform_limits:
      twitter: 300
      reddit: 600
      facebook: 200
      instagram: 200
  data_quality:
    enable_spam_filter: true
    enable_duplicate_detection: true
    min_content_length: 10
  privacy:
    enable_anonymization: true
    retention_days: 90
    gdpr_compliance: true
```

### Platform-Specific Settings

```yaml
twitter:
  enable_premium_api: false
  include_retweets: true
  include_replies: false
  tweet_mode: "extended"

reddit:
  enable_praw: true
  subreddit_limit: 100
  comment_depth: 5

facebook:
  enable_insights: true
  page_access_token_required: true
  include_video_insights: false
```
## Best Practices

### Performance Optimization

- **Selective Data Collection**: collect only the fields you will actually use
- **Efficient Queries**: use platform-specific query operators to narrow results server-side
- **Caching Strategy**: cache frequently accessed data
- **Batch Operations**: combine multiple requests where the API allows it
### Cost Management

- **API Tier Selection**: choose the API tier that matches your volume
- **Usage Monitoring**: track API usage and costs
- **Fallback Strategies**: use free tiers for development and testing
- **Data Sampling**: sample data for analysis instead of collecting everything
### Data Reliability

- **Multiple Platforms**: cross-check findings across several social platforms
- **Error Recovery**: implement robust error handling and retries
- **Data Validation**: run comprehensive data quality checks
- **Monitoring**: monitor collection health and data quality
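The error-recovery bullet above typically means retrying transient failures with exponential backoff. A minimal, library-independent sketch (the `fetch` callable and delay values are illustrative):

```python
import random
import time

def fetch_with_retries(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch() with exponential backoff and jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Backoff: base, 2*base, 4*base, ... plus a little jitter
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)

# Demo: a callable that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(fetch_with_retries(flaky, base_delay=0.01))  # ok
```

In production you would retry only on transient errors (timeouts, HTTP 429/5xx) and honor any `Retry-After` header the platform returns.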
## Troubleshooting

### Common Issues

#### Authentication Errors

Verify that API credentials are valid and that the environment variables referenced in the configuration (e.g. `TWITTER_BEARER_TOKEN`) are set in the environment running the crawler.

#### Rate Limiting

Lower `requests_per_minute`, raise the `rate_limit_buffer`, or spread collection over a longer window. Most platforms return HTTP 429 when limits are exceeded, often with a `Retry-After` header.

#### Permission Denied

Check that the access token carries the scopes the endpoint requires; some endpoints (for example Facebook page insights) need a page-level access token rather than a user token.

#### Data Unavailable

Private accounts, deleted posts, and region- or age-restricted content are not returned by the platform APIs, and historical availability varies by API tier.
## See Also

- Connectors Overview - overview of all data connectors
- Data Processing - processing social media data
- Quality & Privacy - social data quality and privacy
- API Reference - social media connector API
- Tutorials - social media data tutorials