---
name: llm-intelligent-public-opinion-analytics
description: Deploy and use an LLM-powered public opinion analytics assistant that crawls 26 hot lists from 15 platforms, performs sentiment analysis, topic clustering, and multi-channel alerting
triggers:
  - set up public opinion monitoring system
  - analyze social media trending topics
  - deploy sentiment analysis crawler
  - configure hot topic push notifications
  - cluster trending news topics
  - monitor multiple platform hot searches
  - build opinion analytics dashboard
  - aggregate cross-platform trending content
---

# LLM-Based Intelligent Public Opinion Analytics Assistant

> Skill by [ara.so](https://ara.so) — Data Skills collection.

## Overview

This project is a public opinion analytics platform that combines real-time data from **26 hot lists across 15 mainstream platforms** (Weibo, Bilibili, Zhihu, Baidu, etc.) with large language model (LLM) analysis. It provides a conversational query interface for hot searches, topic clustering, sentiment analysis, and multi-channel push notifications (email, WeChat Work, Telegram).

**Key Capabilities:**
- Real-time crawler cluster covering 15 platforms
- LLM-powered content analysis (including video content extraction)
- Natural language query interface
- Topic clustering and sentiment analysis
- Multi-channel alert system (Email, WeChat Work, Telegram)
- Keyboard shortcuts for crawler control

## Installation

### Prerequisites

1. **Browser Driver Setup** (Required for detail page scraping):

```bash
# Check your Chrome/Edge version first
# Chrome: chrome://settings/help
# Edge: edge://settings/help

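# Download the driver that matches your browser version:
# ChromeDriver (Chrome 114 and earlier): https://chromedriver.chromium.org/
# ChromeDriver (Chrome 115 and later): https://googlechromelabs.github.io/chrome-for-testing/
# EdgeDriver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/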

# Linux/macOS - place driver in PATH:
sudo mv chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver

# Verify installation:
chromedriver --version
```

2. **MySQL Database**:

```bash
# Install MySQL 8.0+, then create the database and user.
# The heredoc feeds the SQL statements to the mysql client,
# which still prompts for the root password on the terminal.
mysql -u root -p <<'SQL'
CREATE DATABASE hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hotsearch_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'hotsearch_user'@'localhost';
FLUSH PRIVILEGES;
SQL
```

3. **Python Environment**:

```bash
# Clone repository
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Database Initialization

Refer to `init.py` to create the necessary tables:

```python
# Example table structure (adapt from init.py)
import pymysql

connection = pymysql.connect(
    host='localhost',
    user='hotsearch_user',
    password='your_password',
    database='hotsearch_db',
    charset='utf8mb4'
)

cursor = connection.cursor()

# Hot search items table
# Note: RANK is a reserved word in MySQL 8.0.2+, so the column name
# is quoted with backticks.
cursor.execute("""
CREATE TABLE IF NOT EXISTS hot_search_items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    platform VARCHAR(50) NOT NULL,
    `rank` INT,
    title VARCHAR(500) NOT NULL,
    url VARCHAR(1000),
    heat_value VARCHAR(100),
    crawl_time DATETIME NOT NULL,
    detail_content TEXT,
    sentiment VARCHAR(20),
    INDEX idx_platform (platform),
    INDEX idx_crawl_time (crawl_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
""")

connection.commit()
connection.close()
```

## Configuration

### Environment Variables

Create a `.env` file in the project root:

```bash
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=hotsearch_db

# LLM API Configuration (OpenAI-compatible format)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://your-llm-endpoint.com/v1
OPENAI_MODEL=gpt-4

# Huawei Pangu Model (recommended alternative)
PANGU_API_KEY=your_pangu_key
PANGU_API_BASE=https://pangu-api.huaweicloud.com

# Push Notification Channels
# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password

# WeChat Work Bot
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY

# WeChat Work Application
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_APP_SECRET=your_app_secret
WECHAT_WORK_AGENT_ID=your_agent_id

# Telegram Bot
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
```

### Crawler Settings

Edit `hotsearchcrawler/settings.py`:

```python
import os

# MySQL Connection Pool
MYSQL_CONFIG = {
    'host': os.getenv('MYSQL_HOST', 'localhost'),
    'port': int(os.getenv('MYSQL_PORT', 3306)),
    'user': os.getenv('MYSQL_USER'),
    'password': os.getenv('MYSQL_PASSWORD'),
    'database': os.getenv('MYSQL_DATABASE'),
    'charset': 'utf8mb4',
    'autocommit': True
}

# Optional: Platform-specific cookies for authenticated access
PLATFORM_COOKIES = {
    'weibo': 'your_weibo_cookies',  # Optional, for better access
    'bilibili': 'your_bilibili_cookies'
}

# Concurrent requests
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# User-Agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
]
```
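
`USER_AGENTS` is a custom list rather than a built-in Scrapy setting, so something has to read it and apply it to outgoing requests. The project's own middleware is not shown here; the sketch below is one way this could look, and the class name `RotateUserAgentMiddleware` is hypothetical. It relies only on Scrapy's standard downloader-middleware hooks (`from_crawler` and `process_request`).

```python
import random


class RotateUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware; the list
        # comes from the custom USER_AGENTS setting in settings.py.
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        # Overwrite the header only when a list was configured.
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy continue with the request
```

To try it, the class would need to be registered in `DOWNLOADER_MIDDLEWARES` under whatever module path you place it in.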

## Usage

### Starting the System

```bash
# Activate virtual environment
source venv/bin/activate

# Start the main application (web interface + API)
python app.py

# Access web interface at http://localhost:5000
```

### Crawler Management

```bash
# Manual crawler test (single platform)
cd hotsearchcrawler
python runspider-test.py

# Start all crawlers (typically triggered via web UI)
python run_spiders.py
```

**Via Web Interface:**
- Use keyboard shortcuts to start/stop crawlers
- View real-time crawling status
- Monitor data collection metrics

### Natural Language Queries

```python
# Examples of conversational queries via web interface:

# "Show me today's top 10 trending topics on Weibo"
# "What's trending about AI technology across all platforms?"
# "Analyze sentiment for news about electric vehicles"
# "Cluster topics related to economic policy"
# "Compare hot topics between Bilibili and Zhihu"
```

### Programmatic API Usage

```python
from hotsearch_analysis_agent.analyzer import OpinionAnalyzer
from datetime import datetime, timedelta

# Initialize analyzer
analyzer = OpinionAnalyzer()

# Query hot searches
results = analyzer.query_hot_searches(
    platforms=['weibo', 'zhihu', 'bilibili'],
    time_range=(datetime.now() - timedelta(hours=24), datetime.now()),
    keyword='人工智能'
)

# Perform sentiment analysis
sentiment = analyzer.analyze_sentiment(results)
print(f"Overall sentiment: {sentiment['overall']}")
print(f"Positive: {sentiment['positive_ratio']}%")

# Topic clustering
clusters = analyzer.cluster_topics(results, num_clusters=5)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster['keywords']}")
    print(f"  Items: {len(cluster['items'])}")
```
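
To make the shape of the sentiment result more concrete, here is a small sketch of how per-item labels could be rolled up into the `overall` and `positive_ratio` fields used above. This is an illustration, not the analyzer's actual logic: the label values (`positive`, `negative`, `neutral`), the extra ratio fields, and the function name `summarize_sentiment` are all assumptions.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")  # assumed label set


def summarize_sentiment(labels):
    """Roll per-item sentiment labels up into percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {"overall": None, **{f"{name}_ratio": 0.0 for name in LABELS}}
    ratios = {
        f"{name}_ratio": round(100 * counts[name] / total, 1)
        for name in LABELS
    }
    # The most frequent label is reported as the overall sentiment.
    overall = counts.most_common(1)[0][0]
    return {"overall": overall, **ratios}


summary = summarize_sentiment(["positive", "positive", "negative", "neutral"])
print(summary["overall"], summary["positive_ratio"])  # positive 50.0
```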

### Push Notification Setup

```python
from hotsearch_analysis_agent.push_service import PushService

# Initialize push service
push_service = PushService()

# Create scheduled push task
task = push_service.create_task(
    name="AI Technology Daily Report",
    keywords=['人工智能', '大模型', '机器学习'],
    platforms=['weibo', 'zhihu', 'bilibili'],
    schedule='0 8,12,18 * * *',  # Cron format: 8am, 12pm, 6pm daily
    channels=['wechat_work', 'email'],
    threshold={'heat_value': 100000, 'sentiment': 'positive'}
)

# Test the push task from a shell (this is a command, not Python code):
#   python test_push_task.py
```
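
For reference, the sketch below shows roughly what a push to two of the channels looks like at the HTTP level. It builds the requests without sending them and uses only the standard library. The function names are hypothetical and this is not the project's `PushService` code; the payload shapes follow the WeChat Work group-bot webhook (`msgtype` / `text.content`) and the Telegram Bot API `sendMessage` method.

```python
import json
import urllib.request


def build_wechat_work_request(webhook_url, text):
    """Build (not send) a text message for a WeChat Work group bot."""
    payload = {"msgtype": "text", "text": {"content": text}}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_telegram_request(bot_token, chat_id, text):
    """Build (not send) a Telegram Bot API sendMessage call."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = {"chat_id": chat_id, "text": text}
    return urllib.request.Request(
        url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values matching the .env example; nothing is sent here.
wechat_req = build_wechat_work_request(
    "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY",
    "Hot topic alert: 人工智能",
)
telegram_req = build_telegram_request("your_bot_token", "your_chat_id", "Hot topic alert")
```

Sending is then a matter of `urllib.request.urlopen(wechat_req, timeout=10)`, ideally wrapped in error handling.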

### Analysis Report Generation

```python
from datetime import datetime, timedelta

from hotsearch_analysis_agent.report_generator import ReportGenerator

generator = ReportGenerator()

# Generate comprehensive report
report = generator.generate_report(
    topic="人工智能与前沿科技",
    time_range=(datetime.now() - timedelta(days=7), datetime.now()),
    include_sentiment=True,
    include_clustering=True,
    include_trend_analysis=True
)

# Report includes:
# - Core findings with data highlights
# - Detailed news content with source URLs
# - Sentiment distribution
# - Topic clusters
# - Trend analysis
# - Information spread characteristics

# Save report
report.save_markdown('output/ai_tech_report.md')
report.save_pdf('output/ai_tech_report.pdf')
```

## Common Patterns

### Multi-Platform Data Aggregation

```python
from hotsearch_analysis_agent.aggregator import DataAggregator

aggregator = DataAggregator()

# Fetch and merge data from multiple platforms
merged_data = aggregator.aggregate(
    platforms=['weibo', 'douyin', 'zhihu', 'bilibili', 'baidu'],
    dedup_threshold=0.8,  # Similarity threshold for deduplication
    sort_by='heat_value',
    limit=50
)

# Cross-platform topic correlation
correlations = aggregator.find_correlations(merged_data)
print(f"Found {len(correlations)} cross-platform trending topics")
```
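
The `dedup_threshold` parameter suggests similarity-based deduplication of titles. How the aggregator measures similarity is not documented here, so the sketch below swaps in `difflib.SequenceMatcher` from the standard library as a stand-in; the function name `dedup_titles` and the item shape (a dict with a `title` key) are assumptions.

```python
from difflib import SequenceMatcher


def dedup_titles(items, threshold=0.8):
    """Keep the first item of each group of near-identical titles."""
    kept = []
    for item in items:
        is_duplicate = any(
            SequenceMatcher(None, item["title"], seen["title"]).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(item)
    return kept


items = [
    {"platform": "weibo", "title": "AI model released today"},
    {"platform": "zhihu", "title": "AI model released today!"},
    {"platform": "baidu", "title": "Stock market falls"},
]
unique = dedup_titles(items, threshold=0.8)
print([item["platform"] for item in unique])  # ['weibo', 'baidu']
```

Because the first occurrence wins, sorting the input by heat before deduplicating keeps the hottest copy of each topic. The pairwise comparison is quadratic, which is fine for a list capped at around 50 items but would need a different approach for large batches.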

### Video Content Analysis

```python
# The system automatically extracts text from video news
# using browser automation and LLM analysis

from hotsearch_analysis_agent.video_analyzer import VideoAnalyzer

video_analyzer = VideoAnalyzer()

# Analyze video-based hot topics (e.g., from Bilibili, Douyin)
video_topics = video_analyzer.extract_content(
    url='https://www.bilibili.com/video/BV13pSoBBEvX/',
    extract_comments=True,
    max_comments=100
)

print(f"Video title: {video_topics['title']}")
print(f"Description: {video_topics['description']}")
print(f"Top comments sentiment: {video_topics['comments_sentiment']}")
```

### Custom LLM Integration

```python
import os

from hotsearch_analysis_agent.llm_client import LLMClient

# Use Huawei Pangu Model (recommended)
llm = LLMClient(
    api_base=os.getenv('PANGU_API_BASE'),
    api_key=os.getenv('PANGU_API_KEY'),
    model='pangu-embedded-7b'
)

# Or use any OpenAI-compatible endpoint
llm = LLMClient(
    api_base=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    model='gpt-4'
)

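# Analyze custom content (replace the placeholder with the text to analyze)
news_content = "Replace with the news text to analyze"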
analysis = llm.analyze(
    content=news_content,
    task='sentiment_and_summary',
    language='zh'
)
```

### Scheduled Monitoring

```python
from hotsearch_analysis_agent.scheduler import MonitorScheduler

scheduler = MonitorScheduler()

# Add monitoring rule
scheduler.add_rule(
    name="Tech Company Crisis Monitoring",
    keywords=['某公司', '丑闻', '争议'],
    alert_conditions={
        'heat_spike': 2.0,  # 2x normal heat
        'sentiment_drop': -0.3,  # 30% sentiment decrease
        'platforms_count': 3  # Trending on 3+ platforms
    },
    notification_channels=['wechat_work', 'telegram', 'email'],
    urgent=True
)

# Start scheduler
scheduler.start()
```
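
The way `alert_conditions` are combined is not spelled out above. The sketch below shows one plausible reading, under loudly stated assumptions: all three conditions must hold at once, `heat_spike` is a ratio against a baseline heat value, and `sentiment_drop` is an absolute change in a sentiment score on a -1 to 1 scale. The function name `should_alert` and the input dict shapes are hypothetical.

```python
def should_alert(current, baseline, conditions):
    """Return True when every configured condition is met (assumed AND).

    current:  {'heat': float, 'sentiment': float, 'platforms': set}
    baseline: {'heat': float, 'sentiment': float}
    """
    if baseline["heat"]:
        heat_ratio = current["heat"] / baseline["heat"]
    else:
        heat_ratio = float("inf")  # no baseline: treat any heat as a spike
    sentiment_change = current["sentiment"] - baseline["sentiment"]
    return (
        heat_ratio >= conditions["heat_spike"]
        and sentiment_change <= conditions["sentiment_drop"]
        and len(current["platforms"]) >= conditions["platforms_count"]
    )


conditions = {"heat_spike": 2.0, "sentiment_drop": -0.3, "platforms_count": 3}
baseline = {"heat": 100_000, "sentiment": 0.5}
current = {"heat": 250_000, "sentiment": 0.1,
           "platforms": {"weibo", "zhihu", "bilibili"}}
print(should_alert(current, baseline, conditions))  # True
```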

## Troubleshooting

### Browser Driver Issues

```bash
# Error: "Message: 'chromedriver' executable needs to be in PATH"
# Solution: Verify driver installation
which chromedriver  # Should return path

# If not found, reinstall:
# 1. Check browser version
google-chrome --version  # or microsoft-edge --version

# 2. Download exact matching driver version
# 3. Place in /usr/local/bin/ and chmod +x

# Alternative: Specify driver path in settings
CHROMEDRIVER_PATH=/path/to/chromedriver
```

### Database Connection Errors

```python
# Error: "Can't connect to MySQL server"
# Run these checks from a shell first:
#   sudo systemctl status mysql                            # is the MySQL service running?
#   mysql -u hotsearch_user -p -h localhost hotsearch_db   # do the credentials work?
#   file -I .env   # macOS; use `file -i .env` on Linux
#                  # should show charset=utf-8 or us-ascii (UTF-8 without BOM)

# Then test the connection in Python
import os

import pymysql
try:
    conn = pymysql.connect(
        host=os.getenv('MYSQL_HOST'),
        user=os.getenv('MYSQL_USER'),
        password=os.getenv('MYSQL_PASSWORD'),
        database=os.getenv('MYSQL_DATABASE')
    )
    print("Connection successful")
except Exception as e:
    print(f"Error: {e}")
```

### Crawler Rate Limiting

```python
# Error: HTTP 429 or blocked requests
# Solution: Adjust crawler settings

# In hotsearchcrawler/settings.py:
CONCURRENT_REQUESTS = 8  # Reduce from 16
DOWNLOAD_DELAY = 2  # Increase delay

# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Rotate User-Agents (requires the scrapy-user-agents package:
# pip install scrapy-user-agents)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```

### LLM API Timeouts

```python
# Error: Request timeout or rate limit
# Solution: Implement retry logic and fallback

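# Requires the tenacity package: pip install tenacity
# `llm` is an LLMClient instance, configured as shown under
# "Custom LLM Integration".
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(content):
    return llm.analyze(content=content, task='sentiment_and_summary', language='zh')

# Use batch processing for large datasets
from hotsearch_analysis_agent.batch_processor import BatchProcessor

processor = BatchProcessor(batch_size=10, delay=2)
# news_items is the list of items to analyze
results = processor.process_items(news_items, call_llm_with_retry)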
```

### Memory Issues with Large Datasets

```python
# Error: MemoryError or slow processing
# Solution: Use pagination and streaming

from hotsearch_analysis_agent.db_client import DBClient

db = DBClient()

# Stream results instead of loading all at once
for batch in db.stream_hot_searches(batch_size=100):
    process_batch(batch)
    # Process and discard to free memory

# Use database aggregation instead of in-memory
aggregated = db.aggregate_by_platform(
    start_date='2026-01-01',
    end_date='2026-05-01'
)
```
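
The streaming idea above can be sketched with the standard DB-API `fetchmany` call. This is an illustration of the pattern, not the `DBClient` implementation: the helper name `stream_rows` is hypothetical, and an in-memory SQLite database stands in for MySQL so the example is self-contained.

```python
import sqlite3


def stream_rows(cursor, batch_size=100):
    """Yield lists of rows from a DB-API cursor, one batch at a time."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Demo data: 250 rows in a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hot_search_items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO hot_search_items (title) VALUES (?)",
    [(f"topic {i}",) for i in range(250)],
)

cur = conn.execute("SELECT id, title FROM hot_search_items ORDER BY id")
batch_sizes = [len(batch) for batch in stream_rows(cur, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
conn.close()
```

With PyMySQL, the default cursor buffers the whole result set on the client, so this pattern only saves memory when paired with an unbuffered cursor such as `pymysql.cursors.SSCursor`.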

## Project Structure Reference

```
.
├── app.py                          # Main application entry
├── hotsearch_analysis_agent/       # Analysis system
│   ├── analyzer.py                 # Core analysis logic
│   ├── llm_client.py              # LLM integration
│   ├── report_generator.py        # Report generation
│   ├── push_service.py            # Notification service
│   └── scheduler.py               # Task scheduling
├── hotsearchcrawler/              # Crawler cluster
│   ├── spiders/                   # Platform-specific spiders
│   ├── settings.py                # Crawler settings
│   └── run_spiders.py            # Crawler launcher
├── test_push_task.py              # Push notification testing
├── runspider-test.py              # Single crawler testing
├── init.py                        # Database initialization
├── requirements.txt               # Python dependencies
└── .env                          # Environment configuration
```

## Best Practices

1. **Database Indexing**: Keep the `platform` and `crawl_time` indexes from the schema above; if you frequently query by `title`, add an index on it as well (the example schema does not create one)
2. **LLM Cost Management**: Cache analysis results to avoid redundant API calls
3. **Crawler Politeness**: Respect platform rate limits and robots.txt
4. **Notification Throttling**: Implement cooldown periods to avoid alert fatigue
5. **Data Retention**: Set up automatic archival for data older than 90 days
6. **Model Choice**: Consider Huawei Pangu for better Chinese language understanding and local deployment
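
As an illustration of item 2, the sketch below caches analysis results in memory, keyed by a hash of the task and content, so identical text is only sent to the LLM once. It is a minimal sketch under assumptions: the class name `AnalysisCache` is hypothetical, the wrapped function is assumed to take `(content, task)`, and a real deployment would likely want a persistent store and an expiry policy.

```python
import hashlib


class AnalysisCache:
    """Hypothetical in-memory cache around an analysis function."""

    def __init__(self, analyze_func):
        self._analyze = analyze_func
        self._store = {}
        self.misses = 0  # number of times the wrapped function was called

    def analyze(self, content, task):
        # Key on both task and content so different tasks do not collide.
        key = hashlib.sha256(f"{task}\n{content}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._analyze(content, task)
        return self._store[key]


def fake_llm(content, task):
    # Stand-in for a real LLM call, used only for this demo.
    return {"task": task, "length": len(content)}


cache = AnalysisCache(fake_llm)
first = cache.analyze("人工智能新闻", "sentiment")
second = cache.analyze("人工智能新闻", "sentiment")  # served from the cache
print(cache.misses)  # 1
```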

Source

Creator's repository · aradotso/data-skills
