llm-intelligent-public-opinion-analytics

Deploy and use an LLM-powered public opinion analytics assistant that crawls 26 hot lists from 15 platforms, performs sentiment analysis, topic clustering, and multi-channel alerting

Skill file

Preview skill file↓↑

---
name: llm-intelligent-public-opinion-analytics
description: Deploy and use an LLM-powered public opinion analytics assistant that crawls 26 hot lists from 15 platforms, performs sentiment analysis, topic clustering, and multi-channel alerting
triggers:
  - set up public opinion monitoring system
  - analyze social media trending topics
  - deploy sentiment analysis crawler
  - configure hot topic push notifications
  - cluster trending news topics
  - monitor multiple platform hot searches
  - build opinion analytics dashboard
  - aggregate cross-platform trending content
---

# LLM-Based Intelligent Public Opinion Analytics Assistant

> Skill by [ara.so](https://ara.so) — Data Skills collection.

## Overview

This project is a comprehensive public opinion analytics platform that combines real-time data from **26 hot lists across 15 mainstream platforms** (Weibo, Bilibili, Zhihu, Baidu, etc.) with large language model (LLM) analysis capabilities. It provides conversational query interfaces for hot searches, topic clustering, sentiment analysis, and multi-channel push notifications (WeChat, Email, Telegram).

**Key Capabilities:**
- Real-time crawler cluster for 15+ platforms
- LLM-powered content analysis (including video content extraction)
- Natural language query interface
- Topic clustering and sentiment analysis
- Multi-channel alert system (Email, WeChat Work, Telegram)
- Keyboard shortcuts for crawler control

## Installation

### Prerequisites

1. **Browser Driver Setup** (Required for detail page scraping):

```bash
# Check your Chrome/Edge version first
# Chrome: chrome://settings/help
# Edge: edge://settings/help

# Download matching driver:
# ChromeDriver: https://chromedriver.chromium.org/
# EdgeDriver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

# Linux/macOS - place driver in PATH:
sudo mv chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver

# Verify installation:
chromedriver --version
```

2. **MySQL Database**:

```bash
# Install MySQL 8.0+
# Create database and user
mysql -u root -p

CREATE DATABASE hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hotsearch_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'hotsearch_user'@'localhost';
FLUSH PRIVILEGES;
```

3. **Python Environment**:

```bash
# Clone repository
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Database Initialization

Reference the `init.py` file to create necessary tables:

```python
# Example table structure (adapt from init.py)
import pymysql

connection = pymysql.connect(
    host='localhost',
    user='hotsearch_user',
    password='your_password',
    database='hotsearch_db',
    charset='utf8mb4'
)

cursor = connection.cursor()

# Hot search items table
cursor.execute("""
CREATE TABLE IF NOT EXISTS hot_search_items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    platform VARCHAR(50) NOT NULL,
    rank INT,
    title VARCHAR(500) NOT NULL,
    url VARCHAR(1000),
    heat_value VARCHAR(100),
    crawl_time DATETIME NOT NULL,
    detail_content TEXT,
    sentiment VARCHAR(20),
    INDEX idx_platform (platform),
    INDEX idx_crawl_time (crawl_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
""")

connection.commit()
connection.close()
```

## Configuration

### Environment Variables

Create `.env` file in the project root:

```bash
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=hotsearch_db

# LLM API Configuration (OpenAI-compatible format)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://your-llm-endpoint.com/v1
OPENAI_MODEL=gpt-4

# Huawei Pangu Model (recommended alternative)
PANGU_API_KEY=your_pangu_key
PANGU_API_BASE=https://pangu-api.huaweicloud.com

# Push Notification Channels
# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password

# WeChat Work Bot
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY

# WeChat Work Application
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_APP_SECRET=your_app_secret
WECHAT_WORK_AGENT_ID=your_agent_id

# Telegram Bot
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
```

### Crawler Settings

Edit `hotsearchcrawler/settings.py`:

```python
# MySQL Connection Pool
MYSQL_CONFIG = {
    'host': os.getenv('MYSQL_HOST', 'localhost'),
    'port': int(os.getenv('MYSQL_PORT', 3306)),
    'user': os.getenv('MYSQL_USER'),
    'password': os.getenv('MYSQL_PASSWORD'),
    'database': os.getenv('MYSQL_DATABASE'),
    'charset': 'utf8mb4',
    'autocommit': True
}

# Optional: Platform-specific cookies for authenticated access
PLATFORM_COOKIES = {
    'weibo': 'your_weibo_cookies',  # Optional, for better access
    'bilibili': 'your_bilibili_cookies'
}

# Concurrent requests
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# User-Agent rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
]
```

## Usage

### Starting the System

```bash
# Activate virtual environment
source venv/bin/activate

# Start the main application (web interface + API)
python app.py

# Access web interface at http://localhost:5000
```

### Crawler Management

```python
# Manual crawler test (single platform)
cd hotsearchcrawler
python runspider-test.py

# Start all crawlers (typically triggered via web UI)
python run_spiders.py
```

**Via Web Interface:**
- Use keyboard shortcuts to start/stop crawlers
- View real-time crawling status
- Monitor data collection metrics

### Natural Language Queries

```python
# Examples of conversational queries via web interface:

# "Show me today's top 10 trending topics on Weibo"
# "What's trending about AI technology across all platforms?"
# "Analyze sentiment for news about electric vehicles"
# "Cluster topics related to economic policy"
# "Compare hot topics between Bilibili and Zhihu"
```

### Programmatic API Usage

```python
from hotsearch_analysis_agent.analyzer import OpinionAnalyzer
from datetime import datetime, timedelta

# Initialize analyzer
analyzer = OpinionAnalyzer()

# Query hot searches
results = analyzer.query_hot_searches(
    platforms=['weibo', 'zhihu', 'bilibili'],
    time_range=(datetime.now() - timedelta(hours=24), datetime.now()),
    keyword='人工智能'
)

# Perform sentiment analysis
sentiment = analyzer.analyze_sentiment(results)
print(f"Overall sentiment: {sentiment['overall']}")
print(f"Positive: {sentiment['positive_ratio']}%")

# Topic clustering
clusters = analyzer.cluster_topics(results, num_clusters=5)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster['keywords']}")
    print(f"  Items: {len(cluster['items'])}")
```

### Push Notification Setup

```python
from hotsearch_analysis_agent.push_service import PushService

# Initialize push service
push_service = PushService()

# Create scheduled push task
task = push_service.create_task(
    name="AI Technology Daily Report",
    keywords=['人工智能', '大模型', '机器学习'],
    platforms=['weibo', 'zhihu', 'bilibili'],
    schedule='0 8,12,18 * * *',  # Cron format: 8am, 12pm, 6pm daily
    channels=['wechat_work', 'email'],
    threshold={'heat_value': 100000, 'sentiment': 'positive'}
)

# Test push task
python test_push_task.py
```

### Analysis Report Generation

```python
from hotsearch_analysis_agent.report_generator import ReportGenerator

generator = ReportGenerator()

# Generate comprehensive report
report = generator.generate_report(
    topic="人工智能与前沿科技",
    time_range=(datetime.now() - timedelta(days=7), datetime.now()),
    include_sentiment=True,
    include_clustering=True,
    include_trend_analysis=True
)

# Report includes:
# - Core findings with data highlights
# - Detailed news content with source URLs
# - Sentiment distribution
# - Topic clusters
# - Trend analysis
# - Information spread characteristics

# Save report
report.save_markdown('output/ai_tech_report.md')
report.save_pdf('output/ai_tech_report.pdf')
```

## Common Patterns

### Multi-Platform Data Aggregation

```python
from hotsearch_analysis_agent.aggregator import DataAggregator

aggregator = DataAggregator()

# Fetch and merge data from multiple platforms
merged_data = aggregator.aggregate(
    platforms=['weibo', 'douyin', 'zhihu', 'bilibili', 'baidu'],
    dedup_threshold=0.8,  # Similarity threshold for deduplication
    sort_by='heat_value',
    limit=50
)

# Cross-platform topic correlation
correlations = aggregator.find_correlations(merged_data)
print(f"Found {len(correlations)} cross-platform trending topics")
```

### Video Content Analysis

```python
# The system automatically extracts text from video news
# using browser automation and LLM analysis

from hotsearch_analysis_agent.video_analyzer import VideoAnalyzer

video_analyzer = VideoAnalyzer()

# Analyze video-based hot topics (e.g., from Bilibili, Douyin)
video_topics = video_analyzer.extract_content(
    url='https://www.bilibili.com/video/BV13pSoBBEvX/',
    extract_comments=True,
    max_comments=100
)

print(f"Video title: {video_topics['title']}")
print(f"Description: {video_topics['description']}")
print(f"Top comments sentiment: {video_topics['comments_sentiment']}")
```

### Custom LLM Integration

```python
from hotsearch_analysis_agent.llm_client import LLMClient

# Use Huawei Pangu Model (recommended)
llm = LLMClient(
    api_base=os.getenv('PANGU_API_BASE'),
    api_key=os.getenv('PANGU_API_KEY'),
    model='pangu-embedded-7b'
)

# Or use any OpenAI-compatible endpoint
llm = LLMClient(
    api_base=os.getenv('OPENAI_API_BASE'),
    api_key=os.getenv('OPENAI_API_KEY'),
    model='gpt-4'
)

# Analyze custom content
analysis = llm.analyze(
    content=news_content,
    task='sentiment_and_summary',
    language='zh'
)
```

### Scheduled Monitoring

```python
from hotsearch_analysis_agent.scheduler import MonitorScheduler

scheduler = MonitorScheduler()

# Add monitoring rule
scheduler.add_rule(
    name="Tech Company Crisis Monitoring",
    keywords=['某公司', '丑闻', '争议'],
    alert_conditions={
        'heat_spike': 2.0,  # 2x normal heat
        'sentiment_drop': -0.3,  # 30% sentiment decrease
        'platforms_count': 3  # Trending on 3+ platforms
    },
    notification_channels=['wechat_work', 'telegram', 'email'],
    urgent=True
)

# Start scheduler
scheduler.start()
```

## Troubleshooting

### Browser Driver Issues

```bash
# Error: "Message: 'chromedriver' executable needs to be in PATH"
# Solution: Verify driver installation
which chromedriver  # Should return path

# If not found, reinstall:
# 1. Check browser version
google-chrome --version  # or microsoft-edge --version

# 2. Download exact matching driver version
# 3. Place in /usr/local/bin/ and chmod +x

# Alternative: Specify driver path in settings
CHROMEDRIVER_PATH=/path/to/chromedriver
```

### Database Connection Errors

```python
# Error: "Can't connect to MySQL server"
# Check MySQL service
sudo systemctl status mysql

# Verify credentials
mysql -u hotsearch_user -p -h localhost hotsearch_db

# Check .env file encoding (must be UTF-8 without BOM)
file -I .env  # Should show charset=utf-8

# Test connection in Python
import pymysql
try:
    conn = pymysql.connect(
        host=os.getenv('MYSQL_HOST'),
        user=os.getenv('MYSQL_USER'),
        password=os.getenv('MYSQL_PASSWORD'),
        database=os.getenv('MYSQL_DATABASE')
    )
    print("Connection successful")
except Exception as e:
    print(f"Error: {e}")
```

### Crawler Rate Limiting

```python
# Error: HTTP 429 or blocked requests
# Solution: Adjust crawler settings

# In hotsearchcrawler/settings.py:
CONCURRENT_REQUESTS = 8  # Reduce from 16
DOWNLOAD_DELAY = 2  # Increase delay

# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

# Rotate User-Agents and proxies
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```

### LLM API Timeouts

```python
# Error: Request timeout or rate limit
# Solution: Implement retry logic and fallback

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt):
    return llm.analyze(prompt)

# Use batch processing for large datasets
from hotsearch_analysis_agent.batch_processor import BatchProcessor

processor = BatchProcessor(batch_size=10, delay=2)
results = processor.process_items(news_items, analyze_func)
```

### Memory Issues with Large Datasets

```python
# Error: MemoryError or slow processing
# Solution: Use pagination and streaming

from hotsearch_analysis_agent.db_client import DBClient

db = DBClient()

# Stream results instead of loading all at once
for batch in db.stream_hot_searches(batch_size=100):
    process_batch(batch)
    # Process and discard to free memory

# Use database aggregation instead of in-memory
aggregated = db.aggregate_by_platform(
    start_date='2026-01-01',
    end_date='2026-05-01'
)
```

## Project Structure Reference

```
.
├── app.py                          # Main application entry
├── hotsearch_analysis_agent/       # Analysis system
│   ├── analyzer.py                 # Core analysis logic
│   ├── llm_client.py              # LLM integration
│   ├── report_generator.py        # Report generation
│   ├── push_service.py            # Notification service
│   └── scheduler.py               # Task scheduling
├── hotsearchcrawler/              # Crawler cluster
│   ├── spiders/                   # Platform-specific spiders
│   ├── settings.py                # Crawler settings
│   └── run_spiders.py            # Crawler launcher
├── test_push_task.py              # Push notification testing
├── runspider-test.py              # Single crawler testing
├── init.py                        # Database initialization
├── requirements.txt               # Python dependencies
└── .env                          # Environment configuration
```

## Best Practices

1. **Database Indexing**: Ensure indexes on `platform`, `crawl_time`, and `title` columns for fast queries
2. **LLM Cost Management**: Cache analysis results to avoid redundant API calls
3. **Crawler Politeness**: Respect platform rate limits and robots.txt
4. **Notification Throttling**: Implement cooldown periods to avoid alert fatigue
5. **Data Retention**: Set up automatic archival for data older than 90 days
6. **Model Choice**: Consider Huawei Pangu for better Chinese language understanding and local deployment

Source

Creator's repository · aradotso/data-skills

View on GitHub ↗

Security

Security checks in progress

Results will appear here once audits complete

What this skill can do

Reads your filesConnects to the internetRuns code on your machine

Checked by 3 independent security firms

Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub

Does it sneak in hidden code?Not yet checkedPending · Socket

Does it have known bugs?Not yet checkedPending · Snyk