Discovering Themes with Topic Modeling
You've collected 10,000 Reddit posts about climate change.
How do you find the main themes?
You can't read them all by hand.
You're analyzing customer feedback for a product launch:
35,000 pieces of text. What are customers actually complaining about?
Option 1: Read everything
⏱️ Time: ~1,200 hours (35,000 documents × 2 minutes each)
❌ Not realistic
Option 2: Random sample
📊 Read 100 random documents, guess themes
⚠️ Might miss rare but important issues
What we need: a method that automatically discovers themes without manual reading or pre-defined categories.
Enter: Topic Modeling
Topic modeling is a machine learning technique that automatically discovers themes (topics) in a collection of documents.
A cluster of words like "battery", "drain", "charging" → likely about battery issues
A cluster of words like "screen", "cracked", "glass" → likely about screen damage
The magic: The algorithm didn't know these categories existed. It discovered them from patterns in the text.
📰 News Analysis
"What topics dominated news coverage in 2025?"
📱 Social Media Research
"What are people discussing about my brand?"
🔬 Academic Research
"What are the major themes in 20 years of climate research?"
You have lots of text, you want to know what themes exist, and you don't want to read everything manually.
We'll learn two topic modeling approaches:
Latent Dirichlet Allocation
The "classic" method, still widely used
How it works: models each document as a mixture of topics, and each topic as a probability distribution over words
Pros: Fast, interpretable, battle-tested
Cons: Ignores word order, requires choosing # of topics
BERTopic
Modern deep learning approach
State-of-the-art in 2026
How it works: embeds documents, clusters the embeddings, then extracts keywords for each cluster
Pros: More accurate, understands context, auto-determines topics
Cons: Slower, requires more setup
We'll demo both so you can choose!
1,000 Reddit posts from r/technology
Topics covered: AI, smartphones, privacy, electric vehicles, social media...
Goal: Let LDA discover these topics automatically
Clean text, remove stop words ("the", "is", "and"), create bag-of-words
You decide: "I want to find 8 topics"
(This is a limitation - you must guess in advance)
Iteratively assigns words to topics based on co-occurrence patterns
8 topics, each represented by top 10 words
Topics are labeled as "Topic 1", "Topic 2", etc.
YOU must interpret what each topic means by looking at the words.
Here's what LDA discovered from 1,000 r/technology posts (K=8):
Human interpretation: AI/LLMs
Human interpretation: Smartphones
Human interpretation: Privacy concerns
LDA found coherent topics without being told what to look for.
What is LDA's main input that you must decide in advance?
How does LDA represent topics?
What is a major limitation of LDA?
BERTopic uses embeddings - numerical representations that capture word meaning and context.
1. Understands Context
LDA sees "bank" as just a word count and can't tell whether it means a financial institution or a riverbank.
BERTopic uses the surrounding context to tell the two apart.
2. Auto-Determines # of Topics
LDA: You choose K=8
BERTopic: "I found 12 distinct topics in your data"
3. Better Topic Coherence
Topics make more sense semantically because embeddings capture meaning.
Convert each document to a 768-dimensional vector (using BERT or similar)
Use UMAP to reduce 768 dimensions to 2D (for visualization/clustering)
Use HDBSCAN to find natural clusters = topics
Use c-TF-IDF to find words that best represent each cluster
Running BERTopic on the same 1,000 r/technology posts:
More specific than LDA's "AI" - focuses on LLMs specifically
More coherent - focuses on iPhone specifically, not all smartphones
Specific to Facebook privacy issues, not general privacy
Topics are still just word lists. Humans must interpret them.
Next: Can we automate the labeling too?
Run topic modeling on this scenario:
Dataset: 5,000 customer support tickets
Result: BERTopic finds 8 clean topics (billing, shipping, returns, etc.)
But: 500 tickets (10%) don't fit any topic cleanly - assigned to "Outliers"
What happens to these 500 "outlier" tickets?
Consider this scenario:
A customer writes: "The product is great, but the packaging uses too much plastic and I'm concerned about sustainability."
Which topic does this fit?
Topic modeling will force this into ONE category. What nuance gets lost?
Who decides the "right" number of topics?
How does the choice of K shape what you "discover" in the data?
Design a topic modeling system that preserves nuance:
How would you handle:
Propose a solution that balances categorization (for scale) with nuance (for accuracy).
Both LDA and BERTopic give you word lists:
Topic 1: chatgpt, openai, gpt-4, llm, claude, hallucination
YOU must interpret: "This is probably about large language models"
Use LLMs (GPT-4, Claude) to automatically label topics!
BERTopic gives you: ["chatgpt", "openai", "gpt-4", "llm", "claude"]
Pull 5-10 representative documents from this topic
"Based on these keywords and sample documents, provide a concise 2-4 word label for this topic."
"Large Language Models"
BERTopic discovers topics → LLM labels them → You get human-readable results
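The labeling step might look like the sketch below. Only the prompt assembly is concrete; `call_llm` is a hypothetical stand-in for whichever client (OpenAI, Anthropic) you actually use:

```python
# Build the topic-labeling prompt from keywords plus sample documents.
def build_label_prompt(keywords, sample_docs):
    """Assemble the labeling prompt sent to the LLM."""
    lines = [
        "Based on these keywords and sample documents, provide a",
        "concise 2-4 word label for this topic.",
        "Keywords: " + ", ".join(keywords),
        "Sample documents:",
    ]
    lines += [f"- {doc}" for doc in sample_docs]
    return "\n".join(lines)

prompt = build_label_prompt(
    ["chatgpt", "openai", "gpt-4", "llm", "claude"],
    ["ChatGPT passed another benchmark today", "Claude handles long documents well"],
)
# label = call_llm(prompt)  # hypothetical LLM client call
# expected style of answer: "Large Language Models"
```

Including 5-10 representative documents (not just keywords) is what keeps the LLM from guessing a label from ambiguous words alone.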
Here's BERTopic output before and after LLM labeling:
| BERTopic Keywords | LLM Label |
|---|---|
| chatgpt, openai, gpt-4, llm, claude | Large Language Models |
| iphone 16, apple, release, camera, specs | iPhone Product Launch |
| facebook, meta, zuckerberg, tracking, lawsuit | Facebook Privacy Scandals |
| tesla, ev, electric, autopilot, musk | Tesla & Electric Vehicles |
| tiktok, ban, congress, national security | TikTok Ban Debate |
Now anyone can understand the topics without needing to interpret keywords.
Question: How accurate are LLM labels?
Answer: Spot-check them against labels written by humans for a sample of topics.
In practice, good enough for most use cases!
News aggregators (Google News, Apple News, Flipboard) use topic modeling to cluster stories.
Scenario:
Google News analyzes 100,000 articles daily and groups them into topics.
Homepage shows: Top 10 topics (based on # of articles)
You only see: What the algorithm decided were the "main topics"
Which topics get prominence? Which get buried?
Topic modeling can shape agenda setting - what people think is important.
Volume vs. Importance:
Example:
2025 Twitter: 1M tweets about Taylor Swift's tour
2025 Twitter: 10K tweets about new climate report
Topic modeling shows: "Taylor Swift" is a 100x bigger topic
Does this reflect what's actually newsworthy?
Can topic modeling shape what people think is "newsworthy"?
Feedback loop:
Does this amplify existing biases in news coverage?
Design a transparent topic discovery system for journalism:
How would you balance:
Propose a topic ranking system that goes beyond simple volume.
You're given 50 Reddit post titles. How many distinct topics do you think there are?
Supervised classifiers need labeled training data
Problem: Labeling 10,000 documents by hand is slow
Topic Modeling → LLM Labeling → Supervised Classifier
Run BERTopic on 10,000 unlabeled documents
→ Discovers 8 topics automatically
Use GPT-4/Claude to label the 8 topics
→ Get human-readable labels
Assign each document to its most likely topic
→ Now you have 10,000 labeled examples!
Train a fast classifier (logistic regression, random forest)
→ Can classify new documents in real-time
Use classifier for new data (fast, cheap)
Periodically re-run topic modeling to find new emerging topics
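Steps 4-6 in miniature: the topic assignments below are hard-coded to stand in for BERTopic + LLM output, and TF-IDF with logistic regression is one common choice of fast classifier, not the only one:

```python
# Train a fast classifier on topic-model-generated labels,
# then use it to classify new documents cheaply in real time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "charged twice on my invoice this month",
    "refund still not processed after two weeks",
    "package arrived a week late",
    "tracking number never updated",
]
# Stand-in for step 1-3 output: each doc assigned to its labeled topic
labels = ["billing", "billing", "shipping", "shipping"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(docs, labels)

# Step 5: new documents get classified without re-running topic modeling
print(clf.predict(["i was charged twice for one order"]))
```

The split matters for cost: the expensive steps (embeddings, LLM calls) run once per retraining cycle, while the cheap classifier handles every incoming document.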
This is how professionals do it in 2026!
✅ Topic Modeling
Automatically discovers themes in large text collections without manual reading or pre-defined categories
✅ Two Methods
✅ LLM Labeling
Use GPT-4/Claude to automatically generate human-readable topic labels from keywords
✅ Production Workflow
Topic Modeling → LLM Labeling → Supervised Classifier
This is the 2026 best practice!
⚠️ What gets lost in categorization?
Documents forced into single topics, outliers ignored, nuance reduced
⚠️ Algorithmic agenda setting
Volume ≠ importance. Topic ranking shapes what we think is newsworthy.
You've learned to discover themes in large text collections and think critically about categorization.