Discovering Themes with Topic Modeling
You've collected 10,000 Reddit posts about climate change.
How do you find the main themes?
You can't read them all by hand.
You're analyzing customer feedback for a product launch:
35,000 pieces of text. What are customers actually complaining about?
Option 1: Read everything
⏱️ Time: ~1,200 hours (35,000 documents × 2 minutes each)
❌ Not realistic
Option 2: Random sample
📊 Read 100 random documents, guess themes
⚠️ Might miss rare but important issues
What we need: a method that automatically discovers themes without manual reading or pre-defined categories.
Enter: Topic Modeling
Topic modeling is a machine learning technique that automatically discovers themes (topics) in a collection of documents.
A cluster of words like "battery", "drain", "charging" → likely about battery issues
A cluster of words like "screen", "cracked", "glass" → likely about screen damage
The magic: The algorithm didn't know these categories existed. It discovered them from patterns in the text.
📰 News Analysis
"What topics dominated news coverage in 2025?"
📱 Social Media Research
"What are people discussing about my brand?"
🔬 Academic Research
"What are the major themes in 20 years of climate research?"
You have lots of text, you want to know what themes exist, and you don't want to read everything manually.
We'll learn two topic modeling approaches:
Latent Dirichlet Allocation
The "classic" method, still widely used
How it works: models each document as a mixture of topics, and each topic as a probability distribution over words
Pros: Fast, interpretable, battle-tested
Cons: Ignores word order, requires choosing # of topics
BERTopic
Modern deep learning approach
State-of-the-art in 2026
How it works: embeds documents, clusters the embeddings, then extracts keywords for each cluster
Pros: More accurate, understands context, auto-determines topics
Cons: Slower, requires more setup
We'll demo both so you can choose!
1,000 Reddit posts from r/technology
Topics covered: AI, smartphones, privacy, electric vehicles, social media...
Goal: Let LDA discover these topics automatically
Clean text, remove stop words ("the", "is", "and"), create bag-of-words
You decide: "I want to find 8 topics"
(This is a limitation - you must guess in advance)
Iteratively assigns words to topics based on co-occurrence patterns
8 topics, each represented by top 10 words
Topics are labeled as "Topic 1", "Topic 2", etc.
YOU must interpret what each topic means by looking at the words.
Here's what LDA discovered from 1,000 r/technology posts (K=8):
Human interpretation: AI/LLMs
Human interpretation: Smartphones
Human interpretation: Privacy concerns
LDA found coherent topics without being told what to look for.
What is LDA's main input that you must decide in advance?
How does LDA represent topics?
What is a major limitation of LDA?
BERTopic uses embeddings - numerical representations that capture word meaning and context.
1. Understands Context
LDA sees "bank" as just a word count and can't tell whether it means a financial institution or a riverbank.
BERTopic uses the surrounding context to tell the two apart.
2. Auto-Determines # of Topics
LDA: You choose K=8
BERTopic: "I found 12 distinct topics in your data"
3. Better Topic Coherence
Topics make more sense semantically because embeddings capture meaning.
Convert each document to a 768-dimensional vector (using BERT or similar)
Use UMAP to reduce 768 dimensions to 2D (for visualization/clustering)
Use HDBSCAN to find natural clusters = topics
Use c-TF-IDF to find words that best represent each cluster
Running BERTopic on the same 1,000 r/technology posts:
More specific than LDA's "AI" - focuses on LLMs specifically
More coherent - focuses on iPhone specifically, not all smartphones
Specific to Facebook privacy issues, not general privacy
Topics are still just word lists. Humans must interpret them.
Next: Can we automate the labeling too?
Run topic modeling on this scenario:
Dataset: 5,000 customer support tickets
Result: BERTopic finds 8 clean topics (billing, shipping, returns, etc.)
But: 500 tickets (10%) don't fit any topic cleanly - assigned to "Outliers"
What happens to these 500 "outlier" tickets?
Consider this scenario:
A customer writes: "The product is great, but the packaging uses too much plastic and I'm concerned about sustainability."
Which topic does this fit?
Topic modeling will force this into ONE category. What nuance gets lost?
Who decides the "right" number of topics?
How does the choice of K shape what you "discover" in the data?
Design a topic modeling system that preserves nuance:
How would you handle:
Propose a solution that balances categorization (for scale) with nuance (for accuracy).
Both LDA and BERTopic give you word lists:
Topic 1: chatgpt, openai, gpt-4, llm, claude, hallucination
YOU must interpret: "This is probably about large language models"
Use LLMs (GPT-4, Claude) to automatically label topics!
BERTopic gives you: ["chatgpt", "openai", "gpt-4", "llm", "claude"]
Pull 5-10 representative documents from this topic
"Based on these keywords and sample documents, provide a concise 2-4 word label for this topic."
"Large Language Models"
BERTopic discovers topics → LLM labels them → You get human-readable results
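The labeling step might look like the sketch below. Only the prompt assembly is concrete; `call_llm` is a hypothetical stand-in for whichever client (OpenAI, Anthropic) you actually use:

```python
# Build the topic-labeling prompt from keywords plus sample documents.
def build_label_prompt(keywords, sample_docs):
    """Assemble the labeling prompt sent to the LLM."""
    lines = [
        "Based on these keywords and sample documents, provide a",
        "concise 2-4 word label for this topic.",
        "Keywords: " + ", ".join(keywords),
        "Sample documents:",
    ]
    lines += [f"- {doc}" for doc in sample_docs]
    return "\n".join(lines)

prompt = build_label_prompt(
    ["chatgpt", "openai", "gpt-4", "llm", "claude"],
    ["ChatGPT passed another benchmark today", "Claude handles long documents well"],
)
# label = call_llm(prompt)  # hypothetical LLM client call
# expected style of answer: "Large Language Models"
```

Including 5-10 representative documents (not just keywords) is what keeps the LLM from guessing a label from ambiguous words alone.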
Here's BERTopic output before and after LLM labeling:
| BERTopic Keywords | LLM Label |
|---|---|
| chatgpt, openai, gpt-4, llm, claude | Large Language Models |
| iphone 16, apple, release, camera, specs | iPhone Product Launch |
| facebook, meta, zuckerberg, tracking, lawsuit | Facebook Privacy Scandals |
| tesla, ev, electric, autopilot, musk | Tesla & Electric Vehicles |
| tiktok, ban, congress, national security | TikTok Ban Debate |
Now anyone can understand the topics without needing to interpret keywords.
Question: How accurate are LLM labels?
Answer: Spot-check them against labels written by humans for a sample of topics.
In practice, good enough for most use cases!
News aggregators (Google News, Apple News, Flipboard) use topic modeling to cluster stories.
Scenario:
Google News analyzes 100,000 articles daily and groups them into topics.
Homepage shows: Top 10 topics (based on # of articles)
You only see: What the algorithm decided were the "main topics"
Which topics get prominence? Which get buried?
Topic modeling can shape agenda setting - what people think is important.
Volume vs. Importance:
Example:
2025 Twitter: 1M tweets about Taylor Swift's tour
2025 Twitter: 10K tweets about new climate report
Topic modeling shows: "Taylor Swift" is a 100x bigger topic
Does this reflect what's actually newsworthy?
Can topic modeling shape what people think is "newsworthy"?
Feedback loop:
Does this amplify existing biases in news coverage?
Design a transparent topic discovery system for journalism:
How would you balance:
Propose a topic ranking system that goes beyond simple volume.
You're given 50 Reddit post titles. How many distinct topics do you think there are?
Supervised classifiers need labeled training data
Problem: Labeling 10,000 documents by hand is slow
Topic Modeling → LLM Labeling → Supervised Classifier
Run BERTopic on 10,000 unlabeled documents
→ Discovers 8 topics automatically
Use GPT-4/Claude to label the 8 topics
→ Get human-readable labels
Assign each document to its most likely topic
→ Now you have 10,000 labeled examples!
Train a fast classifier (logistic regression, random forest)
→ Can classify new documents in real-time
Use classifier for new data (fast, cheap)
Periodically re-run topic modeling to find new emerging topics
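Steps 4-6 in miniature: the topic assignments below are hard-coded to stand in for BERTopic + LLM output, and TF-IDF with logistic regression is one common choice of fast classifier, not the only one:

```python
# Train a fast classifier on topic-model-generated labels,
# then use it to classify new documents cheaply in real time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "charged twice on my invoice this month",
    "refund still not processed after two weeks",
    "package arrived a week late",
    "tracking number never updated",
]
# Stand-in for step 1-3 output: each doc assigned to its labeled topic
labels = ["billing", "billing", "shipping", "shipping"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(docs, labels)

# Step 5: new documents get classified without re-running topic modeling
print(clf.predict(["i was charged twice for one order"]))
```

The split matters for cost: the expensive steps (embeddings, LLM calls) run once per retraining cycle, while the cheap classifier handles every incoming document.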
This is how professionals do it in 2026!
✅ Topic Modeling
Automatically discovers themes in large text collections without manual reading or pre-defined categories
✅ Two Methods
✅ LLM Labeling
Use GPT-4/Claude to automatically generate human-readable topic labels from keywords
✅ Production Workflow
Topic Modeling → LLM Labeling → Supervised Classifier
This is the 2026 best practice!
⚠️ What gets lost in categorization?
Documents forced into single topics, outliers ignored, nuance reduced
⚠️ Algorithmic agenda setting
Volume ≠ importance. Topic ranking shapes what we think is newsworthy.
You've learned to discover themes in large text collections and think critically about categorization.