Master the art of ethical data collection in the post-API era. Learn to navigate legal boundaries and collect data responsibly.
GET https://api.twitter.com/2/tweets/search/recent?query=climate%20change&max_results=100
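In code, building that request URL took only a few lines. A sketch using just the standard library (no token or network call shown):

```python
from urllib.parse import urlencode, quote

# Build the recent-search URL shown above. quote_via=quote encodes
# the space as %20 instead of urlencode's default "+".
params = {"query": "climate change", "max_results": 100}
url = (
    "https://api.twitter.com/2/tweets/search/recent?"
    + urlencode(params, quote_via=quote)
)
print(url)
```

Actually sending it required an OAuth bearer token, which is exactly why losing free API access locked researchers out overnight.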
For over a decade, social platforms offered free APIs:
Result: Thousands of researchers lost access overnight
When you need digital data, you have four options:
✅ Legal, reliable, documented
❌ Expensive ($100-$42,000/month)
Best for: Companies with budgets
✅ Free, pre-collected, high quality
❌ Limited topics, may be outdated
Best for: Students, researchers
Sources: Kaggle, GitHub, universities
✅ Free, flexible, you control it
❌ Slower, setup required, legal gray area
Best for: Small projects, specific needs
✅ Easier than DIY, more affordable
❌ Still costs money, less control
Examples: Apify, Bright Data, ScraperAPI
Why? Because it's:
Website shows: "Mario's Pizza - ⭐⭐⭐⭐⭐ - 'Best pizza in town!'"
HTML code:
<div class="restaurant">
  <h3>Mario's Pizza</h3>
  <span class="rating">5 stars</span>
  <p class="review">Best pizza in town!</p>
</div>
Scraper extracts:
| Name | Rating | Review |
|---|---|---|
| Mario's Pizza | 5 | Best pizza in town! |
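Under the hood, that extraction step looks something like this. Real projects usually use a third-party parser like BeautifulSoup; the sketch below sticks to the standard library's `html.parser` so it runs anywhere, and the tag-to-column mapping is specific to this made-up snippet:

```python
from html.parser import HTMLParser

# Minimal sketch: pull name, rating, and review out of the
# restaurant card shown above, using only the standard library.
class RestaurantParser(HTMLParser):
    FIELDS = {"h3": "name", "span": "rating", "p": "review"}

    def __init__(self):
        super().__init__()
        self.row = {}
        self._current = None  # column we're currently collecting text for

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._current = self.FIELDS[tag]

    def handle_endtag(self, tag):
        if tag in self.FIELDS:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.row[self._current] = data.strip()

html = """
<div class="restaurant">
  <h3>Mario's Pizza</h3>
  <span class="rating">5 stars</span>
  <p class="review">Best pizza in town!</p>
</div>
"""

parser = RestaurantParser()
parser.feed(html)
print(parser.row)
```

Browser extensions do exactly this for you: you point at an element, they figure out the tag/class pattern, and they repeat the extraction across pages.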
No coding required (we'll use browser extensions)
Web scraping legality varies by:
Research recent cases:
What did you learn about scraping legality?
Look at these 3 scenarios:
For each:
Which scenarios are legally risky? Why?
Even if something is LEGAL, is it ETHICAL?
Consider:
When does "it's public data" NOT make scraping ethical?
Create your own "Ethical Scraping Checklist":
Before scraping any website, I will:
Add 3 more rules to this checklist:
You can skip this checkpoint, but you won't earn the 20 XP reward.
Every website can publish a robots.txt file that says:
Add /robots.txt to any domain:
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10
User-agent: Googlebot
Allow: /
If robots.txt says Disallow, DON'T scrape it.
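Python's standard library can check these rules for you. The sketch below parses the example file above directly (instead of fetching a live one) using `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# "MyScraper" matches the wildcard (*) entry:
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # False
print(rp.can_fetch("MyScraper", "https://example.com/menu"))          # True
print(rp.crawl_delay("MyScraper"))                                    # 10
```

For a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`.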
Respecting robots.txt is the bare minimum of ethical scraping.
We're going to check real websites' robots.txt files.
Open: https://www.reddit.com/robots.txt
What does Reddit's robots.txt say about crawl-delay?
Scenario: You want to scrape LinkedIn job postings.
Check: https://www.linkedin.com/robots.txt
What do you find?
The easiest way to scrape: Browser extensions
Visit the Chrome Web Store or Firefox Add-ons and install the "Web Scraper" extension
Sample URL: https://www.yelp.com/search?find_desc=pizza&find_loc=New+York
Click extension icon → "Create new sitemap"
Point and click on:
See a table with extracted data
Download as CSV
Try scraping 10 coffee shops in your city using one of these tools!
(This is optional practice - you can continue the lesson now)
If you scrape too fast, you can overload the server and get your IP blocked:
❌ Bad: Scraping 1,000 pages per second
✅ Good: Scraping 1 page every 2-3 seconds
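"Good" pacing is nothing more than a randomized pause between requests. A minimal sketch, where `fake_fetch` is a stand-in for a real HTTP call (e.g. `requests.get`):

```python
import random
import time

def polite_crawl(urls, fetch, delay_range=(2.0, 3.0)):
    """Fetch each URL, pausing a random 2-3 seconds between requests."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(random.uniform(*delay_range))  # be a good citizen
    return pages

# Placeholder for a real HTTP request, so the sketch runs offline.
def fake_fetch(url):
    return f"<html>{url}</html>"

pages = polite_crawl(
    ["https://example.com/1", "https://example.com/2"],
    fake_fetch,
)
print(len(pages))  # 2
```

If the site's robots.txt declares a `Crawl-delay`, use that value as your minimum instead of the default here.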
Scraping = walking into a library
✅ Good: Taking photos of 10 book pages
❌ Bad: Using a copy machine to duplicate the entire library
Be respectful.
You want to analyze sentiment on Yelp reviews to understand customer preferences.
You plan to scrape 1,000 reviews from 100 restaurants.
Yelp's robots.txt says: Crawl-delay: 1
What do you do?
Your competitor has pricing data on their public website.
Your boss wants you to scrape their prices daily to undercut them.
What do you do?
You want to study misinformation on Facebook public groups.
Facebook's ToS prohibits scraping.
What do you do?
You're scraping Twitter using a third-party tool that bypasses Twitter's rate limits.
The tool uses residential proxies to avoid detection.
What do you do?
Web scraping shifts power between:
Who should control digital behavioral data?
Think about:
Compare two scenarios:
PhD student scrapes 1M tweets for dissertation on social movements.
Twitter has a free API. Legal, ethical, celebrated.
Same PhD student, same research goal.
Twitter API costs $42,000/month.
The student can't afford it. They use a scraper, violating the ToS.
What changed?
Is it ethical for platforms to paywall academic research?
Consider the perspectives:
Who's right?
Design a better system:
How should platforms balance:
Propose a solution:
You can skip this checkpoint, but you won't earn the 20 XP reward.
| Advantages | Disadvantages |
|---|---|
| ✅ No legal gray area | ❌ Limited to existing topics |
| ✅ Often pre-cleaned | ❌ May be outdated |
| ✅ Citeable (academic credibility) | ❌ Less control over what's included |
Just because you CAN scrape doesn't mean you SHOULD.
Consider:
Be a responsible data collector.
Examples: Coffee shops in your city, movie reviews, tech news headlines
Yelp, Google Maps, IMDB, Reddit, news sites
Add /robots.txt to the domain and review the rules
Extract at least 50 data points
Create a CSV with at least 3 columns
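Writing that CSV takes only the standard library. The rows below are invented placeholders; yours will come from whatever you scraped:

```python
import csv

# Hypothetical scraped rows: three columns, as the exercise requires.
rows = [
    {"name": "Cafe Uno", "rating": "4 stars", "review": "Great espresso"},
    {"name": "Bean There", "rating": "5 stars", "review": "Cozy and quiet"},
]

with open("coffee_shops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "rating", "review"])
    writer.writeheader()
    writer.writerows(rows)
```

`newline=""` is the documented way to open CSV files for writing in Python; it prevents blank lines on Windows.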
Next lesson (Text Analytics), you'll analyze sentiment on your collected data.
+30 XP for completing this exercise (tracked outside this lesson)
This exercise is optional. You can skip it and continue to completion.
You now know how to ethically collect data in the post-API era. You understand legal boundaries, robots.txt, and responsible scraping practices.