🔍 Data Hunt - Getting Data When APIs Say No
Unit 0: Digital Foundations


Master the art of ethical data collection in the post-API era. Learn to navigate legal boundaries and collect data responsibly.

🔌 What are APIs?

Definition:
API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate with each other.

Think of it like a restaurant:

  • Kitchen = The platform's database (Twitter, Reddit, etc.)
  • Menu = What data you can request
  • Waiter = The API (takes your request, brings back data)
  • You = Your application/research project

How APIs work (simplified):

  1. You send a request: "Give me the latest 100 tweets about climate change"
  2. API processes it: Checks if you're allowed, validates the request
  3. Server responds: Returns the data in a structured format (usually JSON)
  4. You use the data: Analyze, visualize, or store it

Example Request:

GET https://api.twitter.com/2/tweets/search/recent?query=climate%20change&max_results=100
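In Python, building that request URL might look like the sketch below. It only constructs the URL from the example above; actually sending the request would require an Authorization bearer token and, today, a paid access tier.

```python
from urllib.parse import urlencode, quote

# Build the example search request URL (building only; sending it would
# require authentication and a paid API tier).
base = "https://api.twitter.com/2/tweets/search/recent"
params = {"query": "climate change", "max_results": 100}

url = f"{base}?{urlencode(params, quote_via=quote)}"
print(url)
# https://api.twitter.com/2/tweets/search/recent?query=climate%20change&max_results=100
```

Note how the space in "climate change" is URL-encoded to %20 automatically, matching the raw request shown above.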

Why APIs are powerful:

  • Structured data: Clean, organized, ready to analyze
  • Reliable: Official source, no need to scrape HTML
  • Efficient: Get exactly what you need, nothing more
  • Documented: Clear instructions on how to use it
APIs used to be FREE. Then things changed...

📜 The API Golden Age (2010-2022)

For over a decade, social platforms offered free APIs:

Twitter API (2006-2023):
  • Relatively unrestricted API access, especially for academic researchers
  • Researchers, developers, students could analyze millions of tweets
  • PhD students built entire careers on Twitter data

Reddit API (2008-2023):
  • Free access to posts, comments, voting patterns
  • Anyone could download r/politics or r/science discussions

Why did platforms offer free APIs?

  • Growth strategy - more developers = more users
  • Goodwill with researchers
  • "Open internet" ethos

Then everything changed...

💰 The Great API Paywall (2023-2024)

January 2023: Twitter announces API pricing
  • Free tier: Eliminated
  • Basic tier: $100/month (severely limited)
  • Enterprise tier: $42,000/month

Result: Thousands of researchers lost access overnight

July 2023: Reddit follows suit
  • API pricing: $0.24 per 1,000 API calls
  • Apps like Apollo shut down (couldn't afford $20M/year)

2024: Instagram, TikTok, YouTube tighten API access
  • Stricter rate limits
  • More expensive tiers
  • Academic research exception removed

Why the shift?

  1. AI training gold rush: Platforms realized their data = training gold
  2. Elon Musk's Twitter takeover: "Data scraping" became the enemy
  3. Revenue pressure: APIs are now profit centers, not goodwill
The result: If you want digital behavioral data in 2026, you need new strategies.

🎯 Your Data Collection Options (2026)

When you need digital data, you have 4 options:

💰 Pay for APIs

✅ Legal, reliable, documented

❌ Expensive ($100-$42,000/month)

Best for: Companies with budgets

📚 Public Datasets

✅ Free, pre-collected, high quality

❌ Limited topics, may be outdated

Best for: Students, researchers

Sources: Kaggle, GitHub, universities

🛠️ Browser Extensions

✅ Free, flexible, you control it

❌ Slower, setup required, legal gray area

Best for: Small projects, specific needs

🏢 Third-Party Services

✅ Easier than DIY, more affordable

❌ Still costs money, less control

Examples: Apify, Bright Data, ScraperAPI

This lesson focuses on Option 3: Browser Extensions

Why? Because it's:

  • ✅ Accessible to non-coders
  • ✅ Free
  • ✅ Flexible for most use cases
But it comes with RESPONSIBILITIES.

🕸️ What is Web Scraping?

Definition:
Automatically extracting data from websites by reading the HTML structure.

How it works:

  1. You visit a website (e.g., Yelp restaurant reviews)
  2. Your browser loads the HTML code
  3. A scraper tool extracts specific data (restaurant names, ratings, review text)
  4. Data is saved to a spreadsheet or database

Example:

Website shows: "Mario's Pizza - ⭐⭐⭐⭐⭐ - 'Best pizza in town!'"

HTML code:

<div class="restaurant">
  <h3>Mario's Pizza</h3>
  <span class="rating">5 stars</span>
  <p class="review">Best pizza in town!</p>
</div>

Scraper extracts:

Name          | Rating | Review
Mario's Pizza | 5      | Best pizza in town!
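Under the hood, scraper tools walk this HTML structure for you. Here is a minimal sketch of that idea using only Python's standard-library HTMLParser; the class names (rating, review) come from the snippet above, and real browser extensions do the equivalent with point-and-click.

```python
from html.parser import HTMLParser

# The HTML snippet from above, embedded directly so the sketch is self-contained.
HTML = """
<div class="restaurant">
  <h3>Mario's Pizza</h3>
  <span class="rating">5 stars</span>
  <p class="review">Best pizza in town!</p>
</div>
"""

class RestaurantParser(HTMLParser):
    """Collects the text inside the <h3>, .rating, and .review elements."""

    def __init__(self):
        super().__init__()
        self.field = None   # which field the parser is currently inside
        self.data = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "h3":
            self.field = "name"
        elif css_class in ("rating", "review"):
            self.field = css_class

    def handle_data(self, text):
        if self.field and text.strip():
            self.data[self.field] = text.strip()
            self.field = None

parser = RestaurantParser()
parser.feed(HTML)
print(parser.data)
# {'name': "Mario's Pizza", 'rating': '5 stars', 'review': 'Best pizza in town!'}
```

Libraries like BeautifulSoup wrap this kind of parsing in a much friendlier API; the point here is just that "scraping" means reading structure, not magic.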

No coding required (we'll use browser extensions)

But first... is this LEGAL?

🎯 CommDAAF Checkpoint: Is Web Scraping Legal?

📊 DISCOVER

Web scraping legality varies by:

  • What you scrape (public vs private data)
  • How you scrape (rate limits, robots.txt compliance)
  • Why you scrape (research vs commercial use)
  • Where you are (US vs EU vs other countries)

Research recent cases:

  • hiQ v. LinkedIn (2022): The Ninth Circuit held that scraping publicly accessible data does not violate the CFAA (though hiQ later lost on breach-of-contract claims)
  • Meta vs Bright Data (ongoing): Scraping lawsuit
  • GDPR (EU): Stricter rules on personal data collection

What did you learn about scraping legality?

🔍 ANALYZE

Look at these 3 scenarios:

  • Scenario A: Scraping Yelp restaurant names and ratings (public info)
  • Scenario B: Scraping Facebook profiles (requires login)
  • Scenario C: Scraping competitor pricing from e-commerce sites

For each:

  • Is this data public or behind authentication?
  • Is there a legitimate research/business purpose?
  • Does the website's Terms of Service prohibit it?

Which scenarios are legally risky? Why?

⚖️ ASSESS

Even if something is LEGAL, is it ETHICAL?

Consider:

  • Attribution: Should you credit the source?
  • Impact: Could scraping slow down or crash the website?
  • Privacy: Does the data include personal information people didn't intend to share?
  • Consent: Did users expect their public posts to be used in research/analysis?

When does "it's public data" NOT make scraping ethical?

🛠️ FORMULATE

Create your own "Ethical Scraping Checklist":

Before scraping any website, I will:

  1. ☐ Check robots.txt
  2. ☐ Read Terms of Service
  3. ☐ _____________________
  4. ☐ _____________________
  5. ☐ _____________________

Add 3 more rules to this checklist:

You can skip this checkpoint, but you won't earn the 20 XP reward.

💾 Your responses are saved to your learning journal

🤖 robots.txt - The Website's Rule Book

Every website can publish a robots.txt file that says:

  • What scrapers/bots are allowed
  • What sections of the site can be accessed
  • How often you can make requests

How to find it:

Add /robots.txt to any domain:

  • https://www.reddit.com/robots.txt
  • https://www.amazon.com/robots.txt
  • https://www.nytimes.com/robots.txt

Example robots.txt:

User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10

User-agent: Googlebot
Allow: /

Translation:

  • All bots: Don't access /private/ or /admin/
  • All bots: Wait 10 seconds between requests
  • Google's bot: Full access allowed
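These rules can also be checked in code. Below is a sketch using Python's built-in robotparser, applied to the sample file above; the rules are parsed locally, so no network request is made.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, parsed locally (no network request).
rules = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# All bots are blocked from /private/, but Googlebot has full access:
print(parser.can_fetch("*", "https://example.com/private/page"))         # False
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # True
print(parser.crawl_delay("*"))  # 10
```

For a live site you would call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse()`.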

Golden Rule:

If robots.txt says Disallow, DON'T scrape it.

Respecting robots.txt is:

  • ✅ Ethical
  • ✅ Reduces legal risk
  • ✅ Industry standard

🎮 robots.txt Scavenger Hunt

We're going to check real websites' robots.txt files.

Question 1

Open: https://www.reddit.com/robots.txt

What does Reddit's robots.txt say about crawl-delay?

A) No crawl-delay specified
B) 1 second
C) 10 seconds
D) Disallows all bots

Question 2

Check: https://www.amazon.com/robots.txt

Can you scrape Amazon product prices?

A) Yes, no restrictions
B) No, explicitly disallowed
C) Only if you're Googlebot
D) Only at 1 request per minute

Question 3

Scenario: You want to scrape LinkedIn job postings.

Check: https://www.linkedin.com/robots.txt

What do you find?

A) LinkedIn allows scraping
B) LinkedIn disallows most bots
C) LinkedIn requires API use
D) No robots.txt file

🛠️ Browser Extensions for Scraping

The easiest way to scrape: Browser extensions

Recommended Tools:

  • Web Scraper (Chrome/Firefox extension)
  • Easy Scraper

How they work:

  1. Install extension
  2. Visit the website you want to scrape
  3. Click extension icon
  4. Select data to extract (click on elements)
  5. Run scraper
  6. Download CSV
We'll do a live demo next.

🎬 Scraping Demo: Yelp Restaurants

Goal: Extract restaurant names, ratings, and review counts from Yelp

Step-by-Step:

STEP 1: Install Extension

Visit Chrome Web Store or Firefox Add-ons and install "Web Scraper" extension

STEP 2: Open the Target Page

Navigate to the Yelp listings you want to scrape (e.g., search results for restaurants in your city)

STEP 3: Open Scraper

Click extension icon → "Create new sitemap"

STEP 4: Select Data

Point and click on:

  • Restaurant name
  • Star rating
  • Number of reviews
  • Price range ($, $$, $$$)
STEP 5: Preview Results

See a table with extracted data

STEP 6: Export

Download as CSV

Your Turn:

Try scraping 10 coffee shops in your city using one of these tools!

(This is optional practice - you can continue the lesson now)

⏱️ Rate Limiting - Don't Crash Websites

The Problem:

If you scrape too fast, you can:

  • Slow down or crash the website
  • Get your IP address banned
  • Violate Terms of Service
  • Create legal liability

Example:

Bad: Scraping 1,000 pages per second

Good: Scraping 1 page every 2-3 seconds

Rate Limiting Best Practices:

  • Respect crawl-delay in robots.txt
  • Add delays between requests (1-5 seconds)
  • Scrape during off-peak hours (2am-6am)
  • Use pagination limits (don't scrape the entire site)
  • Stop if you get rate limit errors (HTTP 429)

DON'T:

  • Run scripts 24/7
  • Scrape entire archives
  • Ignore 429 errors (rate limit exceeded)
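These rules can be baked into a small helper loop. The sketch below is illustrative, not a drop-in tool: `fetch` stands in for your HTTP client (e.g. `requests.get`) and is passed in as a parameter so the pacing logic stays easy to test.

```python
import time

def polite_fetch_all(fetch, urls, delay=2.0, max_retries=3):
    """Fetch each URL with a pause between requests, backing off on HTTP 429.

    `fetch` is a placeholder for your HTTP client (e.g. requests.get); it
    must return an object with a `status_code` attribute.
    """
    results = []
    for url in urls:
        for attempt in range(max_retries):
            resp = fetch(url)
            if resp.status_code == 429:              # rate limit exceeded
                time.sleep(delay * (2 ** attempt))   # exponential backoff
                continue
            results.append(resp)
            break
        time.sleep(delay)  # pause between pages keeps server load low
    return results
```

With `delay=2` this stays near the "one page every 2-3 seconds" guideline above, and a 429 response doubles the wait on each retry instead of hammering the server.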

Analogy:

Scraping = walking into a library

Good: Taking photos of 10 book pages

Bad: Using a copy machine to duplicate the entire library

Be respectful.

🎮 Ethical Scenarios: What Would You Do?

Scenario 1

You want to analyze sentiment on Yelp reviews to understand customer preferences.

You plan to scrape 1,000 reviews from 100 restaurants.

Yelp's robots.txt says: Crawl-delay: 1

What do you do?

A) Scrape with 1-second delay (respects robots.txt)
B) Scrape with no delay (faster)
C) Use Yelp's API (costs $500/month)
D) Don't scrape, find alternative dataset

Scenario 2

Your competitor has pricing data on their public website.

Your boss wants you to scrape their prices daily to undercut them.

What do you do?

A) Scrape daily (public data, fair game)
B) Scrape weekly (less aggressive)
C) Manually check prices (no scraping)
D) Refuse - this is unethical competitive intelligence

Scenario 3

You want to study misinformation on Facebook public groups.

Facebook's ToS prohibits scraping.

What do you do?

A) Scrape anyway for research purposes
B) Apply for Facebook research API access
C) Manually collect data (screenshot posts)
D) Find alternative platform (e.g., Reddit)

Scenario 4

You're scraping Twitter using a third-party tool that bypasses Twitter's rate limits.

The tool uses residential proxies to avoid detection.

What do you do?

A) Use it (convenient, effective)
B) Stop (ethically questionable)
C) Check if tool violates Twitter ToS
D) Report tool to Twitter

🎯 CommDAAF Checkpoint: Who Has the Power?

📊 DISCOVER

Web scraping shifts power between:

  • Platforms (control data access)
  • Researchers (need data for public good)
  • Companies (want competitive intelligence)
  • Users (created the data)

Who should control digital behavioral data?

Think about:

  • Twitter users posted tweets publicly
  • But Twitter now charges $42,000/month for API access
  • Does Twitter "own" user-generated content?

🔍 ANALYZE

Compare two scenarios:

Scenario A - 2015:

PhD student scrapes 1M tweets for a dissertation on social movements.
Twitter has a free API. Legal, ethical, celebrated.

Scenario B - 2024:

Same PhD student, same research goal.
Twitter API costs $42,000/month.
Student can't afford it, uses a scraper instead, and violates the ToS.

What changed?

  • Same data (public tweets)
  • Same research purpose
  • Different business model

Is it ethical for platforms to paywall academic research?

⚖️ ASSESS

Consider the perspectives:

Platform's view:
"Our data infrastructure costs millions. AI companies scraped our data to train models. We deserve compensation."
Researcher's view:
"This is public interest research. Users posted publicly. Paywalling data kills academic research."
User's view:
"I posted this publicly, but I didn't consent to it being sold or trained into AI."

Who's right?

🛠️ FORMULATE

Design a better system:

How should platforms balance:

  • Protecting infrastructure from abuse
  • Enabling legitimate research
  • Respecting user intent
  • Preventing commercial exploitation

Propose a solution:

  • Free tier for academics?
  • Rate limits instead of paywalls?
  • User consent mechanisms?
  • Public data commons?

You can skip this checkpoint, but you won't earn the 20 XP reward.

💾 Your responses are saved to your learning journal

📚 Alternative: Public Datasets

Before you scrape, check if someone already collected the data!

Where to Find Public Datasets:

  • Kaggle
  • GitHub
  • University repositories (e.g., OSoMe)

Advantages:

  • ✅ No legal gray area
  • ✅ Often pre-cleaned
  • ✅ Citeable (academic credibility)

Disadvantages:

  • ❌ Limited to existing topics
  • ❌ May be outdated
  • ❌ Less control over what's included

🎯 Choosing the Right Data Collection Method

START: I need digital behavioral data

Q1: Is there a public dataset that meets my needs?
  └─ YES → Use Kaggle/GitHub/OSoMe → ✅ DONE
  └─ NO → Continue

Q2: Do I have budget for APIs?
  └─ YES ($100-$42k/month) → Use official API → ✅ DONE
  └─ NO → Continue

Q3: Is the data on a public website (no login required)?
  └─ NO → Stop. Don't scrape private data.
  └─ YES → Continue

Q4: What does robots.txt say?
  └─ "Disallow" → Find alternative
  └─ "Allow" or silent → Continue

Q5: Is this for commercial use or research?
  └─ Commercial → Consult a lawyer (legal risk)
  └─ Research → Continue

Q6: How much data do you need?
  └─ < 100 items → Manual collection (copy-paste)
  └─ 100-10,000 items → Browser extension scraping → ✅ START HERE
  └─ > 10,000 items → Third-party service (Apify, ScraperAPI)

Most Common Path for Students/Researchers:
Public dataset → Browser extension scraping (with ethical guidelines)
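The decision tree above can be sketched as a small function. This is illustrative only: the boolean inputs and return labels are placeholders for your own judgment calls, not an official taxonomy.

```python
def choose_method(public_dataset_exists, has_api_budget, data_is_public,
                  robots_disallows, commercial, n_items):
    """A sketch of the data-collection decision tree (illustrative only)."""
    if public_dataset_exists:
        return "public dataset"                    # Q1
    if has_api_budget:
        return "official API"                      # Q2
    if not data_is_public:
        return "stop: don't scrape private data"   # Q3
    if robots_disallows:
        return "find alternative"                  # Q4
    if commercial:
        return "consult a lawyer"                  # Q5
    if n_items < 100:                              # Q6
        return "manual collection"
    if n_items <= 10_000:
        return "browser extension scraping"
    return "third-party service"

# A student with no budget scraping ~500 public items lands on:
print(choose_method(False, False, True, False, False, 500))
# browser extension scraping
```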

🎓 Key Takeaways

The Post-API Era:

  • Free APIs are dead (Twitter, Reddit paywalled)
  • You need new strategies to collect data

Your Options:

  1. Public datasets (Kaggle, GitHub, university repos)
  2. Browser extension scraping (Web Scraper, Easy Scraper)
  3. Third-party services (Apify, ScraperAPI)
  4. Paid APIs (expensive but legal)

Ethical Scraping Checklist:

  ☐ Check robots.txt
  ☐ Respect rate limits (1-5 second delays)
  ☐ Scrape only public data
  ☐ Read Terms of Service
  ☐ Have a legitimate research/business purpose
  ☐ Don't crash websites
  ☐ Anonymize personal data

Legal Reality:

🟢 Scraping public data for research: Generally legal
🟡 Scraping public data for commercial use: Gray area
🔴 Scraping private data (behind login): High legal risk, typically prohibited by ToS and computer-access laws

The Golden Rule:

Just because you CAN scrape doesn't mean you SHOULD.

Consider:

  • Legal risk
  • Ethical implications
  • Impact on the website
  • User consent

Be a responsible data collector.

💪 Practical Exercise (Optional)

Challenge: Collect data on your favorite topic

Instructions:

STEP 1: Choose a topic

Examples: Coffee shops in your city, movie reviews, tech news headlines

STEP 2: Find a source

Yelp, Google Maps, IMDB, Reddit, news sites

STEP 3: Check robots.txt

Add /robots.txt to the domain and review the rules

STEP 4: Choose your method
  • Manual (< 100 items)
  • Browser extension (100-10,000)
  • Public dataset (search Kaggle first)
STEP 5: Collect data

Extract at least 50 data points

STEP 6: Clean data

Create a CSV with at least 3 columns
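This cleaning step can be as simple as the sketch below, using Python's built-in csv module. The filename, column names, and rows are placeholder examples; yours will depend on the topic you chose.

```python
import csv

# Placeholder data points; replace with what you actually collected.
rows = [
    {"name": "Cafe Aroma", "rating": 4.5, "reviews": 120},
    {"name": "Bean There", "rating": 4.0, "reviews": 87},
]

# Write a CSV with three columns: name, rating, reviews.
with open("coffee_shops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "rating", "reviews"])
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file opens directly in any spreadsheet tool and is ready for the sentiment analysis in the next lesson.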

We'll use this data in future lessons!

Next lesson (Text Analytics), you'll analyze sentiment on your collected data.

+30 XP for completing this exercise (tracked outside this lesson)

This exercise is optional. You can skip it and continue to completion.


Lesson Complete!

🔍 Badge Unlocked: Data Detective

You now know how to ethically collect data in the post-API era. You understand legal boundaries, robots.txt, and responsible scraping practices.

Next lesson: Unit 1, Lesson 1 - Google Trends Analysis