Master the art of ethical data collection in the post-API era. Learn to navigate legal boundaries and collect data responsibly.
GET https://api.twitter.com/2/tweets/search/recent?query=climate%20change&max_results=100
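In code, building that request URL took only a few lines. A sketch using just the standard library (no token or network call shown):

```python
from urllib.parse import urlencode, quote

# Build the recent-search URL shown above. quote_via=quote encodes
# the space as %20 instead of urlencode's default "+".
params = {"query": "climate change", "max_results": 100}
url = (
    "https://api.twitter.com/2/tweets/search/recent?"
    + urlencode(params, quote_via=quote)
)
print(url)
```

Actually sending it required an OAuth bearer token, which is exactly why losing free API access locked researchers out overnight.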
For over a decade, social platforms offered free APIs:
Result: Thousands of researchers lost access overnight
When you need digital data, you have four options:
✅ Legal, reliable, documented
❌ Expensive ($100-$42,000/month)
Best for: Companies with budgets
✅ Free, pre-collected, high quality
❌ Limited topics, may be outdated
Best for: Students, researchers
Sources: Kaggle, GitHub, universities
✅ Free, flexible, you control it
❌ Slower, setup required, legal gray area
Best for: Small projects, specific needs
✅ Easier than DIY, more affordable
❌ Still costs money, less control
Examples: Apify, Bright Data, ScraperAPI
Why? Because it's:
Website shows: "Mario's Pizza - ⭐⭐⭐⭐⭐ - 'Best pizza in town!'"
HTML code:
<div class="restaurant">
  <h3>Mario's Pizza</h3>
  <span class="rating">5 stars</span>
  <p class="review">Best pizza in town!</p>
</div>
Scraper extracts:
| Name | Rating | Review |
|---|---|---|
| Mario's Pizza | 5 | Best pizza in town! |
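Under the hood, that extraction step looks something like this. Real projects usually use a third-party parser like BeautifulSoup; the sketch below sticks to the standard library's `html.parser` so it runs anywhere, and the tag-to-column mapping is specific to this made-up snippet:

```python
from html.parser import HTMLParser

# Minimal sketch: pull name, rating, and review out of the
# restaurant card shown above, using only the standard library.
class RestaurantParser(HTMLParser):
    FIELDS = {"h3": "name", "span": "rating", "p": "review"}

    def __init__(self):
        super().__init__()
        self.row = {}
        self._current = None  # column we're currently collecting text for

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._current = self.FIELDS[tag]

    def handle_endtag(self, tag):
        if tag in self.FIELDS:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.row[self._current] = data.strip()

html = """
<div class="restaurant">
  <h3>Mario's Pizza</h3>
  <span class="rating">5 stars</span>
  <p class="review">Best pizza in town!</p>
</div>
"""

parser = RestaurantParser()
parser.feed(html)
print(parser.row)
```

Browser extensions do exactly this for you: you point at an element, they figure out the tag/class pattern, and they repeat the extraction across pages.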
No coding required (we'll use browser extensions)
Web scraping legality varies by:
Research recent cases:
What did you learn about scraping legality?
Look at these 3 scenarios:
For each:
Which scenarios are legally risky? Why?
Even if something is LEGAL, is it ETHICAL?
Consider:
When does "it's public data" NOT make scraping ethical?
Create your own "Ethical Scraping Checklist":
Before scraping any website, I will:
Add 3 more rules to this checklist:
You can skip this checkpoint, but you won't earn the 20 XP reward.
Every website can publish a robots.txt file that says:
Add /robots.txt to any domain:
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10
User-agent: Googlebot
Allow: /
If robots.txt says Disallow, DON'T scrape it.
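Python's standard library can check these rules for you. The sketch below parses the example file above directly (instead of fetching a live one) using `urllib.robotparser`:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# "MyScraper" matches the wildcard (*) entry:
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # False
print(rp.can_fetch("MyScraper", "https://example.com/menu"))          # True
print(rp.crawl_delay("MyScraper"))                                    # 10
```

For a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`.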
Respecting robots.txt is the bare minimum of ethical scraping.
We're going to check real websites' robots.txt files.
Open: https://www.reddit.com/robots.txt
What does Reddit's robots.txt say about crawl-delay?
Scenario: You want to scrape LinkedIn job postings.
Check: https://www.linkedin.com/robots.txt
What do you find?
The easiest way to scrape: Browser extensions
Visit the Chrome Web Store or Firefox Add-ons and install the "Web Scraper" extension
Sample URL: https://www.yelp.com/search?find_desc=pizza&find_loc=New+York
Click extension icon → "Create new sitemap"
Point and click on:
See a table with extracted data
Download as CSV
Try scraping 10 coffee shops in your city using one of these tools!
(This is optional practice - you can continue the lesson now)
If you scrape too fast, you can overload the server and get your IP blocked:
❌ Bad: Scraping 1,000 pages per second
✅ Good: Scraping 1 page every 2-3 seconds
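"Good" pacing is nothing more than a randomized pause between requests. A minimal sketch, where `fake_fetch` is a stand-in for a real HTTP call (e.g. `requests.get`):

```python
import random
import time

def polite_crawl(urls, fetch, delay_range=(2.0, 3.0)):
    """Fetch each URL, pausing a random 2-3 seconds between requests."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(random.uniform(*delay_range))  # be a good citizen
    return pages

# Placeholder for a real HTTP request, so the sketch runs offline.
def fake_fetch(url):
    return f"<html>{url}</html>"

pages = polite_crawl(
    ["https://example.com/1", "https://example.com/2"],
    fake_fetch,
)
print(len(pages))  # 2
```

If the site's robots.txt declares a `Crawl-delay`, use that value as your minimum instead of the default here.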
Scraping = walking into a library
✅ Good: Taking photos of 10 book pages
❌ Bad: Using a copy machine to duplicate the entire library
Be respectful.
You want to analyze sentiment on Yelp reviews to understand customer preferences.
You plan to scrape 1,000 reviews from 100 restaurants.
Yelp's robots.txt says: Crawl-delay: 1
What do you do?
Your competitor has pricing data on their public website.
Your boss wants you to scrape their prices daily to undercut them.
What do you do?
You want to study misinformation on Facebook public groups.
Facebook's ToS prohibits scraping.
What do you do?
You're scraping Twitter using a third-party tool that bypasses Twitter's rate limits.
The tool uses residential proxies to avoid detection.
What do you do?
Web scraping shifts power between:
Who should control digital behavioral data?
Think about:
Compare two scenarios:
PhD student scrapes 1M tweets for dissertation on social movements.
Twitter has a free API. Legal, ethical, celebrated.
Same PhD student, same research goal.
Twitter API costs $42,000/month.
The student can't afford it. They use a scraper, violating the ToS.
What changed?
Is it ethical for platforms to paywall academic research?
Consider the perspectives:
Who's right?
Design a better system:
How should platforms balance:
Propose a solution:
You can skip this checkpoint, but you won't earn the 20 XP reward.
| Advantages | Disadvantages |
|---|---|
| ✅ No legal gray area | ❌ Limited to existing topics |
| ✅ Often pre-cleaned | ❌ May be outdated |
| ✅ Citeable (academic credibility) | ❌ Less control over what's included |
Just because you CAN scrape doesn't mean you SHOULD.
Consider:
Be a responsible data collector.
Examples: Coffee shops in your city, movie reviews, tech news headlines
Yelp, Google Maps, IMDB, Reddit, news sites
Add /robots.txt to the domain and review the rules
Extract at least 50 data points
Create a CSV with at least 3 columns
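Writing that CSV takes only the standard library. The rows below are invented placeholders; yours will come from whatever you scraped:

```python
import csv

# Hypothetical scraped rows: three columns, as the exercise requires.
rows = [
    {"name": "Cafe Uno", "rating": "4 stars", "review": "Great espresso"},
    {"name": "Bean There", "rating": "5 stars", "review": "Cozy and quiet"},
]

with open("coffee_shops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "rating", "review"])
    writer.writeheader()
    writer.writerows(rows)
```

`newline=""` is the documented way to open CSV files for writing in Python; it prevents blank lines on Windows.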
Next lesson (Text Analytics), you'll analyze sentiment on your collected data.
+30 XP for completing this exercise (tracked outside this lesson)
This exercise is optional. You can skip it and continue to completion.
You now know how to ethically collect data in the post-API era. You understand legal boundaries, robots.txt, and responsible scraping practices.