Learn to convert messy location data into powerful maps. Discover how spatial analysis reveals patterns invisible in other data, and when it becomes surveillance.
Location data reveals patterns invisible in other data:
COVID-19 Tracking (2020-2022)
George Floyd Protests (2020)
Food Deserts Analysis
Election Results Mapping
Common thread: Location adds a spatial dimension that reveals WHERE things happen and HOW patterns spread across space.
Here's what you get when you extract location from 100 tweets:
✅ Clean
"New York, NY"
✅ Usable
"Brooklyn"
✅ GPS
40.7128° N, 74.0060° W
⚠️ Vague
"East Coast"
❌ Joke
"Mars"
❌ Meme
"In your mom's house"
❌ Abstract
"Everywhere and nowhere"
❌ Missing
[blank]
Of 100 social media posts with a "location" field, only the clean, usable, and GPS entries can be mapped directly. You need to clean this data before mapping.
Geocoding: converting a location description (an address, city name, or landmark) into standardized geographic coordinates (latitude and longitude).
| Input (Location String) | Geocoding → | Output (Coordinates) |
|---|---|---|
| "New York City" | → | 40.7128° N, 74.0060° W |
| "Eiffel Tower, Paris" | → | 48.8584° N, 2.2945° E |
| "1600 Pennsylvania Ave, DC" | → | 38.8977° N, 77.0365° W |
| "Los Angeles, CA" | → | 34.0522° N, 118.2437° W |
Google Maps Geocoding API: most accurate, $5 per 1,000 requests after the free tier
Nominatim (OpenStreetMap): free, open source, less accurate
Mapbox: good accuracy, 100,000 free requests/month
Mapping software (Leaflet, Google Maps, Mapbox) needs coordinates, not text. Geocoding translates human-readable locations into machine-readable coordinates.
Scenario: You have 100 tweets with location strings. Let's geocode them.
| Raw Location | Cleaned | Valid? |
|---|---|---|
| "New York" | "New York, NY, USA" | ✅ |
| "Brooklyn" | "Brooklyn, NY, USA" | ✅ |
| "Mars" | [filter out] | ❌ |
| "USA" | [too vague, skip] | ⚠️ |
Common Issues:
Of 100 tweets, expect roughly 60 to geocode successfully; a 60% success rate is typical for social media data.
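The cleaning step above can be sketched as a small filter. The lookup tables (`CITY_EXPANSIONS`, `JUNK`, `TOO_VAGUE`) are illustrative assumptions, not a complete list:

```python
# Sketch of the cleaning step: expand known city names, drop joke, meme,
# vague, and missing entries. The lookup tables are illustrative only.
CITY_EXPANSIONS = {"new york": "New York, NY, USA", "brooklyn": "Brooklyn, NY, USA"}
JUNK = {"mars", "everywhere and nowhere", "in your mom's house"}
TOO_VAGUE = {"usa", "east coast", "earth"}

def clean_location(raw):
    """Return a cleaned location string, or None if unusable."""
    if not raw or not raw.strip():
        return None                      # missing
    key = raw.strip().lower()
    if key in JUNK or key in TOO_VAGUE:  # jokes, memes, vague regions
        return None
    return CITY_EXPANSIONS.get(key, raw.strip())

locations = ["New York", "Brooklyn", "Mars", "USA", ""]
print([clean_location(loc) for loc in locations])
# ['New York, NY, USA', 'Brooklyn, NY, USA', None, None, None]
```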
How precise is location data?
Example:
GPS coordinates can pinpoint your home, workplace, school.
Calculate: If someone posts 3 tweets from their home location (GPS coordinates), can you identify them?
Real example: Fitness app data leaks
2018: Strava, a fitness app, published a global heatmap of user running routes that inadvertently revealed the locations of secret military bases.
Even aggregated location data can reveal sensitive information. Analyze: What went wrong?
What's the harm of publishing aggregated location data?
Scenario: You're mapping tweet locations for a research paper on climate protests.
Your map shows:
Potential harms:
Design anonymization rules for geospatial research:
Before publishing a map, you must:
Propose 3 rules to protect privacy while enabling spatial research.
Hint: Think about precision levels, minimum counts, data aggregation
Once you have geocoded data, how do you visualize it? Three main approaches:
What: Regions (states, counties, zip codes) colored by data value
Best for: Showing aggregated data across regions
Example: COVID cases by state
Examples: Election results, COVID rates, median income
What: Individual points (pins/markers) at specific locations
Best for: Showing discrete events or locations
Example: Protest locations
📍 📍 📍 📍 📍
Each pin = one protest
Examples: Store locations, crime incidents, protest events
What: Color gradient showing concentration/density
Best for: Showing where activity is concentrated
Example: Tweet density in NYC
Red = high density, Yellow = low
Examples: Tweet density, crime hotspots, foot traffic
Data: % of votes for each candidate by state
Map type: Choropleth (color each state by winner)
Insight: See urban vs rural divide, swing states
Data: 7,000+ protests with GPS coordinates
Map type: Marker map (one pin per protest)
Insight: See which cities had most sustained activity
Solution: Use marker clustering (group nearby markers)
Example: Click a protest marker → see date, size, demands
Example: Show only protests > 1,000 people
Example: "15 protests in this area" → zoom to see individual protests
Data: 100,000 tweets with GPS coordinates
Map type: Heatmap (color gradient by density)
Insight: See which neighborhoods had most social media activity
Solution: Normalize by population (tweets per capita, not raw count)
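The normalization fix is a one-line division; the counts and populations below are made-up illustrative numbers:

```python
# Raw tweet counts largely mirror population, so divide by residents
# to get tweets per capita. All numbers here are illustrative.
raw_counts = {"Manhattan": 50_000, "Staten Island": 5_000}
population = {"Manhattan": 1_600_000, "Staten Island": 500_000}

per_capita = {area: raw_counts[area] / population[area] for area in raw_counts}
for area, rate in per_capita.items():
    print(f"{area}: {rate * 1000:.1f} tweets per 1,000 residents")
```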
Even 3-4 location points can de-anonymize someone.
Research finding: 95% of people can be uniquely identified from 4 spatiotemporal points.
Rule: Only show data if at least K people (usually 5-10) share that location
Example:
Location A: 8 tweets → ✅ Show (K ≥ 5)
Location B: 2 tweets → ❌ Hide (K < 5)
Protects: Individual identification
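K-anonymity reduces to a count filter. A sketch with K=5, matching the rule above:

```python
# K-anonymity filter: publish a location's count only when at least
# K posts share that location (K=5 here, per the rule above).
from collections import Counter

def k_anonymize(locations, k=5):
    """Return {location: count}, keeping only locations with >= k entries."""
    counts = Counter(locations)
    return {loc: n for loc, n in counts.items() if n >= k}

tweets = ["Location A"] * 8 + ["Location B"] * 2
print(k_anonymize(tweets))  # {'Location A': 8}  (Location B hidden, K < 5)
```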
Rule: Instead of exact coordinates, group into grid cells (e.g., 1km x 1km)
Example:
Exact: 40.7128°, -74.0060° (specific building)
Grid: "Grid cell 1234" (1 km area, ~10,000 people)
Protects: Exact home/work locations
Rule: Round coordinates to fewer decimal places
Example:
Precise: 40.7128456° (±10 meters)
Cloaked: 40.71° (±1 kilometer)
Protects: Precision-based identification
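Grid aggregation and spatial cloaking both reduce coordinate precision; a sketch of each (the ~1 km figures assume mid-latitude degrees, and the grid cell IDs are just integer tuples rather than the labels used above):

```python
# Spatial cloaking: round coordinates to fewer decimals.
# Grid aggregation: snap coordinates to a fixed-size cell.
def cloak(lat, lon, decimals=2):
    """Round coordinates to reduce precision (2 decimals ~ 1 km)."""
    return (round(lat, decimals), round(lon, decimals))

def grid_cell(lat, lon, cell_deg=0.01):
    """Snap coordinates to a ~1 km grid cell, identified by integer indices."""
    return (int(lat // cell_deg), int(lon // cell_deg))

print(cloak(40.7128456, -74.0060))      # (40.71, -74.01)
print(grid_cell(40.7128456, -74.0060))  # one cell shared by the whole block
```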
Rule: Don't show exact times, round to hour/day/week
Example:
Exact: "2:47 PM, June 15, 2025"
Fuzzed: "Afternoon, June 2025"
Protects: Tracking movement patterns
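A sketch of temporal fuzzing that maps an exact timestamp to the coarse buckets above (the cutoff hours are assumptions):

```python
# Temporal fuzzing: replace an exact timestamp with a coarse bucket
# (part of day + month), as in the example above.
from datetime import datetime

def fuzz_time(ts):
    """Map a datetime to 'Morning/Afternoon/Evening/Night, Month Year'."""
    if 5 <= ts.hour < 12:
        part = "Morning"
    elif 12 <= ts.hour < 17:
        part = "Afternoon"
    elif 17 <= ts.hour < 21:
        part = "Evening"
    else:
        part = "Night"
    return f"{part}, {ts.strftime('%B %Y')}"

print(fuzz_time(datetime(2025, 6, 15, 14, 47)))  # Afternoon, June 2025
```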
Best Practice: Combine multiple techniques (K-anonymity + Grid aggregation + Temporal fuzzing)
Research this 2022 case:
2022: Cell Phone Data at Abortion Clinics
After Roe v Wade was overturned, data brokers sold location data showing:
Anti-abortion groups used this to identify and target women seeking abortions.
What did you learn about the legal vs ethical boundaries of location data use?
Spectrum of geospatial analysis:
| Use Case | Ethical? |
|---|---|
| A: Mapping COVID spread to allocate resources | ✅ Public health |
| B: Mapping protest locations for journalism | ⚠️ Depends on anonymization |
| C: Tracking abortion clinic visitors | ❌ Surveillance |
When does geospatial analysis cross from research/journalism into surveillance?
Design ethical boundaries:
You're a data journalist. Your editor asks you to map:
Questions:
Create ethical boundaries for location data journalism:
Before publishing a map with location data, ask:
Propose 4 questions journalists should ask before publishing geospatial data.
You have CSV data with messy location strings. Create a map.
| Date | Location | Size |
|---|---|---|
| 2025-03-15 | New York City | 10,000 |
| 2025-03-15 | Los Angeles | 5,000 |
| 2025-03-20 | Brooklyn, NYC | 2,000 |
| 2025-03-22 | San Francisco | 3,500 |
| ... | ... | ... |
Convert "New York City" β coordinates
Choropleth? Marker? Heatmap?
Use k-anonymity or grid aggregation
Create interactive map with Leaflet.js
In VineAnalyst: you'd upload a CSV → choose settings → generate a map automatically
For now, this is a conceptual walkthrough.
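Even as a conceptual walkthrough, the last step can be sketched: a short Python script that writes a self-contained HTML page using Leaflet's marker API (the protest data and output file name are illustrative):

```python
# Sketch of the final step: emit an HTML page that loads Leaflet from its
# CDN and drops one marker per protest. Data and file name are illustrative.
protests = [
    {"city": "New York City", "lat": 40.7128, "lon": -74.0060, "size": 10000},
    {"city": "Los Angeles", "lat": 34.0522, "lon": -118.2437, "size": 5000},
]

markers = "\n".join(
    f'L.marker([{p["lat"]}, {p["lon"]}]).addTo(map)'
    f'.bindPopup("{p["city"]}: {p["size"]:,} people");'
    for p in protests
)

html = f"""<!DOCTYPE html>
<html><head>
<link rel="stylesheet" href="https://unpkg.com/leaflet/dist/leaflet.css"/>
<script src="https://unpkg.com/leaflet/dist/leaflet.js"></script>
</head><body>
<div id="map" style="height: 600px"></div>
<script>
var map = L.map('map').setView([39.8, -98.6], 4);  // center on the US
L.tileLayer('https://tile.openstreetmap.org/{{z}}/{{x}}/{{y}}.png').addTo(map);
{markers}
</script>
</body></html>"""

with open("protest_map.html", "w") as f:
    f.write(html)
```

Opening `protest_map.html` in a browser shows one clickable pin per protest; `L.map`, `L.tileLayer`, `L.marker`, and `bindPopup` are standard Leaflet calls.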
Your data: 50 climate protests across 30 US cities
Question: Which map type should you use?
How it would look: Color each state by # of protests
Problem: Loses city-level detail. Can't see NYC vs rural NY.
Verdict: ❌ Wrong choice for this data
How it would look: One pin per protest
Advantage: See each individual protest location
Interactive: Click a pin → see date, size, location
Verdict: ✅ Good choice! (50 protests = manageable)
How it would look: Color gradient showing density
Advantage: See concentration (e.g., NYC has many protests)
Problem: Loses individual event detail
Verdict: ⚠️ Could work, but markers are better at this size
Why? 50 protests is small enough to show individually, and we want to preserve event-level detail.
Some protests have < 5 attendees. Showing exact location could identify individuals.
Rule: Only show protests with ≥ 5 attendees
Result:
45 protests shown (≥ 5 people)
5 protests hidden (< 5 people)
Pro: Simple, protects small groups
Con: Loses data on small protests
Rule: Snap all coordinates to 10km grid
Result:
Instead of exact street address, show "Grid 1234" (10km area)
Pro: Keeps all data, reduces precision
Con: Loses neighborhood-level detail
Rules:
Pro: Balances detail with privacy
Con: More complex
Shows most data while protecting small groups.
Research historical context:
1930s-1960s: Redlining
Banks drew red lines on maps around Black neighborhoods, denying them mortgages.
Government maps literally color-coded neighborhoods:
Result: Systemic wealth inequality that persists today.
How did maps encode discrimination?
Modern digital redlining examples:
Example 1: Uber Surge Pricing
Research found higher surge pricing in predominantly Black neighborhoods, even with same demand.
Example 2: Food Delivery Zones
DoorDash, Uber Eats exclude certain zip codes (often low-income, minority).
Example 3: Broadband Access
ISPs map "unprofitable" areas (often redlined neighborhoods) and don't invest in infrastructure.
How do location-based algorithms perpetuate historical discrimination?
Can geospatial analysis perpetuate discrimination?
Consider:
When does mapping reveal inequality vs. perpetuate it?
Design location-aware systems that don't discriminate:
If you're building a location-based service (ride-sharing, delivery, etc.), how do you:
Propose 3 principles for equitable geospatial systems.
✅ Geocoding
Converting messy location strings → standardized coordinates. A 60% success rate is typical for social media data.
✅ Three Map Types
✅ Privacy Protection
✅ Real-World Applications
Public health (COVID tracking), social movements (protest mapping), urban planning (food deserts), journalism (election analysis)
⚠️ Privacy Minefield
3-4 location points can de-anonymize 95% of people. Always apply privacy protections.
⚠️ Surveillance vs. Journalism
Legal ≠ ethical. Tracking abortion clinic visitors is legal but harmful.
⚠️ Digital Redlining
Location-based algorithms can perpetuate historical discrimination (Uber surge pricing, delivery zones).
The Golden Rule: With great spatial data comes great responsibility. Map thoughtfully.
You've mastered geospatial analysis and understand the ethical boundaries of mapping digital behavior.