PGH TRANSIT ATLAS

Exploratory Data Analysis Report
Static Visualization Submission (Seaborn + Bokeh)
By Rizaldy Utomo | Public Policy, Analytics, AI Management @ CMU
Dataset: POGOH Bikeshare (556,437 trips) + PRT Bus (1,200+ stops)
Tech Stack: Python (Pandas, Seaborn, Bokeh), scikit-learn, Haversine
Geographic Scope: Pittsburgh Metro Area (40.35°N to 40.55°N, 80.10°W to 79.85°W)

1. Research Question & Motivation

Core Question: How can we optimize micro-mobility integration with public transit in a student-dominated urban environment?

Pittsburgh's bikeshare system operates in a unique context: 68% of trips occur in the Campus Corridor (CMU/Pitt bounding box). This creates extreme seasonal volatility—ridership drops 63% during academic breaks. Traditional transit planning assumes stable demand; this analysis reveals the necessity of dynamic fleet scaling tied to the academic calendar.

Key Insight: Unlike typical bikeshare systems that serve commuters year-round, Pittsburgh's system functions as a "Campus Mobility Extension" requiring different operational strategies than traditional urban bikeshare.

2. Data Sources & Processing

2.1 POGOH Bikeshare Dataset

  • Source: Official POGOH trip logs (2024 full year)
  • Volume: 556,437 trips
  • Schema: Start/End timestamps, Station names, Duration, Rider type (Member/Casual), Demographics
  • Cleaning: Removed trips >180 min (outliers/theft), geocoded station coordinates via fuzzy matching
    # Data Cleaning Pipeline (etl.py lines 85-110)
    import pandas as pd

    # Load raw data
    pogoh = pd.read_excel('dataset/POGOH_2024.xlsx')

    # Parse timestamps
    pogoh['Start Date'] = pd.to_datetime(pogoh['Start Date'])
    pogoh['End Date'] = pd.to_datetime(pogoh['End Date'])

    # Calculate duration in seconds
    pogoh['Duration'] = (pogoh['End Date'] - pogoh['Start Date']).dt.total_seconds()

    # Remove outliers (>180 min = theft/data error). The .copy() makes
    # trips_clean an independent frame, so the column assignments below
    # do not write into a view of pogoh.
    trips_clean = pogoh[pogoh['Duration'] <= 10800].copy()  # 10800 s = 180 min

    # Extract temporal features
    trips_clean['hour'] = trips_clean['Start Date'].dt.hour
    trips_clean['day_of_week'] = trips_clean['Start Date'].dt.day_name()
    trips_clean['month'] = trips_clean['Start Date'].dt.month
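
The report does not say which tool performed the fuzzy matching used for geocoding. As a rough sketch of the idea only, the standard library's difflib can map a misspelled station name onto a canonical one. The lookup table, its placeholder coordinates, the 0.8 cutoff, and the helper names match_station and geocode are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference table: canonical station name -> (lat, lon).
# The coordinates are placeholders for illustration, not real station locations.
STATION_COORDS = {
    "Boulevard of the Allies": (40.0, -80.0),
    "S Bouquet Ave": (40.1, -80.1),
    "S Millvale Ave": (40.2, -80.2),
    "South Side Trail": (40.3, -80.3),
}

# Index canonical names by a normalized (lowercase, trimmed) key.
_NORMALIZED = {name.lower().strip(): name for name in STATION_COORDS}


def match_station(raw_name, cutoff=0.8):
    """Return the canonical station name closest to raw_name, or None."""
    candidates = get_close_matches(
        raw_name.lower().strip(), list(_NORMALIZED), n=1, cutoff=cutoff
    )
    return _NORMALIZED[candidates[0]] if candidates else None


def geocode(raw_name):
    """Look up (placeholder) coordinates for a possibly misspelled name."""
    return STATION_COORDS.get(match_station(raw_name))


print(match_station("Boulevard of the Alies"))  # Boulevard of the Allies
print(geocode("Nowhere Station"))               # None
```

A stricter cutoff trades missed matches for fewer false ones; names that fall below it are better reviewed by hand than guessed.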

2.2 PRT Bus Stop Dataset

  • Source: Pittsburgh Regional Transit (PRT) official stop registry
  • Volume: 1,223 bus stops
  • Schema: Stop name, Coordinates (lat/lon), Annual boardings
  • Integration: Haversine distance calculation to count POGOH trips that start within 400 m of each bus stop
    # Haversine Distance Calculation (etl.py lines 320-335)
    from math import radians, sin, cos, sqrt, atan2

    def haversine(lat1, lon1, lat2, lon2):
        R = 6371000  # Earth radius in meters
        lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * atan2(sqrt(a), sqrt(1-a))
        return R * c  # Returns distance in meters

    # Count bikeshare trips that start within 400m of each bus stop.
    # Distances use each trip's 'Start Lat'/'Start Lon' columns, because
    # the station name string carries no coordinates.
    for idx, bus_stop in bus_stops.iterrows():
        within_400m = trips_clean.apply(
            lambda trip: haversine(trip['Start Lat'], trip['Start Lon'],
                                   bus_stop.lat, bus_stop.lon) <= 400,
            axis=1,
        )
        bus_stops.loc[idx, 'bike_trips_nearby'] = within_400m.sum()
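
The row-by-row loop above evaluates the formula once per trip per bus stop. One possible alternative, sketched here with NumPy broadcasting, computes all pairwise distances in a single array operation. This is not taken from etl.py; the function names haversine_matrix and count_points_within are assumptions. To keep the distance matrix small, it is probably best fed unique station coordinates rather than every trip.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # same Earth radius (meters) as the scalar version


def haversine_matrix(lat_a, lon_a, lat_b, lon_b):
    """Pairwise great-circle distances in meters.

    Returns an array of shape (len(lat_a), len(lat_b)).
    """
    # Column vectors for set A, row vectors for set B, so arithmetic broadcasts
    lat_a = np.radians(np.asarray(lat_a, dtype=float))[:, None]
    lon_a = np.radians(np.asarray(lon_a, dtype=float))[:, None]
    lat_b = np.radians(np.asarray(lat_b, dtype=float))[None, :]
    lon_b = np.radians(np.asarray(lon_b, dtype=float))[None, :]
    a = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def count_points_within(lat_a, lon_a, lat_b, lon_b, radius_m=400):
    """For each point in set A, count the points in set B within radius_m."""
    return (haversine_matrix(lat_a, lon_a, lat_b, lon_b) <= radius_m).sum(axis=1)


# Sanity check: one degree of latitude is about 111.2 km
one_degree = haversine_matrix([0.0], [0.0], [1.0], [0.0])[0, 0]
print(round(one_degree / 1000, 1))  # 111.2
```

Per-stop trip counts can then be recovered by weighting each nearby station by its number of trip starts.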

3. Temporal Dynamics: The "Student Effect"

3.1 Daily Ridership Pattern

The most striking feature of Pittsburgh's bikeshare is its dependency on the academic calendar. Daily ridership ranges from 412 trips (winter break nadir) to 3,800+ trips (fall semester peak), roughly a 9× spread (3,800 / 412 ≈ 9.2).

Daily Timeseries
Figure 1: 365-day timeseries showing Campus vs City ridership. Campus trips (orange) collapse during academic breaks while City trips (green) remain stable.
Policy Implication: Fleet sizing must be dynamically adjusted based on academic calendar. Operating a full fleet during winter break wastes capital on underutilized bikes. Conversely, insufficient capacity during September orientation week creates access bottlenecks.
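
A break-versus-session drop of the kind quoted in this report can be computed from a daily trip series once break dates are known. The sketch below is one way to do it; the function name, the input shape (a Series indexed by date), and the toy numbers are assumptions, and the actual CMU/Pitt break dates are not given in this report.

```python
import pandas as pd


def break_vs_session_drop(daily_trips, break_ranges):
    """Percent drop in mean daily trips during breaks vs in-session days.

    daily_trips: Series of trip counts with a DatetimeIndex (one row per day).
    break_ranges: list of (start, end) date strings, both ends inclusive.
    """
    on_break = pd.Series(False, index=daily_trips.index)
    for start, end in break_ranges:
        in_range = (daily_trips.index >= pd.Timestamp(start)) & (
            daily_trips.index <= pd.Timestamp(end)
        )
        on_break |= in_range
    session_mean = daily_trips[~on_break].mean()
    break_mean = daily_trips[on_break].mean()
    return (1 - break_mean / session_mean) * 100


# Synthetic example: 4 "break" days at 100 trips, 6 "session" days at 400 trips
days = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.Series([100] * 4 + [400] * 6, index=days)
drop = break_vs_session_drop(toy, [("2024-01-01", "2024-01-04")])
print(drop)  # 75.0
```

The same break ranges could later drive a fleet-sizing schedule, but that step depends on operational constraints outside this analysis.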

Methodology: Campus Geofencing

Trips flagged as "Campus Corridor" if start OR end coordinates fall within:

  • Latitude: 40.435°N to 40.450°N
  • Longitude: 79.970°W to 79.940°W (-79.970 to -79.940 in signed decimal degrees)

This bounding box captures CMU, University of Pittsburgh, and Shadyside neighborhoods.

    # Campus Flag Logic (etl.py lines 156-165)
    CAMPUS_LAT_MIN, CAMPUS_LAT_MAX = 40.435, 40.450
    CAMPUS_LON_MIN, CAMPUS_LON_MAX = -79.970, -79.940

    # A trip is flagged if its start OR its end falls inside the box
    trips_clean['is_campus'] = (
        ((trips_clean['Start Lat'] >= CAMPUS_LAT_MIN) &
         (trips_clean['Start Lat'] <= CAMPUS_LAT_MAX) &
         (trips_clean['Start Lon'] >= CAMPUS_LON_MIN) &
         (trips_clean['Start Lon'] <= CAMPUS_LON_MAX))
        |
        ((trips_clean['End Lat'] >= CAMPUS_LAT_MIN) &
         (trips_clean['End Lat'] <= CAMPUS_LAT_MAX) &
         (trips_clean['End Lon'] >= CAMPUS_LON_MIN) &
         (trips_clean['End Lon'] <= CAMPUS_LON_MAX))
    )

    # Result: 68.1% of all trips touch the Campus Corridor
    campus_pct = trips_clean['is_campus'].sum() / len(trips_clean) * 100
    print(f"Campus trips: {campus_pct:.1f}%")  # Output: 68.1%
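
With the flag in place, a Campus vs City daily series like the one in Figure 1 can be built with a groupby. The following is a minimal sketch rather than the code behind the figure; it assumes the 'Start Date' and 'is_campus' columns used above, and the function name and toy rows are made up for the example.

```python
import pandas as pd


def daily_campus_city_counts(trips):
    """Daily trip counts split into 'Campus' and 'City' columns.

    Expects a datetime 'Start Date' column and a boolean 'is_campus' column.
    """
    label = trips["is_campus"].map({True: "Campus", False: "City"})
    counts = (
        trips.groupby([trips["Start Date"].dt.date, label])
        .size()
        .unstack(fill_value=0)  # one column per label, 0 where a day has none
    )
    counts.index.name = "date"
    counts.columns.name = None
    # Guarantee both columns exist even if one label never occurs
    return counts.reindex(columns=["Campus", "City"], fill_value=0)


# Synthetic rows for illustration only
toy_trips = pd.DataFrame({
    "Start Date": pd.to_datetime([
        "2024-09-02 08:00", "2024-09-02 17:30",
        "2024-09-02 12:00", "2024-12-25 10:00",
    ]),
    "is_campus": [True, True, False, False],
})
daily = daily_campus_city_counts(toy_trips)
print(daily)
```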

3.2 Hourly & Weekly Patterns

Beyond seasonal shifts, intra-day patterns reveal two distinct peaks: morning rush (8-9 AM) and evening rush (4-6 PM). Weekend patterns flatten significantly, with Saturday/Sunday showing 30-40% lower volume than weekdays.

Figure 2: Interactive hourly pattern (Bokeh). Hover over points to see exact trip counts. Clear bimodal distribution at 9 AM and 5 PM.
Day × Hour Heatmap
Figure 3: Day × Hour heatmap (Seaborn). Darker red = higher demand. Notice the "Weekend Cooling" (Sat/Sun columns) and "Morning Surge" (7-9 AM rows).
Rebalancing Strategy: The heatmap reveals exact windows for fleet repositioning. Optimal rebalancing occurs during 10-11 AM (post-morning rush, pre-lunch) and 7-8 PM (post-evening rush). These low-demand windows minimize user disruption.
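
The Day × Hour matrix behind a heatmap like Figure 3 can be sketched with pandas.crosstab, and low-demand windows read off its row totals. This assumes the 'hour' and 'day_of_week' columns created in the cleaning step; the function names, the hour window, and the toy rows are assumptions for illustration.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_hour_matrix(trips):
    """Trip counts with hours (0-23) as rows and weekdays as columns."""
    matrix = pd.crosstab(trips["hour"], trips["day_of_week"])
    # Fill in hours/days with no trips and put weekdays in calendar order
    return matrix.reindex(index=range(24), columns=DAY_ORDER, fill_value=0)


def quietest_hours(matrix, start_hour=6, end_hour=22, n=2):
    """Hours in [start_hour, end_hour] with the lowest total demand."""
    totals = matrix.loc[start_hour:end_hour].sum(axis=1)
    return totals.nsmallest(n).index.tolist()


# Synthetic rows for illustration only
toy_trips = pd.DataFrame({
    "hour": [8, 8, 8, 17, 17, 10],
    "day_of_week": ["Monday", "Tuesday", "Monday", "Monday", "Friday", "Saturday"],
})
matrix = day_hour_matrix(toy_trips)
print(quietest_hours(matrix, start_hour=8, end_hour=10, n=2))  # [9, 10]
```

Restricting the search to operating hours keeps overnight lulls from crowding out the daytime windows when crews can actually reposition bikes.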

4. Unsupervised Learning: Trip Archetypes

4.1 K-Means Clustering Methodology

To understand why people ride, we apply K-Means clustering on three features: Duration, Displacement, and Start Hour. This unsupervised approach identifies behavioral patterns without labeled training data.

Algorithm Rationale

  • K-Means (k=4): Chosen for interpretability. Silhouette analysis validated 4 as optimal cluster count (score: 0.68).
  • Feature Scaling: StandardScaler ensures duration (seconds), displacement (meters), and hour (0-23) contribute equally.
  • Feature Selection: Duration + Displacement capture trip purpose better than speed alone. Hour captures temporal behavior (commute vs leisure).
    # K-Means Implementation (etl.py lines 196-223)
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Prepare features (rows with missing values are left out of clustering)
    features = trips_clean[['Duration', 'displacement', 'hour']].dropna()

    # Standardize features (mean=0, std=1)
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)

    # Fit K-Means (k=4 from the elbow method, validated by silhouette analysis)
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    # Assign by index so labels line up with the rows kept by dropna()
    trips_clean.loc[features.index, 'archetype'] = kmeans.fit_predict(features_scaled)

    # Calculate cluster centroids in original units
    centroids = trips_clean.groupby('archetype').agg({
        'Duration': 'mean',
        'displacement': 'mean',
        'hour': 'mean'
    })

    # Manual labeling based on centroid inspection:
    # Cluster 0: Short duration (7.7 min), peak at 5:47 PM → "Commuter"
    # Cluster 1: Medium duration (20.1 min), mid-day → "Errand"
    # Cluster 2: Very short (7.0 min), morning peak → "Last-Mile"
    # Cluster 3: LONG duration (73.2 min!), low displacement → "Leisure"
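
The silhouette check mentioned in the Algorithm Rationale is not shown in the report. A minimal sketch of how such a comparison across k might look with scikit-learn follows; the function name, the range of k, and the synthetic four-blob data are assumptions. Because silhouette scoring compares every pair of points, a run over the full trip table would likely need the sample_size argument.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(features, k_values=range(2, 7), sample_size=None, seed=42):
    """Return {k: silhouette score} for K-Means fits on standardized features."""
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(scaled)
        scores[k] = silhouette_score(
            scaled, labels, sample_size=sample_size, random_state=seed
        )
    return scores


# Synthetic data: four tight, well-separated blobs (not the trip data)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = silhouette_by_k(points)
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 2) for k, s in scores.items()})
```

On real trip data the peak is usually less clear-cut than on synthetic blobs, so the score is best read alongside how interpretable the resulting centroids are.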

4.2 Identified Archetypes

Trip Archetypes
Figure 4: Trip distribution by behavioral archetype (Seaborn). Commuter trips dominate at 47.9%, followed by Last-Mile connectors at 32.8%.
Unexpected Finding: Leisure trips (3.6%) have an average duration of 73.2 minutes with displacement of only 737m. This suggests recreational "circular routes" along the riverfront trail system—users exploring rather than commuting. These trips require different bike availability (longer rental periods, trail-adjacent stations).
Interpretation Note: "Last-Mile" trips (32.8%) peak at 9:18 AM with 7-minute duration. These are not standalone trips—they're bikeshare-to-bus connections. Cross-referencing with PRT data confirms high overlap with major bus hubs (Boulevard of the Allies, S Millvale Ave).

5. Station Behavioral Profiling

5.1 Reverse Analysis: Station "Personalities"

After identifying trip archetypes, we reverse the question: Which stations generate which behaviors? This reveals "station personalities": some stations are pure commuter hubs (64.8% commuter trips), while others are errand centers (68.4% errand trips). This profiling is critical for setting station-specific rebalancing schedules (see Section 8.1).

Methodology

  1. For each station, calculate percentage of trips matching each archetype (Commuter/Last-Mile/Errand/Leisure)
  2. Filter for statistical significance: only stations with 50+ total trips
  3. Select top 3 stations with highest percentage for each archetype
    # Station Profiling (etl.py lines 580-617)
    station_archetype_top = {}

    # Total trips per start station
    station_totals = trips_clean.groupby('Start Station Name').size()

    # Trips per station and archetype, one column per archetype
    station_counts = (
        trips_clean.groupby(['Start Station Name', 'archetype_label'])
        .size()
        .unstack(fill_value=0)
    )

    # Percentage of each station's trips that fall in each archetype
    station_pct = station_counts.div(station_totals, axis=0) * 100

    # Filter: only stations with 50+ trips
    significant_stations = station_pct.loc[station_totals >= 50]

    for archetype in ['Commuter', 'Last-Mile', 'Errand', 'Leisure']:
        # Get top 3 stations by percentage for this archetype
        top_3 = significant_stations[archetype].nlargest(3)
        station_archetype_top[archetype] = {
            'stations': top_3.index.tolist(),
            'percentages': top_3.values.tolist()
        }

5.2 Behavioral Hotspots

Station Archetypes
Figure 5: Top 3 stations per archetype (Seaborn 2×2 grid). Each quadrant shows stations with highest concentration of that behavior type.
Key Findings:
  • Schenley Dr & Schenley Dr Ext: 64.8% Commuter trips (12,721 of 19,637) — Located at CMU campus edge, pure commuter function
  • Wilkinsburg Park & Ride: 68.4% Errand trips (444 of 649) — Suburban station used primarily for shopping/service trips
  • Boulevard of the Allies: 46.9% Last-Mile trips (13,859 of 29,538) — Critical transit feeder with highest volume
  • South Side Trail: 19.0% Leisure trips (712 of 3,739) — Recreational waterfront destination
Operational Insight: Schenley Dr (64.8% commuter) vs South Side Trail (19.0% leisure) require completely different operational strategies. Schenley needs predictable 8 AM bike availability for class commutes; South Side needs afternoon/weekend capacity for exploratory rides. One-size-fits-all rebalancing fails both station types.

6. Rider Demographics: Member vs Casual

POGOH operates a membership model (annual/monthly passes) alongside casual rentals. The distribution is heavily skewed: 98.7% Member, 1.3% Casual. This indicates bikeshare functions as a utilitarian commute tool, not tourism/recreation.

Rider Type Distribution
Figure 6: Member vs Casual ridership (Seaborn). Members account for 549,304 trips vs 7,133 casual trips.
Implication: Revenue model should prioritize membership retention over casual conversion. The 98.7% member share suggests Pittsburgh bikeshare succeeds as a daily commute replacement, not a tourist amenity. Marketing should focus on student/staff annual memberships tied to CMU/Pitt ID integration.
Top Stations by Rider Type
Figure 7: Top 10 stations showing Member (blue) vs Casual (orange) split (Seaborn grouped bars). S Bouquet Ave has the highest casual share, likely due to proximity to visitor destinations.
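
The member and casual shares quoted in this section follow directly from the trip counts in Figure 6. A short worked check, using a helper name chosen for this example:

```python
def share_percentages(counts):
    """Convert a {category: count} mapping into {category: percent of total}."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Trip counts reported in Figure 6
rider_counts = {"Member": 549_304, "Casual": 7_133}
shares = share_percentages(rider_counts)

print(sum(rider_counts.values()))  # 556437
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
# Member: 98.7%
# Casual: 1.3%
```

The two counts sum to the 556,437 trips stated for the dataset.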

7. Interactive Archetype Explorer

Figure 8: Interactive archetype bar chart (Bokeh). Hover to see exact trip counts and percentages.

8. Synthesis & Policy Recommendations

8.1 Core Findings

  1. Academic Calendar Dependency: 68% of trips occur in Campus Corridor, with 63% ridership drop during winter break. Action: Implement dynamic fleet sizing tied to CMU/Pitt academic calendars.
  2. Behavioral Segmentation: 4 distinct rider archetypes identified via K-Means. Commuters (47.9%) need reliability; Leisure riders (3.6%) need flexibility. Action: Differentiated service levels per archetype.
  3. Station Specialization: Schenley Dr is 64.8% commuter; Wilkinsburg P&R is 68.4% errand. Action: Station-specific rebalancing schedules based on dominant archetype.
  4. Member Dominance: 98.7% member trips indicate success as utilitarian tool, not tourist amenity. Action: Double down on institutional partnerships (CMU ID integration, corporate memberships).

9. Processed Data Exports

All processed datasets are available in the /processed_data/ directory for reproducibility:

CSV Files (14 total):
  • daily_timeseries.csv
  • archetypes.csv
  • demographics.csv
  • station_archetypes.csv
  • heatmap.csv
  • monthly_trends.csv
  • duration_distribution.csv
  • directionality.csv
  • correlation.csv
  • top_prt_pogoh.csv
  • bike_stations_geo.csv
  • bus_stops_geo.csv
  • seasonal_heatmap.csv
  • multimodal_hubs.csv
JSON Files (13 total): Identical data in JSON format for web dashboard integration.
Reproducibility: This analysis can be fully reproduced by running python3 etl.py followed by python3 generate_static_viz.py. All data sources, cleaning steps, and statistical methods are documented in-line with code comments matching this report's narrative.
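
How etl.py writes the paired CSV and JSON files is not shown here. One plausible sketch uses pandas' to_csv and to_json; the function name, the column names, and the records-style JSON layout are assumptions.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def export_table(df, name, out_dir):
    """Write df to <out_dir>/<name>.csv and <out_dir>/<name>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{name}.csv"
    json_path = out_dir / f"{name}.json"
    df.to_csv(csv_path, index=False)
    # orient="records" writes a list of row objects
    df.to_json(json_path, orient="records")
    return csv_path, json_path


# Round-trip check in a temporary directory (hypothetical columns;
# the two share values are the ones reported in Section 4.2)
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"archetype": ["Commuter", "Leisure"],
                        "share_pct": [47.9, 3.6]})
    csv_path, json_path = export_table(toy, "archetypes", tmp)
    csv_rows = pd.read_csv(csv_path).to_dict(orient="records")
    json_rows = json.loads(json_path.read_text())

print(csv_rows == json_rows)  # True
```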

10. EDA Thought Process: From Raw Data to Insight

Step 1: Initial Data Exploration

    # First look at the data structure
    import pandas as pd

    pogoh = pd.read_excel('dataset/POGOH_2024.xlsx')
    print(pogoh.shape)  # (556437, 12)
    pogoh.info()  # info() prints its summary directly and returns None
    print(pogoh.describe())

    # Thought: 556K trips is substantial. What's the temporal coverage?
    print(pogoh['Start Date'].min(), pogoh['Start Date'].max())
    # Output: 2024-01-01 to 2024-12-31 → Full year ✓

Step 2: Identifying Anomalies

    # Check for outliers in duration
    print(pogoh['Duration'].describe())
    # Max: 86,400 seconds (24 hours!) → Likely theft or forgotten bike

    # Distribution check
    print((pogoh['Duration'] > 10800).sum())  # 1,234 trips >3 hours

    # Decision: Remove trips >180 min as outliers (99.8th percentile)
    # .copy() so later column assignments do not write into a view of pogoh
    trips_clean = pogoh[pogoh['Duration'] <= 10800].copy()
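
The "99.8th percentile" figure follows from the two counts quoted in this report: 1,234 trips above the cutoff out of 556,437 in total. A quick worked check, with a helper name chosen for this example:

```python
CUTOFF_SECONDS = 180 * 60  # 180 minutes expressed in seconds


def share_at_or_below(n_total, n_above):
    """Percent of records at or below a cutoff, given how many exceed it."""
    return 100 * (n_total - n_above) / n_total


pct = share_at_or_below(556_437, 1_234)
print(CUTOFF_SECONDS)  # 10800
print(f"{pct:.1f}")    # 99.8
```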

Step 3: Feature Engineering

    # Question: Do students vs residents have different trip patterns?
    # Hypothesis: Campus trips should be shorter, peak-hour focused

    # Create campus flag via geographic bounding box
    # (this exploratory version checks start coordinates only; the flag in
    # Section 3.1 also accepts trips that end inside the box)
    trips_clean['is_campus'] = (
        trips_clean['Start Lat'].between(40.435, 40.450)
        & trips_clean['Start Lon'].between(-79.970, -79.940)
    )

    # Validate hypothesis (Duration is in seconds, so convert to minutes)
    print(trips_clean.groupby('is_campus')['Duration'].mean() / 60)
    # Campus: 9.2 min avg | City: 12.7 min avg → Hypothesis CONFIRMED ✓

Step 4: Clustering for Behavior Discovery

Instead of assuming trip purposes (commute/leisure), let unsupervised learning discover natural groupings. K-Means revealed 4 clusters, including a surprising "Leisure" segment with 73-minute average duration—this wasn't anticipated in the initial hypothesis.

    # Question: What natural behavioral segments exist?
    # Approach: K-Means on Duration, Displacement, Hour
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Drop incomplete rows and standardize, as described in Section 4.1
    features = trips_clean[['Duration', 'displacement', 'hour']].dropna()
    scaler = StandardScaler()
    kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
    trips_clean.loc[features.index, 'cluster'] = kmeans.fit_predict(
        scaler.fit_transform(features)
    )

    # Inspect centroids (converted back to original units) to assign labels
    centroids = scaler.inverse_transform(kmeans.cluster_centers_)
    print(centroids)
    # Cluster 3: Duration=4392s (73 min!), Displacement=737m, Hour=14.4
    # → Long rides, low displacement, afternoon → "Leisure" ✓

Step 5: Validation & Iteration

Iterative Refinement: Initial clustering used only Duration + Displacement. This conflated "Commuter" and "Last-Mile" (both short, efficient trips). Adding Start Hour as a third feature separated them: Commuter trips peak at 5 PM, Last-Mile trips peak at 9 AM. This demonstrates the importance of domain knowledge in feature engineering.