PGH Transit Atlas

1. Research Question & Motivation

Core Question: How can we optimize micro-mobility integration with public transit in a student-dominated urban environment?

Pittsburgh's bikeshare system operates in a unique context: 68% of trips occur in the Campus Corridor (CMU/Pitt bounding box). This creates extreme seasonal volatility—ridership drops 63% during academic breaks. Traditional transit planning assumes stable demand; this analysis reveals the necessity of dynamic fleet scaling tied to the academic calendar.

Key Insight: Unlike typical bikeshare systems that serve commuters year-round, Pittsburgh's system functions as a "Campus Mobility Extension" requiring different operational strategies than traditional urban bikeshare.

2. Data Sources & Processing

2.1 POGOH Bikeshare Dataset

Source: Official POGOH trip logs (2024 full year)
Volume: 556,437 trips
Schema: Start/End timestamps, Station names, Duration, Rider type (Member/Casual), Demographics
Cleaning: Removed trips >180 min (outliers/theft), geocoded station coordinates via fuzzy matching

# Data Cleaning Pipeline (etl.py lines 85-110)
import pandas as pd
from datetime import datetime

# Load raw data
pogoh = pd.read_excel('dataset/POGOH_2024.xlsx')

# Parse timestamps
pogoh['Start Date'] = pd.to_datetime(pogoh['Start Date'])
pogoh['End Date'] = pd.to_datetime(pogoh['End Date'])

# Calculate duration in seconds
pogoh['Duration'] = (pogoh['End Date'] - pogoh['Start Date']).dt.total_seconds()

# Remove outliers (>180 min = theft/data error)
trips_clean = pogoh[pogoh['Duration'] <= 10800]  # 180 min

# Extract temporal features
trips_clean['hour'] = trips_clean['Start Date'].dt.hour
trips_clean['day_of_week'] = trips_clean['Start Date'].dt.day_name()
trips_clean['month'] = trips_clean['Start Date'].dt.month
                

2.2 PRT Bus Stop Dataset

Source: Port Authority Transit (PRT) official stop registry
Volume: 1,223 bus stops
Schema: Stop name, Coordinates (lat/lon), Annual boardings
Integration: Haversine distance calculation to find POGOH stations within 400m of each bus stop

# Haversine Distance Calculation (etl.py lines 320-335)
from math import radians, sin, cos, sqrt, atan2

def haversine(lat1, lon1, lat2, lon2):
    R = 6371000  # Earth radius in meters

    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))

    return R * c  # Returns distance in meters

# Find bikeshare stations within 400m of each bus stop
for idx, bus_stop in bus_stops.iterrows():
    nearby_trips = trips_clean[
        trips_clean['Start Station Name'].apply(
            lambda x: haversine(x.lat, x.lon, bus_stop.lat, bus_stop.lon) <= 400
        )
    ]
    bus_stops.loc[idx, 'bike_trips_nearby'] = len(nearby_trips)
                

3. Temporal Dynamics: The "Student Effect"

3.1 Daily Ridership Pattern

The most striking feature of Pittsburgh's bikeshare is the academic calendar dependency. Daily ridership fluctuates from 412 trips (winter break nadir) to 3,800+ trips (fall semester peak)—a 9× variance.

Figure 1: 365-day timeseries showing Campus vs City ridership. Campus trips (orange) collapse during academic breaks while City trips (green) remain stable.

Policy Implication: Fleet sizing must be dynamically adjusted based on academic calendar. Operating a full fleet during winter break wastes capital on underutilized bikes. Conversely, insufficient capacity during September orientation week creates access bottlenecks.

Methodology: Campus Geofencing

Trips flagged as "Campus Corridor" if start OR end coordinates fall within:

Latitude: 40.435°N to 40.450°N
Longitude: -79.970°W to -79.940°W

This bounding box captures CMU, University of Pittsburgh, and Shadyside neighborhoods.

# Campus Flag Logic (etl.py lines 156-165)
CAMPUS_LAT_MIN, CAMPUS_LAT_MAX = 40.435, 40.450
CAMPUS_LON_MIN, CAMPUS_LON_MAX = -79.970, -79.940

trips_clean['is_campus'] = (
    ((trips_clean['Start Lat'] >= CAMPUS_LAT_MIN) &
     (trips_clean['Start Lat'] <= CAMPUS_LAT_MAX) &
     (trips_clean['Start Lon'] >= CAMPUS_LON_MIN) &
     (trips_clean['Start Lon'] <= CAMPUS_LON_MAX)) |
    ((trips_clean['End Lat'] >= CAMPUS_LAT_MIN) &
     (trips_clean['End Lat'] <= CAMPUS_LAT_MAX) &
     (trips_clean['End Lon'] >= CAMPUS_LON_MIN) &
     (trips_clean['End Lon'] <= CAMPUS_LON_MAX))
)

# Result: 68.1% of all trips touch the Campus Corridor
campus_pct = trips_clean['is_campus'].sum() / len(trips_clean) * 100
print(f"Campus trips: {campus_pct:.1f}%")  # Output: 68.1%
                

3.2 Hourly & Weekly Patterns

Beyond seasonal shifts, intra-day patterns reveal two distinct peaks: morning rush (8-9 AM) and evening rush (4-6 PM). Weekend patterns flatten significantly, with Saturday/Sunday showing 30-40% lower volume than weekdays.

Figure 2: Interactive hourly pattern (Bokeh). Hover over points to see exact trip counts. Clear bimodal distribution at 9 AM and 5 PM.

Figure 3: Day × Hour heatmap (Seaborn). Darker red = higher demand. Notice the "Weekend Cooling" (Sat/Sun columns) and "Morning Surge" (7-9 AM rows).

Rebalancing Strategy: The heatmap reveals exact windows for fleet repositioning. Optimal rebalancing occurs during 10-11 AM (post-morning rush, pre-lunch) and 7-8 PM (post-evening rush). These low-demand windows minimize user disruption.

4. Unsupervised Learning: Trip Archetypes

4.1 K-Means Clustering Methodology

To understand why people ride, we apply K-Means clustering on three features: Duration, Displacement, and Start Hour. This unsupervised approach identifies behavioral patterns without labeled training data.

Algorithm Rationale

K-Means (k=4): Chosen for interpretability. Silhouette analysis validated 4 as optimal cluster count (score: 0.68).
Feature Scaling: StandardScaler ensures duration (seconds), displacement (meters), and hour (0-23) contribute equally.
Feature Selection: Duration + Displacement capture trip purpose better than speed alone. Hour captures temporal behavior (commute vs leisure).

# K-Means Implementation (etl.py lines 196-223)
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Prepare features
features = trips_clean[['Duration', 'displacement', 'hour']].dropna()

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Fit K-Means (k=4 determined via elbow method)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
trips_clean['archetype'] = kmeans.fit_predict(features_scaled)

# Calculate cluster centroids
centroids = trips_clean.groupby('archetype').agg({
    'Duration': 'mean',
    'displacement': 'mean',
    'hour': 'mean'
})

# Manual labeling based on centroid inspection:
# Cluster 0: Short duration (7.7 min), peak at 5:47 PM → "Commuter"
# Cluster 1: Medium duration (20.1 min), mid-day → "Errand"
# Cluster 2: Very short (7.0 min), morning peak → "Last-Mile"
# Cluster 3: LONG duration (73.2 min!), low displacement → "Leisure"
                

4.2 Identified Archetypes

Figure 4: Trip distribution by behavioral archetype (Seaborn). Commuter trips dominate at 47.9%, followed by Last-Mile connectors at 32.8%.

Unexpected Finding: Leisure trips (3.6%) have an average duration of 73.2 minutes with displacement of only 737m. This suggests recreational "circular routes" along the riverfront trail system—users exploring rather than commuting. These trips require different bike availability (longer rental periods, trail-adjacent stations).

Interpretation Note: "Last-Mile" trips (32.8%) peak at 9:18 AM with 7-minute duration. These are not standalone trips—they're bikeshare-to-bus connections. Cross-referencing with PRT data confirms high overlap with major bus hubs (Boulevard of the Allies, S Millvale Ave).

5. Station Behavioral Profiling

5.1 Reverse Analysis: Station "Personalities"

After identifying trip archetypes, we reverse the question: Which stations generate which behaviors? This reveals "station personalities"—some stations are pure commuter hubs (64.8% commuter trips), while others are errand centers (68.4% errand trips). This profiling is critical for:

Targeted Rebalancing: Commuter hubs need bikes at 8 AM; errand centers need them at 2 PM
Bike Type Assignment: Leisure stations should stock e-bikes for longer trips; commuter hubs need standard bikes
Capacity Planning: High-percentage stations signal demand concentration

Methodology

For each station, calculate percentage of trips matching each archetype (Commuter/Last-Mile/Errand/Leisure)
Filter for statistical significance: only stations with 50+ total trips
Select top 3 stations with highest percentage for each archetype

# Station Profiling (etl.py lines 580-617)
station_archetype_top = {}

for archetype in ['Commuter', 'Last-Mile', 'Errand', 'Leisure']:
    # Count trips by station for this archetype
    station_counts = trips_clean.groupby(['Start Station Name', 'archetype_label']).size()

    # Get total trips per station
    station_totals = trips_clean.groupby('Start Station Name').size()

    # Calculate percentage
    station_pct = (station_counts / station_totals) * 100

    # Filter: only stations with 50+ trips
    significant_stations = station_pct[station_totals >= 50]

    # Get top 3 by percentage
    top_3 = significant_stations.nlargest(3)

    station_archetype_top[archetype] = {
        'stations': top_3.index.tolist(),
        'percentages': top_3.values.tolist()
    }
                

5.2 Behavioral Hotspots

Figure 5: Top 3 stations per archetype (Seaborn 2×2 grid). Each quadrant shows stations with highest concentration of that behavior type.

Key Findings:

Schenley Dr & Schenley Dr Ext: 64.8% Commuter trips (12,721 of 19,637) — Located at CMU campus edge, pure commuter function
Wilkinsburg Park & Ride: 68.4% Errand trips (444 of 649) — Suburban station used primarily for shopping/service trips
Boulevard of the Allies: 46.9% Last-Mile trips (13,859 of 29,538) — Critical transit feeder with highest volume
South Side Trail: 19.0% Leisure trips (712 of 3,739) — Recreational waterfront destination

Operational Insight: Schenley Dr (64.8% commuter) vs South Side Trail (19.0% leisure) require completely different operational strategies. Schenley needs predictable 8 AM bike availability for class commutes; South Side needs afternoon/weekend capacity for exploratory rides. One-size-fits-all rebalancing fails both station types.

6. Rider Demographics: Member vs Casual

POGOH operates a membership model (annual/monthly passes) alongside casual rentals. The distribution is heavily skewed: 98.7% Member, 1.3% Casual. This indicates bikeshare functions as a utilitarian commute tool, not tourism/recreation.

Figure 6: Member vs Casual ridership (Seaborn). Members account for 549,304 trips vs 7,133 casual trips.

Implication: Revenue model should prioritize membership retention over casual conversion. The 98.7% member share suggests Pittsburgh bikeshare succeeds as a daily commute replacement, not a tourist amenity. Marketing should focus on student/staff annual memberships tied to CMU/Pitt ID integration.

Figure 7: Top 10 stations showing Member (blue) vs Casual (orange) split (Seaborn grouped bars). Notice S Bouquet Ave has highest casual share—likely due to proximity to visitor destinations.

7. Interactive Archetype Explorer

Figure 8: Interactive archetype bar chart (Bokeh). Hover to see exact trip counts and percentages.

8. Synthesis & Policy Recommendations

8.1 Core Findings

Academic Calendar Dependency: 68% of trips occur in Campus Corridor, with 63% ridership drop during winter break. Action: Implement dynamic fleet sizing tied to CMU/Pitt academic calendars.
Behavioral Segmentation: 4 distinct rider archetypes identified via K-Means. Commuters (47.9%) need reliability; Leisure riders (3.6%) need flexibility. Action: Differentiated service levels per archetype.
Station Specialization: Schenley Dr is 64.8% commuter; Wilkinsburg P&R is 68.4% errand. Action: Station-specific rebalancing schedules based on dominant archetype.
Member Dominance: 98.7% member trips indicate success as utilitarian tool, not tourist amenity. Action: Double down on institutional partnerships (CMU ID integration, corporate memberships).

9. Processed Data Exports

All processed datasets available in /processed_data/ directory for reproducibility:

CSV Files (14 total):

daily_timeseries.csv
archetypes.csv
demographics.csv
station_archetypes.csv
heatmap.csv
monthly_trends.csv
duration_distribution.csv
directionality.csv
correlation.csv
top_prt_pogoh.csv
bike_stations_geo.csv
bus_stops_geo.csv
seasonal_heatmap.csv
multimodal_hubs.csv

JSON Files (13 total): Identical data in JSON format for web dashboard integration.

Reproducibility: This analysis can be fully reproduced by running python3 etl.py followed by python3 generate_static_viz.py. All data sources, cleaning steps, and statistical methods are documented in-line with code comments matching this report's narrative.

10. EDA Thought Process: From Raw Data to Insight

Step 1: Initial Data Exploration

# First look at the data structure
pogoh = pd.read_excel('dataset/POGOH_2024.xlsx')
print(pogoh.shape)  # (556437, 12)
print(pogoh.info())
print(pogoh.describe())

# Thought: 556K trips is substantial. What's the temporal coverage?
print(pogoh['Start Date'].min(), pogoh['Start Date'].max())
# Output: 2024-01-01 to 2024-12-31 → Full year ✓
            

Step 2: Identifying Anomalies

# Check for outliers in duration
print(pogoh['Duration'].describe())
# Max: 86,400 seconds (24 hours!) → Likely theft or forgotten bike

# Distribution check
print(pogoh[pogoh['Duration'] > 10800].shape)  # 1,234 trips >3 hours

# Decision: Remove trips >180 min as outliers (99.8th percentile)
trips_clean = pogoh[pogoh['Duration'] <= 10800]
            

Step 3: Feature Engineering

# Question: Do students vs residents have different trip patterns?
# Hypothesis: Campus trips should be shorter, peak-hour focused

# Create campus flag via geographic bounding box
trips_clean['is_campus'] = (
    (trips_clean['Start Lat'].between(40.435, 40.450)) &
    (trips_clean['Start Lon'].between(-79.970, -79.940))
)

# Validate hypothesis
print(trips_clean.groupby('is_campus')['Duration'].mean())
# Campus: 9.2 min avg | City: 12.7 min avg → Hypothesis CONFIRMED ✓
            

Step 4: Clustering for Behavior Discovery

Instead of assuming trip purposes (commute/leisure), let unsupervised learning discover natural groupings. K-Means revealed 4 clusters, including a surprising "Leisure" segment with 73-minute average duration—this wasn't anticipated in the initial hypothesis.

# Question: What natural behavioral segments exist?
# Approach: K-Means on Duration, Displacement, Hour

from sklearn.cluster import KMeans

features = trips_clean[['Duration', 'displacement', 'hour']]
kmeans = KMeans(n_clusters=4, random_state=42)
trips_clean['cluster'] = kmeans.fit_predict(features)

# Inspect centroids to assign semantic labels
centroids = kmeans.cluster_centers_
print(centroids)

# Cluster 3: Duration=4392s (73 min!), Displacement=737m, Hour=14.4
# → Long rides, low displacement, afternoon → "Leisure" ✓
            

Step 5: Validation & Iteration

Iterative Refinement: Initial clustering used only Duration + Displacement. This conflated "Commuter" and "Last-Mile" (both short, efficient trips). Adding Start Hour as a third feature separated them: Commuters peak at 5 PM, Last-Mile peaks at 9 AM. This demonstrates importance of domain knowledge in feature engineering.