Morello Sims
NBA + MLB
sports simulation × design × methodology

MLB ATLAS

3D PITCHER GALAXY × ARCHETYPE CLUSTERING × HITTER MATCHUPS

34 Archetypes
RHP + LHP
LIVE
5,983 Seasons
2015–2025
GMM
2,696 Batters
vs Cluster
wOBA
OPEN ATLAS → DESKTOP ONLY

NBA SIM

DAILY MATCHUP SIMULATIONS × SPREAD ANALYSIS × SCHEME DETECTION

PHI @ NOP
NOP +4.0
A
SAC @ SAS
SAS -18.5
B
HOU @ NYK
HOU +3.5
B

MLB SIM

SEASON: 44-19 (+5.7% ROI) · C:10 ONLY · |ODDS|<200

NYY vs BOS
Cole (R)
EV+
LAD vs SDP
Glasnow (R)
N/A
ATL vs PHI
Strider (R)
HOT
DISPATCH LOG
PICKS · METHODOLOGY · SYSTEM UPDATES
PICKS LOG
MAY 4 — MAY 25, 2026

NBA SIM: 115-90 RECORD (+5% ROI)

115-90 RECORD
+5% ROI
1,461 BANKROLL
13 PICKS

13 picks, 12 settled. Auto-rendered from picks/nba.json.

8,470 RISKED
L1 STREAK
MAY 25 — MAY 31 PENDING
MAY 25 NYK @ CLE NYK -2.5 C:8 50
MAY 18 — MAY 24 2-1 · +40.9 $PP
MAY 24 OKC @ SAS OKC +2.5 C:8 50 L -50
MAY 23 NYK @ CLE NYK +2.5 C:10 50 W +45.45
MAY 22 OKC @ SAS OKC +2.0 C:9 50 W +45.45
MAY 11 — MAY 17 1-2 · -54.55 $PP
MAY 15 DET @ CLE DET +4.0 C:10 50 W +45.45
MAY 11 DET @ CLE DET +3.5 C:10 50 L -50
MAY 11 OKC @ LAL OKC -11.0 C:8 50 L -50
MAY 4 — MAY 10 4-2 · +81.8 $PP
MAY 10 NYK @ PHI NYK -1.5 C:10 50 W +45.45
MAY 9 DET @ CLE DET +4.5 C:10 50 L -50
MAY 9 OKC @ LAL OKC -8.5 C:10 50 W +45.45
MAY 8 NYK @ PHI NYK +1.5 C:10 50 W +45.45
MAY 6 MIN @ SAS MIN +10.5 C:8 50 L -50
MAY 4 MIN @ SAS MIN +11.0 C:8 50 W +45.45

Picks sourced from the NBA SIM pipeline. Lines via The Odds API. Full methodology at nbasim.

PICKS LOG
MAY 1 — MAY 26, 2026

MLB SIM: 119-92 RECORD (-1% ROI)

119-92 RECORD
-1% ROI
911 BANKROLL
161 PICKS

161 picks, 152 settled. Auto-rendered from picks/mlb.json.

11,450 RISKED
L1 STREAK
MAY 25 — MAY 31 5-2 · +117.77 $PP · 9P
MAY 26 STL @ MIL MIL ML -188 100
MAY 26 SEA @ ATH ATH ML -115 100
MAY 26 PHI @ SD PHI ML -110 100
MAY 26 NYY @ KC NYY ML -202 100
MAY 26 MIN @ CWS CWS ML -103 50
MAY 26 MIA @ TOR TOR ML -144 50
MAY 26 HOU @ TEX TEX ML -133 30
MAY 26 CHC @ PIT PIT ML -135 30
MAY 26 ATL @ BOS ATL ML -110 30
MAY 25 WSH @ CLE CLE ML -175 30 L -30
MAY 25 TB @ BAL BAL ML -105 30 W +28.57
MAY 25 STL @ MIL MIL ML -225 50 W +22.22
MAY 25 PHI @ SD PHI ML -118 100 W +84.75
MAY 25 HOU @ TEX TEX ML -125 30 L -30
MAY 25 COL @ LAD LAD ML -325 30 W +9.23
MAY 25 CIN @ NYM CIN ML +110 30 W +33
MAY 18 — MAY 24 22-28 · -304.33 $PP
MAY 24 TEX @ LAA LAA ML -125 50 W +40
MAY 24 TB @ NYY NYY ML -142 30 W +21.13
MAY 24 PIT @ TOR TOR ML -180 50 L -50
MAY 24 NYM @ MIA NYM ML -118 30 L -30
MAY 24 MIN @ BOS BOS ML -174 30 L -30
MAY 24 LAD @ MIL LAD ML -149 100 W +67.11
MAY 24 DET @ BAL DET ML +105 50 W +52.5
MAY 24 DET @ BAL DET ML +105 50 L -50
MAY 24 DET @ BAL DET ML +110 50 L -50
MAY 24 CWS @ SF CWS ML +105 30 L -30
MAY 24 CLE @ PHI CLE ML -115 50 W +43.48
MAY 24 ATH @ SD SD ML -175 50 L -50
MAY 23 WSH @ ATL ATL ML -172 30 L -30
MAY 23 TEX @ LAA TEX ML -144 50 L -50
MAY 23 TB @ NYY NYY ML -140 100 PUSH
MAY 23 PIT @ TOR TOR ML +130 30 W +39
MAY 23 LAD @ MIL MIL ML -102 100 L -100
MAY 23 HOU @ CHC CHC ML -149 100 L -100
MAY 23 DET @ BAL DET ML +104 100 PUSH
MAY 23 CWS @ SF CWS ML -111 100 L -100
MAY 23 ATH @ SD ATH ML -125 50 L -50
MAY 22 TEX @ LAA TEX ML -160 50 L -50
MAY 22 STL @ CIN CIN ML -121 50 PUSH
MAY 22 PIT @ TOR TOR ML -160 100 W +62.5
MAY 22 LAD @ MIL LAD ML -106 100 L -100
MAY 22 HOU @ CHC CHC ML -143 30 L -30
MAY 22 DET @ BAL DET ML +110 30 L -30
MAY 21 TOR @ NYY TOR ML +120 100 W +120
MAY 21 NYM @ WSH NYM ML -111 50 W +45.05
MAY 21 ATH @ LAA ATH ML -100 100 W +100
MAY 20 TEX @ COL COL ML -100 30 L -30
MAY 20 PIT @ STL PIT ML -104 50 W +48.08
MAY 20 MIL @ CHC CHC ML -116 30 L -30
MAY 20 LAD @ SD LAD ML -194 100 W +51.55
MAY 20 BOS @ KC BOS ML -102 50 W +49.02
MAY 20 BAL @ TB TB ML -117 30 W +25.64
MAY 20 ATL @ MIA ATL ML -187 30 W +16.04
MAY 20 ATH @ LAA ATH ML -130 50 W +38.46
MAY 19 TEX @ COL COL ML -110 50 L -50
MAY 19 LAD @ SD LAD ML -160 30 W +18.75
MAY 19 HOU @ MIN MIN ML -141 50 L -50
MAY 19 CWS @ SEA SEA ML -150 30 L -30
MAY 19 CIN @ PHI PHI ML -149 30 L -30
MAY 19 BOS @ KC BOS ML -130 50 W +38.46
MAY 19 ATL @ MIA ATL ML -140 50 W +35.71
MAY 18 TEX @ COL TEX ML -150 50 L -50
MAY 18 NYM @ WSH NYM ML -125 50 W +40
MAY 18 MIL @ CHC CHC ML -162 30 L -30
MAY 18 HOU @ MIN MIN ML -117 100 W +85.47
MAY 18 BOS @ KC KC ML -107 50 L -50
MAY 18 BAL @ TB BAL ML +121 100 L -100
MAY 18 ATL @ MIA MIA ML -114 100 W +87.72
MAY 18 ATH @ LAA ATH ML -133 50 L -50
MAY 11 — MAY 17 27-16 · +342.12 $PP
MAY 17 SF @ ATH ATH ML -149 50 L -50
MAY 17 NYY @ NYM NYM ML -128 100 W +78.12
MAY 17 MIL @ MIN MIN ML +101 50 W +50.5
MAY 17 MIA @ TB TB ML -149 30 W +20.13
MAY 17 LAD @ LAA LAD ML -149 100 W +67.11
MAY 17 KC @ STL KC ML +101 50 W +50.5
MAY 17 CIN @ CLE CLE ML -161 30 W +18.63
MAY 17 CHC @ CWS CHC ML -131 100 L -100
MAY 17 BOS @ ATL ATL ML -145 30 W +20.69
MAY 17 BAL @ WSH BAL ML -130 100 W +76.92
MAY 16 PHI @ PIT PHI ML -182 50 W +27.47
MAY 16 NYY @ NYM NYM ML +105 30 W +31.5
MAY 16 LAD @ LAA LAD ML -136 50 W +36.76
MAY 16 BOS @ ATL ATL ML -119 50 L -50
MAY 16 BAL @ WSH WSH ML -109 30 W +27.52
MAY 15 LAD @ LAA LAD ML -188 100 W +53.19
MAY 15 KC @ STL STL ML -110 30 W +27.27
MAY 15 BAL @ WSH BAL ML -139 100 L -100
MAY 14 WSH @ CIN WSH ML +134 50 L -50
MAY 14 STL @ ATH STL ML -105 30 W +28.57
MAY 14 SF @ LAD LAD ML -177 100 W +56.5
MAY 14 SEA @ HOU SEA ML -130 30 W +23.08
MAY 14 MIA @ MIN MIA ML -115 30 L -30
MAY 14 KC @ CWS KC ML -135 50 L -50
MAY 14 COL @ PIT PIT ML -177 30 W +16.95
MAY 13 STL @ ATH ATH ML -150 30 W +20
MAY 13 PHI @ BOS PHI ML +109 100 L -100
MAY 13 NYY @ BAL NYY ML -164 50 L -50
MAY 13 MIA @ MIN MIA ML -125 30 W +24
MAY 13 LAA @ CLE CLE ML -159 100 W +62.89
MAY 13 COL @ PIT PIT ML -185 30 L -30
MAY 12 WSH @ CIN WSH ML +121 100 W +121
MAY 12 TB @ TOR TOR ML -100 100 L -100
MAY 12 STL @ ATH ATH ML -153 30 L -30
MAY 12 SF @ LAD LAD ML -314 100 L -100
MAY 12 SEA @ HOU SEA ML -195 100 W +51.28
MAY 12 PHI @ BOS PHI ML -145 100 W +68.97
MAY 12 NYY @ BAL NYY ML -140 30 W +21.43
MAY 12 CHC @ ATL CHC ML -100 30 L -30
MAY 11 TB @ TOR TOR ML -122 30 L -30
MAY 11 SF @ LAD SF ML +154 100 W +154
MAY 11 NYY @ BAL NYY ML -155 50 L -50
MAY 11 LAA @ CLE CLE ML -175 100 W +57.14
MAY 4 — MAY 10 20-26 · -788.94 $PP
MAY 10 WSH @ MIA MIA ML -135 30 W +22.22
MAY 10 NYY @ MIL MIL ML -120 50 W +41.67
MAY 10 MIN @ CLE CLE ML -160 30 L -30
MAY 10 HOU @ CIN CIN ML -115 50 W +43.48
MAY 10 COL @ PHI PHI ML -312 50 W +16.03
MAY 10 ATL @ LAD LAD ML -140 50 L -50
MAY 9 WSH @ MIA MIA ML -158 50 W +31.65
MAY 9 TB @ BOS BOS ML -143 50 PUSH
MAY 9 STL @ SD STL ML +119 30 L -30
MAY 9 SEA @ CWS SEA ML -133 100 L -100
MAY 9 PIT @ SF PIT ML -110 30 W +27.27
MAY 9 NYY @ MIL NYY ML -135 30 L -30
MAY 9 ATL @ LAD LAD ML -182 100 L -100
MAY 8 STL @ SD SD ML -151 100 L -100
MAY 8 SEA @ CWS SEA ML -138 30 W +21.74
MAY 8 PIT @ SF SF ML -110 50 W +45.45
MAY 8 LAA @ TOR TOR ML -162 30 W +18.52
MAY 8 HOU @ CIN CIN ML -131 50 L -50
MAY 8 DET @ KC DET ML +124 30 L -30
MAY 8 COL @ PHI PHI ML -227 30 L -30
MAY 8 ATH @ BAL BAL ML -135 50 L -50
MAY 7 TEX @ NYY TEX ML +128 50 L -50
MAY 7 TB @ BOS BOS ML -118 50 L -50
MAY 7 STL @ SD SD ML -175 50 L -50
MAY 7 NYM @ COL NYM ML -142 30 L -30
MAY 7 BAL @ MIA MIA ML -128 50 W +39.06
MAY 6 TOR @ TB TOR ML +123 50 L -50
MAY 6 TEX @ NYY NYY ML -199 30 L -30
MAY 6 SD @ SF SD ML -112 30 W +26.79
MAY 6 LAD @ HOU LAD ML -226 100 W +44.25
MAY 6 CIN @ CHC CHC ML -168 30 W +17.86
MAY 6 BOS @ DET DET ML -126 30 L -30
MAY 6 BAL @ MIA MIA ML -131 50 L -50
MAY 6 ATL @ SEA SEA ML -136 30 W +22.06
MAY 5 TOR @ TB TOR ML +123 30 L -30
MAY 5 TEX @ NYY TEX ML -292 100 L -100
MAY 5 SD @ SF SD ML -175 30 W +17.14
MAY 5 MIN @ WSH MIN ML -136 30 W +22.06
MAY 5 LAD @ HOU LAD ML -226 100 L -100
MAY 5 CWS @ LAA CWS ML -143 30 L -30
MAY 5 CLE @ KC CLE ML -126 50 L -50
MAY 5 CIN @ CHC CHC ML -163 30 W +18.4
MAY 5 BOS @ DET DET ML -236 30 L -30
MAY 5 ATL @ SEA SEA ML -136 50 L -50
MAY 5 ATH @ PHI PHI ML -186 30 W +16.13
MAY 4 LAD @ HOU LAD ML -207 50 W +24.15
MAY 4 BAL @ NYY NYY ML -199 50 W +25.13
APR 27 — MAY 3 1-1 · -76.3 $PP
MAY 2 LAD @ STL LAD ML -326 100 L -100
MAY 1 BAL @ NYY NYY ML -211 50 W +23.7

Picks sourced from the MLB SIM pipeline. Lines via The Odds API. Full methodology at mlbsim.

METHODOLOGY
FEB 16, 2026

MLB ATLAS: 8-STEP PITCHER ARCHETYPE ENGINE

Statcast telemetry → GMM clustering → 3D terrain city. How we classify 5,983 pitcher-seasons into 34 archetypes and generate hitter matchup projections.

FIG. 01: MLB PIPELINE — END-TO-END DATA FLOW
[Figure: MLB pitcher archetype pipeline (8 steps, Python)]

LAYER 1: INGESTION
Statcast API: pybaseball, pitch-level data
Baseball Savant: 2015–2026, ~700K pitches/yr
Roster data: teams, rosters, WBC
Output: statcast_{year}.parquet, ~150 MB/season, Snappy compressed

LAYER 2: FEATURE ENGINEERING
01 Fetch: month-chunked pulls; retry w/ backoff; 59 columns kept
02 Roles: SP/RP classification; games-started ratio; binary is_sp
03 Features: pitch mix (10 types); velo, spin, whiff, arm; SV reclassification
Zone location: same/opp side splits; 9-quadrant entropy; 13 zone features
Pitcher-season feature vector (14 dimensions): pct_FF, pct_SI, pct_FC, pct_SL, pct_CH, pct_CU, velo, spin, gb, whiff...

LAYER 3: CLUSTERING & CLASSIFICATION
04 GMM clustering: RHP / LHP split independently; StandardScaler → BIC optimization; min K=8; 3D PCA; X-offset ±5
05 Archetype naming: geometric medoid (real pitcher); rule-based trait scoring; 17 names: Snake, Ghost, Barnburner
06-08 Matchup analytics: hitter vs cluster (wOBA, K%, BB%); hitter vs pitcher (head-to-head); hitter timing archetypes

LAYER 4: FRONTEND DELIVERY
Cosmos Atlas: Three.js
MLB SIM: React
Vite → GitHub Pages

Stack: Python, sklearn, pandas, Three.js, Parquet

THE 8-STEP PIPELINE

The MLB system is an 8-step sequential pipeline built in Python, orchestrated by run_all.py. Each step reads the previous step's output and writes its own artifacts. The entire pipeline can be resumed from any step with --from N.
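The resume behavior can be pictured with a small sketch. This is not the actual run_all.py; the step names in STEPS and the helper functions are hypothetical stand-ins, and only the --from flag comes from the description above.

```python
import argparse

# Hypothetical step registry: the names mirror the eight steps described here,
# but they are placeholders, not the real pipeline functions.
STEPS = [
    "fetch_statcast", "classify_roles", "feature_engineering", "gmm_clustering",
    "archetype_naming", "hitter_vs_cluster", "hitter_vs_pitcher", "hitter_timing",
]

def steps_to_run(start: int) -> list:
    """Return the step names from `start` (1-based) through the last step."""
    if not 1 <= start <= len(STEPS):
        raise ValueError(f"--from must be between 1 and {len(STEPS)}")
    return STEPS[start - 1:]

def main(argv=None) -> list:
    parser = argparse.ArgumentParser(description="Pipeline runner (sketch)")
    parser.add_argument("--from", dest="start", type=int, default=1)
    args = parser.parse_args(argv)
    selected = steps_to_run(args.start)
    for number, name in enumerate(selected, start=args.start):
        print(f"step {number:02d}: {name}")  # a real runner would call the step here
    return selected

# Resume at step 6: only the three matchup steps are selected.
resumed = main(["--from", "6"])
```

Because each step reads the previous step's artifacts from disk, resuming only requires skipping the earlier entries in the list.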

01 Fetch Statcast — Pulls pitch-level telemetry from Baseball Savant via pybaseball. Data is chunked by month (Mar-Oct) with retry logic and polite 2s delays. Each season yields ~700K pitches across 59 columns including release speed, spin rate, pitch movement (pfx_x/pfx_z), plate location, and batted ball outcomes. Saved as compressed Parquet files (~150MB/season).
02 Classify SP/RP Roles — Determines whether each pitcher-season is a Starter or Reliever based on games-started ratio. Produces a binary is_sp flag used downstream for role-aware archetype naming.
03 Feature Engineering — The heaviest step. Aggregates pitch-level data into pitcher-season feature vectors. Computes: pitch mix usage rates (10 types), SV reclassification (SV pitches mapped to CU/SL/ST per pitcher based on velocity and vertical break), spin rates, arm angle (derived from release point geometry), whiff rate, fastball velocity, groundball rate, zone rate, pitch movement vectors, and a 13-feature zone location layer with same-side/opposite-side splits, platoon shifts, and Shannon entropy of 9-quadrant distributions.
04 GMM Clustering — Pitchers are split by handedness (RHP/LHP) and clustered independently using Gaussian Mixture Models. Features are StandardScaled, then GMM is fit across K=2–15 with BIC optimization (minimum K=8 enforced for meaningful granularity). GMM captures soft cluster boundaries and probabilistic membership, which better reflects how pitcher styles blend. Each hand produces 17 clusters. A 3D PCA projection is computed for the Atlas city view, with RHP offset +5 on the X axis and LHP offset −5 to create visual separation.
05 Archetype Naming — Each cluster's geometric medoid (the real pitcher minimizing sum of distances to all cluster members) is identified. A rule-based trait scorer examines the medoid's pitch mix, velocity, spin, and outcomes to assign one of 17 archetype names: Snake, Barnburner, Ghost, Earthworm, Swordfighter, Kitchen Sink, and more. Each archetype gets a consistent color and emoji for the frontend.
06 Hitter vs Cluster — Every pitch is tagged with its pitcher's cluster ID. Plate appearance outcomes are aggregated per batter × cluster × year × batter-side, producing wOBA, BA, SLG, K%, BB%, and whiff% for each matchup combination.
07 Hitter vs Pitcher — Direct head-to-head stats between individual batters and pitchers, providing granular matchup data beyond the cluster-level aggregations.
08 Hitter Timing Archetypes — Classifies hitters by their timing and approach patterns against different pitch types and velocities, adding another dimension to the matchup analysis.
FIG. 02: COSMOS ATLAS — 3D TERRAIN CITY ARCHITECTURE
[Figure: Cosmos Atlas, 3D terrain city (vanilla JS + Three.js)]

DATA FILES (JSON)
clusters.json: 34 archetype profiles + colors; medoid PCA (x, y, z) positions; pitcher_count, velo, whiff, GB%
pitcher_seasons.json: 5,983 pitcher-seasons (2015-26); PCA x, y, z coordinates; name, hand, cluster, velo
hitter_vs_cluster.json: batter vs archetype stats; wOBA, BA, SLG, K%, BB%; min 10 PA threshold
batters.json: MLB batter directory; name, ID, team, side; autocomplete search

THREE.JS 3D SCENE (WebGL + CSS2DRenderer)
cosmos.html state: activeBatters[], selectedStat, selectedYear, minPA, visibleClusters

SCENE GEOMETRY
Terrain heightfield: 128×128 subdivided PlaneGeometry; Gaussian kernel density sum; pitcher_count → hill elevation; LA-style rolling topography
Voronoi neighborhoods: 34 districts (17 RHP + 17 LHP); half-plane bisector clipping; ShapeGeometry on terrain surface; inset 3% for street gaps
3D buildings: BoxGeometry per grid cell; height = avg fastball velo; confined inside Voronoi zone; PCF soft shadows
Interactions: OrbitControls (rotate/zoom); Raycaster click/hover; CSS2DObject floating labels; batter route arcs (TubeGeo)

REBUILD PIPELINE (on filter change)
buildTerrain() → buildNeighborhoods() → buildBuildings() → buildHighway() → buildLabels() → buildRoutes()

Delivery: GitHub Pages, static, no server, ~130MB JSON data baked in

KEY DESIGN DECISIONS

GMM over K-Means: We use Gaussian Mixture Models instead of K-Means for clustering. GMM captures soft cluster boundaries and probabilistic membership—a pitcher can be 70% Ghost and 30% Swordfighter—which better reflects how real pitching styles blend. Model selection uses BIC (Bayesian Information Criterion) rather than silhouette score, with minimum K=8 enforced per hand.
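A minimal sketch of BIC-driven model selection with scikit-learn, assuming the K=8 floor is enforced by starting the search at 8. The function name, the synthetic data, and the covariance type are illustrative assumptions, not the pipeline's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def fit_gmm_by_bic(features, k_min=8, k_max=15, seed=0):
    """Fit one GMM per candidate K and keep the model with the lowest BIC."""
    scaled = StandardScaler().fit_transform(features)
    best_model, best_bic = None, np.inf
    for k in range(k_min, k_max + 1):
        model = GaussianMixture(n_components=k, covariance_type="full",
                                random_state=seed)
        model.fit(scaled)
        bic = model.bic(scaled)  # lower BIC = better fit/complexity trade-off
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model, scaled

# Synthetic stand-in for one hand's pitcher-season feature matrix.
rng = np.random.default_rng(0)
demo = rng.normal(size=(400, 4))
model, scaled = fit_gmm_by_bic(demo)

# Soft membership: each row is a probability distribution over clusters.
membership = model.predict_proba(scaled)
```

The rows of `membership` are what make a statement like "70% Ghost, 30% Swordfighter" possible; K-Means would only return a single hard label per pitcher.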

SV Reclassification: Statcast's “SV” (slurve) classification is inconsistent across seasons. We built a per-pitcher mapping that examines career-average SV velocity and vertical break to reclassify each pitcher's SV as a curveball (CU, pfx_z < −0.50), slider (SL, speed > 84 mph), or sweeper (ST, everything else). This keeps clustering stable across the 2015–2026 dataset.
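The reclassification rule reads naturally as a three-way branch. In this sketch the function name is hypothetical and the order of the checks (vertical break first, then velocity) is an assumption; the thresholds are the ones quoted above.

```python
def reclassify_sv(avg_speed_mph: float, avg_pfx_z: float) -> str:
    """Map a pitcher's career-average SV shape to CU, SL, or ST."""
    if avg_pfx_z < -0.50:      # heavy vertical drop: treat as a curveball
        return "CU"
    if avg_speed_mph > 84.0:   # firmer than 84 mph: treat as a slider
        return "SL"
    return "ST"                # everything else: sweeper

# Three hypothetical pitchers with different SV shapes.
labels = [
    reclassify_sv(79.5, -0.90),
    reclassify_sv(86.0, -0.10),
    reclassify_sv(81.0, -0.20),
]
```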

Separate RHP/LHP Clustering: Rather than clustering all pitchers together, we split by handedness first. This prevents the dominant handedness signal from overwhelming the pitch-mix features. Each hand gets its own StandardScaler, GMM model, and PCA projection. The X-axis offset (+5/−5) in PCA space creates the visual highway divider in the city view.

Medoid over Centroid: Archetype representatives are chosen as the geometric medoid (the real pitcher that minimizes total distance to all cluster members), not the mathematical centroid. This means every archetype profile references an actual pitcher's stats, not a phantom average that no real pitcher matches.
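A small NumPy sketch of medoid selection, assuming Euclidean distance in the scaled feature space (the distance metric is not stated above). The toy cluster is invented for illustration.

```python
import numpy as np

def medoid_index(points: np.ndarray) -> int:
    """Index of the row that minimizes total distance to all other rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise distance matrix
    return int(dist.sum(axis=1).argmin())

# Toy cluster with one outlier pulling the mean away from the pack.
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
idx = medoid_index(cluster)
centroid = cluster.mean(axis=0)
```

Here the medoid is the real row `[2.0, 0.0]`, while the centroid lands at x = 3.2, a point that matches no member of the cluster.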

Zone Location Entropy: The 13-feature zone location layer captures not just where pitchers throw, but how predictable their patterns are. Shannon entropy across a 9-quadrant grid (3 lateral × 3 vertical) measures location unpredictability, and platoon shift features capture how much a pitcher adjusts against same-side vs opposite-side batters.
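Shannon entropy over the nine zones can be sketched as follows. The input is assumed to be a list of zone indices (0-8) already assigned to each pitch, and the log base (bits) is an assumption.

```python
import math
from collections import Counter

def zone_entropy(zones) -> float:
    """Shannon entropy (in bits) of a pitcher's zone distribution."""
    counts = Counter(zones)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

spread_out = zone_entropy(list(range(9)) * 10)   # every zone used equally
one_spot = zone_entropy([4] * 90)                # every pitch in one zone
```

A pitcher who uses all nine zones equally reaches the maximum of log2(9), about 3.17 bits; a pitcher who lives in a single zone scores 0.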

3D TERRAIN CITY VISUALIZATION

LA-Style Terrain Heightfield: The flat ground plane is replaced with a 128×128 subdivided mesh whose vertices are displaced by a Gaussian kernel density function. Each of the 34 cluster centroids emits a Gaussian “hill” with amplitude proportional to log(pitcher_count). Dense clusters like RHP Ghost (542 pitchers) form prominent hilltops; sparse ones like Knuckleball Wizard sit in valleys. The result is an organic, LA-style rolling topography.
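The heightfield idea, sketched in Python rather than the site's Three.js. The grid extent, kernel width (sigma), and cluster positions are illustrative assumptions; only the log(pitcher_count) amplitude and the 128-cell resolution come from the text.

```python
import numpy as np

def build_heightfield(centers, counts, size=128, extent=10.0, sigma=1.5):
    """Sum one Gaussian hill per cluster; amplitude scales with log(count)."""
    axis = np.linspace(-extent, extent, size)
    gx, gy = np.meshgrid(axis, axis)
    height = np.zeros((size, size))
    for (cx, cy), n in zip(centers, counts):
        amplitude = np.log(n)
        dist_sq = (gx - cx) ** 2 + (gy - cy) ** 2
        height += amplitude * np.exp(-dist_sq / (2 * sigma ** 2))
    return height

# Two hypothetical clusters: a dense one on the left, a sparse one on the right.
terrain = build_heightfield(centers=[(-5.0, 0.0), (5.0, 0.0)], counts=[542, 20])
peak_row, peak_col = np.unravel_index(terrain.argmax(), terrain.shape)
```

The log keeps a 542-pitcher cluster from towering over everything else: its hill is roughly twice as tall as a 20-pitcher hill, not 27 times.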

Voronoi Neighborhood Tessellation: Each archetype occupies a Voronoi cell computed via half-plane bisector clipping (Sutherland-Hodgman). These polygons are projected onto the terrain surface as ShapeGeometry meshes, with vertices displaced to follow the heightfield. A 3% inset creates natural “street” gaps between districts.

Buildings Confined to Voronoi Zones: Rather than placing buildings at individual pitcher PCA coordinates (which causes cross-cluster color mixing), buildings are packed into a grid of slots inside each Voronoi polygon using point-in-polygon tests. Slots fill center-outward, pitchers distribute round-robin. Each building's height maps to average fastball velocity; bases sit on the terrain surface. This guarantees clean, single-color districts.

Terrain-Following Highway: The RHP/LHP divider at X=0 is built as discrete segments that follow the terrain contour, with emissive orange dashes for the center line. All scene elements—neighborhoods, buildings, labels, batter route arcs—are terrain-aware via bilinear-interpolated height queries.
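A bilinear height query can be sketched like this, in Python for consistency with the rest of the pipeline. The grid layout (a list of rows, with coordinates in cell units) is an assumption made for the example.

```python
def sample_height(grid, x: float, y: float) -> float:
    """Bilinearly interpolate a heightfield at fractional grid coordinates."""
    rows, cols = len(grid), len(grid[0])
    x = min(max(x, 0.0), cols - 1)          # clamp to the grid bounds
    y = min(max(y, 0.0), rows - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    top = grid[y0][x0] * (1 - tx) + grid[y0][x1] * tx
    bottom = grid[y1][x0] * (1 - tx) + grid[y1][x1] * tx
    return top * (1 - ty) + bottom * ty

heights = [[0.0, 10.0],
           [20.0, 30.0]]
center = sample_height(heights, 0.5, 0.5)   # midpoint of the four corners
corner = sample_height(heights, 1.0, 1.0)   # exactly on a vertex
```

Any object placed at an arbitrary (x, y) can call a query like this to find the ground height beneath it, which is what keeps labels and buildings from floating or sinking.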

METHODOLOGY
FEB 24, 2026

NBA SIM: 4-PHASE PREDICTION ENGINE

Scheme detection → player archetypes → lineup synergy → DSI spread model. The full pipeline from NBA API to game predictions.

FIG. 03: NBA SIM — COMPLETE SYSTEM ARCHITECTURE
[Figure: NBA SIM multi-layer prediction engine (4 phases, Python + sklearn)]

PHASE 1: COLLECT (6 COLLECTORS)
nba_api: teams, rosters, season stats; rate limited: 2s
Game data: scores, box scores; 27 stat columns/game; per-game + advanced
Lineups: 2-man through 5-man; net rating, possessions; min poss thresholds
Play types: SynergyPlayTypes API; 11 types × Off/Def; PPP, freq%, TO%, FG%
Box scores: player per-game; USG%, TS%, ORTG, PIE; 27 columns each
Odds API: the-odds-api.com; spreads + totals; multi-book consensus
Storage: SQLite, nba_sim.db, 17 tables (player_game_stats, lineup_stats (2-5 man), team/player_playtypes)

PHASE 2: ANALYZE (2 ENGINES)
Coaching scheme classifier
Offensive: PnR-Heavy, ISO-Heavy, Motion, Run-and-Gun, Spot-Up, Post-Oriented, plus pace (Fast/Mid/Slow)
Defensive: Switch-Everything, Drop-Coverage, Rim-Protect, Trans-Defense, Blitz; PPP inversion: low = good D
Method: freq/PPP pivot → percentile-rank across 30 teams → weighted scheme scoring
Quality tiers: Elite / Good / Average / Poor
Player archetype clusterer
Per-position K-Means: PG: Floor General, Scoring Guard; SG: Sharpshooter, Two-Way Wing; C: Rim Protector, Stretch 5; 5 positions clustered independently
Methodology: position-weighted features; StandardScaler → PCA (8D); K=3-6 via silhouette; K=4 bias when Δsil < 0.05
Labels: Hungarian algorithm matches centroids to z-score direction vector templates; optimal bipartite matching → no manual label assignment needed

PHASE 3: COMPOSITE VALUE SCORES
Composite value score engine: synergy + base + archetype fit
Solo: individual impact; w = 0.210; prior: 500 min
2-man: pair synergy; w = 0.196; prior: 30 poss
3-man: trio combos; w = 0.140; prior: 50 poss
4-man: quad combos; w = 0.091; prior: 75 poss
5-man: full lineup; w = 0.063; prior: 100 poss
Weight breakdown: synergy total 70%; base value 25%; archetype fit 5%; Bayesian shrinkage priors

PHASE 4: PREDICT & DISPLAY
Prediction engine: feature matrix from value scores; spread + total predictions; edge = predicted − market line
Backtester: train on season N-1; test on season N; spread/total correct %
Frontend dashboard: generate_frontend.py; single-file HTML; live odds; A/B/C grades; GitHub Pages; morellosims.com/nbasim; static; all data baked in

Stack: Python, sklearn, nba_api, SQLite, scipy

THE 4-PHASE ARCHITECTURE

The NBA SIM operates as a 4-phase CLI pipeline (python main.py [collect|analyze|scores|predict|all]). Each phase builds on the previous, with all data persisted to a 17-table SQLite database.

P1 Collect — Six collectors run in sequence: PlayerCollector pulls teams, rosters, and season stats from nba_api. GameCollector fetches game results. LineupCollector pulls 2-through-5-man lineup combinations with net rating and possession counts (with minimum possession thresholds: 30 for 5-man, 50 for 4-man, 75 for 3-man, 100 for 2-man). PlayTypeCollector calls SynergyPlayTypes for all 11 play types in both offensive and defensive groupings. BoxScoreCollector ingests per-game player stats with 27 columns (points, rebounds, assists, plus advanced metrics like usage rate, true shooting, offensive/defensive rating, PIE). OddsCollector pulls live spreads and totals from The Odds API across multiple bookmakers.
P2 Analyze — Two parallel analysis engines. The Coaching Scheme Classifier builds per-team offensive and defensive profiles by pivoting play type frequencies and PPP values, computing percentile ranks across all 30 teams, then scoring each team against scheme templates (PnR-Heavy, ISO-Heavy, Motion, Run-and-Gun, Spot-Up Heavy, Post-Oriented for offense; Switch-Everything, Drop-Coverage, Rim-Protect, Trans-Defense, Blitz for defense). The Player Archetype Clusterer runs K-Means independently for each of the 5 position groups (PG, SG, SF, PF, C) using position-weighted features, StandardScaler normalization, PCA reduction to 8 components, silhouette-optimized K selection (range 3-6 with a K=4 bias when silhouette delta < 0.05), and Hungarian algorithm label assignment that optimally matches cluster centroids to archetype profile templates defined as z-score direction vectors.
P3 Value Scores — The Composite Value Score for each player is a weighted blend of seven components. Solo impact (21% weight) measures individual on-court effect. 2-man synergy (19.6%), 3-man (14.0%), 4-man (9.1%), and 5-man synergy (6.3%) capture how well a player performs in specific lineup combinations, with Bayesian shrinkage priors that pull small-sample estimates toward league average (prior strengths: 500 minutes for solo, 30-100 possessions for multi-man). Base value (25%) covers raw per-36 production. Archetype fit (5%) rewards players whose on-court tendencies match their team's coaching scheme. The synergy portion (70% total, solo through 5-man) is the core innovation.
P4 Predict — A FeatureEngineer builds training matrices from the value scores and team-level features. A GamePredictor trains models for spread and total predictions. A ModelEvaluator backtests by training on season N-1 and evaluating on season N, measuring spread/total accuracy. The generate_frontend.py script produces a self-contained HTML dashboard that fetches live odds, computes consensus lines across bookmakers, grades matchup edges (A/B/C), and displays today's games with full scheme and archetype context.
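The P3 weighting can be checked with a few lines of arithmetic. The weights are the ones quoted above and in FIG. 03; the dictionary keys, the function name, and the assumption that every component is already on a common scale are illustrative.

```python
import math

# Weights as described: five synergy levels (70% total), base value, archetype fit.
WEIGHTS = {
    "solo": 0.210, "two_man": 0.196, "three_man": 0.140,
    "four_man": 0.091, "five_man": 0.063,
    "base": 0.25, "archetype_fit": 0.05,
}

def composite_value(components: dict) -> float:
    """Weighted blend of component scores (assumed to share one scale)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

synergy_total = sum(WEIGHTS[k] for k in
                    ("solo", "two_man", "three_man", "four_man", "five_man"))
all_ones = composite_value({name: 1.0 for name in WEIGHTS})
```

The five synergy weights add up to 0.70 and the full set adds up to 1.0, so a player who scores 1.0 on every component gets a composite of 1.0.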
FIG. 04: NBA SIM — DATABASE SCHEMA & DATA RELATIONSHIPS
[Figure: NBA SIM database schema, 17 tables (SQLite, nba_sim.db)]

REFERENCE TABLES
teams: team_id PK; abbreviation, name; conference, division
players: player_id PK; name, position; height, weight, age
roster_assignments: player+team+season PK; jersey_number; FK → teams, players

GAME DATA
games: game_id PK; date, home/away team; home/away score
player_game_stats: game+player PK; 27 cols: pts, ast, reb, USG%, TS%, ORTG, PIE
lineup_stats: lineup+season PK; 2-5 man combos; net rtg, possessions
lineup_players: lineup+season+player; junction table; FK → lineup_stats
betting_lines: game+book+mkt PK; price, point; retrieved_at timestamp

PLAY TYPES & SEASON STATS
team_playtypes: team+season+type PK; freq%, PPP, eFG%, TO%, score_freq
player_playtypes: player+season+type PK; Off/Def grouping; freq%, PPP, percentile
player_season_stats: player+season PK; 30 cols: per-game + per36; pts, ast, reb, TS%, USG%
team_season_stats: team+season PK; pace, off/def rtg; FG%, 3P%, FT%, rates

DERIVED & OUTPUT (ANALYSIS PRODUCTS)
coaching_profiles: team+season PK; off/def scheme labels; pace, top 3 playstyles
player_archetypes: player+season PK; archetype_label; confidence, feature vec
player_value_scores: player+season PK; composite_value float; solo + 2/3/4/5-man synergy
pair_synergy: player_a + player_b; net_rating, minutes; archetype pair labels
predictions: game+season PK; spread, total; edge, confidence

Flow: collect → analyze → scores → predict. Each phase reads/writes the same SQLite DB. Dashed = derived tables (analysis output).

KEY DESIGN DECISIONS

Percentile-Rank Scheme Classification: Instead of using raw play type frequencies, we rank each team's values against all 30 teams to compute percentile scores (0-1). This ensures meaningful differentiation regardless of season-level shifts in play style trends. A team running 18% isolation isn't inherently "ISO-Heavy" unless they're in the top percentile of the league.
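One way to compute a 0-1 percentile rank, sketched in plain Python. The exact rank definition (share of other teams with a strictly lower value) is an assumption, and the team codes and isolation frequencies are made up.

```python
def percentile_ranks(values_by_team: dict) -> dict:
    """Percentile in [0, 1]: fraction of the other teams with a lower value."""
    n = len(values_by_team)
    if n < 2:
        return {team: 0.0 for team in values_by_team}
    return {
        team: sum(other < value for other in values_by_team.values()) / (n - 1)
        for team, value in values_by_team.items()
    }

# Hypothetical isolation frequencies for four teams.
iso_freq = {"AAA": 0.18, "BBB": 0.09, "CCC": 0.12, "DDD": 0.05}
ranks = percentile_ranks(iso_freq)
```

The same 18% isolation rate would rank much lower in a season where most teams run 20% or more, which is the point of ranking instead of thresholding raw frequencies.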

Position-Weighted Clustering: Not all stats matter equally for every position. Centers are weighted toward blocks and rebounds; guards toward assists and three-point attempts. The POSITION_FEATURE_WEIGHTS dictionary applies multipliers before StandardScaler normalization, ensuring PCA captures position-relevant variance. The K=4 bias (accepting K=4 over K=3 when silhouette delta < 0.05) prevents oversimplification.
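The K=4 bias can be expressed as a small tie-break on top of the silhouette scores. This sketch assumes the scores per K are already computed; the function name and sample scores are hypothetical.

```python
def choose_k(silhouette_by_k: dict, preferred_k: int = 4,
             tolerance: float = 0.05) -> int:
    """Pick the best-silhouette K, but prefer K=4 over a smaller K on near-ties."""
    best_k = max(silhouette_by_k, key=silhouette_by_k.get)
    if preferred_k in silhouette_by_k and best_k < preferred_k:
        gap = silhouette_by_k[best_k] - silhouette_by_k[preferred_k]
        if gap < tolerance:
            return preferred_k
    return best_k

near_tie = choose_k({3: 0.31, 4: 0.28, 5: 0.22, 6: 0.20})   # gap 0.03: take 4
clear_win = choose_k({3: 0.40, 4: 0.28, 5: 0.22, 6: 0.20})  # gap 0.12: keep 3
```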

Hungarian Algorithm for Label Assignment: Each archetype label (e.g., "Floor General", "Rim Protector") is defined as a z-score direction vector. After clustering, we build a cost matrix scoring how well each cluster centroid matches each label template, then use the Hungarian algorithm for optimal bipartite matching. This guarantees the most appropriate label assignment without manual intervention.
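A sketch of the matching step using SciPy's `linear_sum_assignment`. Using negative cosine similarity as the cost is an assumption (the scoring function is not specified above), and the two templates and centroids are invented; centroids are assumed to be non-zero vectors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_labels(centroids, templates: dict) -> dict:
    """Match each cluster centroid to one label template, one-to-one."""
    labels = list(templates)
    t = np.array([templates[name] for name in labels], dtype=float)
    c = np.asarray(centroids, dtype=float)
    c_unit = c / np.linalg.norm(c, axis=1, keepdims=True)
    t_unit = t / np.linalg.norm(t, axis=1, keepdims=True)
    cost = -(c_unit @ t_unit.T)            # higher similarity = lower cost
    rows, cols = linear_sum_assignment(cost)
    return {int(r): labels[col] for r, col in zip(rows, cols)}

# Feature order (hypothetical): assists, three-point attempts, rebounds.
templates = {
    "Floor General": [1.0, 0.0, -0.5],
    "Scoring Guard": [-0.2, 1.0, 0.0],
}
centroids = [[-0.1, 1.2, 0.0],   # cluster 0: shooting-heavy
             [1.5, 0.1, -0.4]]   # cluster 1: assist-heavy
assignment = assign_labels(centroids, templates)
```

Because the assignment is solved globally, two clusters can never claim the same label, even if both happen to resemble it more than any other template.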

Bayesian Shrinkage in Synergy Scores: Small-sample lineup data is unreliable. A 5-man lineup with 35 possessions and +20 net rating shouldn't dominate a player's value. We apply Bayesian priors that shrink estimates toward league average, with prior strength proportional to data granularity (100 possessions for 5-man, 30 for 2-man). This balances signal extraction with noise reduction.
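The shrinkage can be written as a weighted average of the observed value and the league average. This form, and a league-average net rating of 0, are assumptions; the 35-possession, +20 example and the 100-possession prior come from the paragraph above.

```python
def shrink(observed: float, sample_size: float, prior_strength: float,
           league_avg: float = 0.0) -> float:
    """Pull an observed rate toward league average in proportion to the prior."""
    weighted = observed * sample_size + league_avg * prior_strength
    return weighted / (sample_size + prior_strength)

# A 5-man lineup at +20 net rating over 35 possessions, prior strength 100.
small_sample = shrink(20.0, 35, 100)
# The same +20 over 1,000 possessions keeps most of its value.
large_sample = shrink(20.0, 1000, 100)
```

With these inputs the 35-possession lineup is credited with roughly +5 instead of +20, while the 1,000-possession lineup stays above +18.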

DYNAMIC SCORE (DS)

Every player on the dashboard receives a Dynamic Score (40-99), a single-number composite rating that blends offensive and defensive production into one sortable metric. The formula weights offense at 75% and defense at 25%, reflecting the NBA's offensive-skewing landscape.

Offensive sub-score compounds scoring (pts × 1.2), playmaking (ast × 1.8), efficiency (TS% × 40), and usage (USG% × 15). Defensive sub-score combines stocks (STL × 8.0 + BLK × 6.0) with defensive rating impact (max(0, (115 − DRtg) × 2.5)). Both sub-scores are clamped 0-99. The final blend adds shared components: rebounding (reb × 0.8), net rating impact (NRtg × 0.8), and minutes load (mpg × 0.3), then clamps the result to the 40-99 range. This floor prevents garbage-time players from showing misleadingly low scores.
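One possible reading of the formula, as a sketch. Several details are assumptions: TS% and USG% are taken as fractions (0.60, 0.28), each sub-score is a plain sum of its terms, and the shared components are added after the 75/25 blend. The sample stat line is invented.

```python
def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))

def dynamic_score(pts, ast, reb, stl, blk, ts, usg, drtg, nrtg, mpg) -> float:
    """Dynamic Score sketch: 75% offense, 25% defense, plus shared terms."""
    offense = clamp(pts * 1.2 + ast * 1.8 + ts * 40 + usg * 15, 0, 99)
    defense = clamp(stl * 8.0 + blk * 6.0 + max(0.0, (115 - drtg) * 2.5), 0, 99)
    blended = 0.75 * offense + 0.25 * defense
    shared = reb * 0.8 + nrtg * 0.8 + mpg * 0.3
    return clamp(blended + shared, 40, 99)   # floor of 40, ceiling of 99

starter = dynamic_score(pts=25, ast=6, reb=5, stl=1.2, blk=0.5,
                        ts=0.60, usg=0.28, drtg=112, nrtg=4, mpg=34)
bench = dynamic_score(pts=0, ast=0, reb=0, stl=0, blk=0,
                      ts=0.0, usg=0.0, drtg=120, nrtg=-10, mpg=2)
```

Under these assumptions the hypothetical starter lands in the mid-70s, and the deep-bench line is lifted to the 40 floor rather than showing a negative raw total.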

DSI SPREAD MODEL

The Dynamic Score Index (DSI) is a team-level aggregate that drives the spread prediction engine. For each team, DSI sums the Dynamic Scores of all available starters and rotation players, adjusted for injury absences and usage decay. The spread is computed as a 50/50 blend of DSI-based power rating and adjusted net rating:

Spread = -((DSI_power × 0.50 + NRtg_power × 0.50) + HCA)

Where HCA (home-court advantage) = 3.0 points. The model also applies a 3.0-point back-to-back penalty for teams playing on consecutive days, and a usage decay factor (0.995 per 1% excess usage for offensive archetypes, 0.985 for defensive archetypes) that taxes players with unsustainably high usage rates. A stocks penalty (0.8 per lost stock) accounts for missing defensive playmakers. The DSI spread is then compared against the market consensus line to generate edge values.
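The spread formula translates directly into code. In this sketch the two power inputs are assumed to be home-minus-away differentials in points, a negative spread is assumed to mean the home team is favored, and edge is taken as predicted minus market line as in FIG. 03. The back-to-back, usage decay, and stocks adjustments are left out.

```python
HCA = 3.0  # home-court advantage, in points

def dsi_spread(dsi_power: float, nrtg_power: float, hca: float = HCA) -> float:
    """Spread = -((DSI_power * 0.50 + NRtg_power * 0.50) + HCA)."""
    return -((dsi_power * 0.50 + nrtg_power * 0.50) + hca)

def edge(predicted_spread: float, market_line: float) -> float:
    """Difference between the model's spread and the market consensus."""
    return predicted_spread - market_line

# Hypothetical matchup: home side +4.0 on DSI power and +2.0 on net rating.
predicted = dsi_spread(4.0, 2.0)
gap = edge(predicted, market_line=-4.5)
```

With these inputs the model makes the home team a 6-point favorite against a market line of 4.5, a 1.5-point disagreement.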

DAILY TRENDS ENGINE

The Trends tab on the dashboard surfaces two layers of daily-refreshing intelligence, both powered by an automated GitHub Actions pipeline that runs every morning at 8 AM PST.

Trending Players: Compares each player's PRA (Points + Rebounds + Assists) over the last 14 days against their PRA from the prior 14 days (28-day total window). Players must have at least 2 games with 15+ minutes in each window to qualify. The top 4 risers and top 4 fallers are surfaced with direction badges: Hot, Trending Up, Cooling Down, Trending Down, or Steady.
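The window comparison might look like the following sketch. It assumes the comparison is between per-game PRA averages, that window boundaries are inclusive at the recent end, and that games arrive as dictionaries with the hypothetical keys shown.

```python
from datetime import date, timedelta

def pra_trend(games, today, window_days=14, min_games=2, min_minutes=15):
    """Recent-window average PRA minus prior-window average, or None."""
    recent_start = today - timedelta(days=window_days)
    prior_start = today - timedelta(days=2 * window_days)
    recent, prior = [], []
    for g in games:
        if g["min"] < min_minutes:
            continue                       # short stints do not count
        pra = g["pts"] + g["reb"] + g["ast"]
        if recent_start < g["date"] <= today:
            recent.append(pra)
        elif prior_start < g["date"] <= recent_start:
            prior.append(pra)
    if len(recent) < min_games or len(prior) < min_games:
        return None                        # not enough qualifying games
    return sum(recent) / len(recent) - sum(prior) / len(prior)

today = date(2026, 5, 25)
games = [
    {"date": date(2026, 5, 5), "min": 30, "pts": 18, "reb": 6, "ast": 6},
    {"date": date(2026, 5, 8), "min": 32, "pts": 20, "reb": 8, "ast": 6},
    {"date": date(2026, 5, 20), "min": 35, "pts": 26, "reb": 8, "ast": 6},
    {"date": date(2026, 5, 22), "min": 36, "pts": 28, "reb": 9, "ast": 7},
]
trend = pra_trend(games, today)
```

A positive result marks a riser and a negative one a faller; sorting players by this value would give the top 4 in each direction.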

Hot & Cold Lineup Combos: Queries the lineup_stats table for 5-man, 3-man, and 2-man combinations that have played at least 5 games and 8 minutes together. Hot combos are ranked by highest net rating, with badges for elite performance: HEATING UP (net > +15, 10+ GP), ELITE FLOOR (net > +10), MORE MINUTES (15+ min, 15+ GP). Cold combos surface the worst-performing lineups with severity badges: DISASTERCLASS (net < −15), COOKED (net < −10), or FADE. Each combo card displays every player's Dynamic Score and archetype for full context.
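The badge thresholds can be sketched as two small functions. The order in which hot badges are checked, and returning no badge when none applies, are assumptions; the thresholds are the ones listed above.

```python
def hot_badge(net_rating: float, games: int, minutes: float):
    """Badge for a hot combo, checked from most to least selective."""
    if net_rating > 15 and games >= 10:
        return "HEATING UP"
    if net_rating > 10:
        return "ELITE FLOOR"
    if minutes >= 15 and games >= 15:
        return "MORE MINUTES"
    return None

def cold_badge(net_rating: float) -> str:
    """Severity badge for a cold combo."""
    if net_rating < -15:
        return "DISASTERCLASS"
    if net_rating < -10:
        return "COOKED"
    return "FADE"

badges = [
    hot_badge(18.0, games=12, minutes=9.0),
    hot_badge(12.0, games=6, minutes=9.0),
    cold_badge(-17.5),
    cold_badge(-6.0),
]
```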

The daily pipeline uses incremental boxscore collection (only missing games, not the full season), refreshes lineup stats from the NBA.com API with browser-spoofed headers to avoid rate limiting, regenerates the static HTML, and auto-syncs to the live dashboard.
