NBA + MLB
🏀
sports simulation × design × methodology

NBA SIM

DAILY MATCHUP SIMULATIONS × PLAYER PROPS × SCHEME ANALYSIS

LAL @ GSW
O/U 234.5
A
BOS @ MIA
BOS -4.5
B
PHX @ DAL
PHX ML
C
OPEN DASHBOARD →

MLB SIM

PITCHING MATCHUPS × BATTING PROPS × SITUATIONAL EDGES

NYY vs BOS
Cole (R)
EV+
LAD vs SDP
Glasnow (R)
N/A
ATL vs PHI
Strider (R)
HOT
OPEN DASHBOARD →

MLB ATLAS

3D PITCHER GALAXY × ARCHETYPE CLUSTERING × HITTER MATCHUPS

34 Archetypes
RHP + LHP
LIVE
5,983 Seasons
2015–2025
K-Means
2,696 Batters
vs Cluster
wOBA
DESKTOP RECOMMENDED
OPEN ATLAS →
DISPATCH_LOG
FEB 16, 2026

BUILDING THE MLB PITCHER ARCHETYPE ENGINE: DATA DESIGN & SYSTEM ARCHITECTURE

A deep dive into how we designed the 8-step data pipeline that transforms raw Statcast pitch-level telemetry into a 3D galaxy of pitcher archetypes. From SV reclassification to K-Means clustering, hitter-vs-cluster matchup matrices, and the Three.js cosmos visualization.

FIG. 01: MLB PIPELINE — END-TO-END DATA FLOW
MLB PITCHER ARCHETYPE PIPELINE (8 STEPS • PYTHON)
LAYER 1: INGESTION
STATCAST API: pybaseball • pitch-level data
BASEBALL SAVANT: 2015–2026 • ~700K pitches/yr
ROSTER DATA: Teams • Rosters • WBC
Output: statcast_{year}.parquet • ~150 MB/season • Snappy compressed
LAYER 2: FEATURE ENGINEERING
01 FETCH: Month-chunked pulls • Retry w/ backoff • 59 columns kept
02 ROLES: SP/RP classification • Games started ratio • Binary: is_sp
03 FEATURES: Pitch mix (10 types) • Velo, spin, whiff, arm • SV reclassification
ZONE LOC: Same/opp side splits • 9-quadrant entropy • 13 zone features
PITCHER-SEASON FEATURE VECTOR (14 DIMENSIONS): pct_FF, pct_SI, pct_FC, pct_SL, pct_CH, pct_CU, velo, spin, gb, whiff...
LAYER 3: CLUSTERING & CLASSIFICATION
04 K-MEANS CLUSTERING: RHP / LHP split independently • StandardScaler → Silhouette opt • Min K=8 • 3D PCA • X-offset ±5
05 ARCHETYPE NAMING: Geometric medoid (real pitcher) • Rule-based trait scoring • 17 names: Snake, Ghost, Barnburner
06-08 MATCHUP ANALYTICS: Hitter vs Cluster (wOBA, K%, BB%) • Hitter vs Pitcher (head-to-head) • Hitter Timing Archetypes
LAYER 4: FRONTEND DELIVERY
COSMOS ATLAS • Three.js
MLB SIM • React
Vite → GitHub Pages
Stack: Python • sklearn • pandas • Three.js • Parquet

THE 8-STEP PIPELINE

The MLB system is an 8-step sequential pipeline built in Python, orchestrated by run_all.py. Each step reads the previous step's output and writes its own artifacts. The entire pipeline can be resumed from any step with --from N.
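A resumable orchestrator of this kind could be sketched as follows. This is a minimal illustration, not the actual run_all.py: the step names and the `run_pipeline` helper are hypothetical, and each placeholder step only records that it ran.

```python
import argparse

# Hypothetical step registry: names mirror the eight steps, bodies are placeholders.
STEP_NAMES = [
    "fetch_statcast", "classify_roles", "engineer_features", "cluster_pitchers",
    "name_archetypes", "hitter_vs_cluster", "hitter_vs_pitcher", "hitter_timing",
]

def make_step(name, log):
    def step():
        # A real step would read the previous step's artifact and write its own.
        log.append(name)
    return step

def run_pipeline(argv, log):
    parser = argparse.ArgumentParser(description="Run the pipeline, optionally resuming.")
    # "from" is a Python keyword, so the parsed value is stored under dest="start".
    parser.add_argument("--from", dest="start", type=int, default=1,
                        choices=range(1, len(STEP_NAMES) + 1), metavar="N",
                        help="1-based step number to resume from")
    args = parser.parse_args(argv)
    steps = [make_step(name, log) for name in STEP_NAMES]
    for number, step in enumerate(steps, start=1):
        if number >= args.start:
            step()
    return log
```

Calling `run_pipeline(["--from", "6"], [])` would run only the three matchup steps, which matches the idea that steps 01-05 have already written their artifacts to disk.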

01 Fetch Statcast — Pulls pitch-level telemetry from Baseball Savant via pybaseball. Data is chunked by month (Mar-Oct) with retry logic and polite 2s delays. Each season yields ~700K pitches across 59 columns including release speed, spin rate, pitch movement (pfx_x/pfx_z), plate location, and batted ball outcomes. Saved as compressed Parquet files (~150MB/season).
02 Classify SP/RP Roles — Determines whether each pitcher-season is a Starter or Reliever based on games-started ratio. Produces a binary is_sp flag used downstream for role-aware archetype naming.
03 Feature Engineering — The heaviest step. Aggregates pitch-level data into pitcher-season feature vectors. Computes: pitch mix usage rates (10 types), SV reclassification (SV pitches mapped to CU/SL/ST per pitcher based on velocity and vertical break), spin rates, arm angle (derived from release point geometry), whiff rate, fastball velocity, groundball rate, zone rate, pitch movement vectors, and a 13-feature zone location layer with same-side/opposite-side splits, platoon shifts, and Shannon entropy of 9-quadrant distributions.
04 K-Means Clustering — Pitchers are split by handedness (RHP/LHP) and clustered independently. Features are standardized with StandardScaler, then K-Means is run for K=2-15 and the K with the best silhouette score is kept, subject to an enforced minimum of K=8 for meaningful granularity. Each hand produces ~8-12 clusters. A 3D PCA projection is computed for the Atlas galaxy view, with RHP offset +5 on the X axis and LHP offset -5 to create visual separation.
05 Archetype Naming — Each cluster's geometric medoid (the real pitcher minimizing sum of distances to all cluster members) is identified. A rule-based trait scorer examines the medoid's pitch mix, velocity, spin, and outcomes to assign one of 17 archetype names: Snake, Barnburner, Ghost, Earthworm, Swordfighter, Kitchen Sink, and more. Each archetype gets a consistent color and emoji for the frontend.
06 Hitter vs Cluster — Every pitch is tagged with its pitcher's cluster ID. Plate appearance outcomes are aggregated per batter × cluster × year × batter-side, producing wOBA, BA, SLG, K%, BB%, and whiff% for each matchup combination.
07 Hitter vs Pitcher — Direct head-to-head stats between individual batters and pitchers, providing granular matchup data beyond the cluster-level aggregations.
08 Hitter Timing Archetypes — Classifies hitters by their timing and approach patterns against different pitch types and velocities, adding another dimension to the matchup analysis.
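Step 01's month-chunked fetch with retries might look roughly like the sketch below. The function names, the retry count, and the exponential backoff schedule are assumptions for illustration. The `fetch` argument stands in for whatever performs the download (for example a wrapper around pybaseball's `statcast(start_dt, end_dt)`), so the sketch itself makes no network calls.

```python
import calendar
import time
from datetime import date

def month_chunks(year, first_month=3, last_month=10):
    """Yield (start, end) ISO date strings for each month, March through October."""
    for month in range(first_month, last_month + 1):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1).isoformat(), date(year, month, last_day).isoformat()

def fetch_with_retry(fetch, start, end, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(start, end); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fetch(start, end)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)

def fetch_season(fetch, year, polite_delay=2.0, sleep=time.sleep):
    """Fetch one season month by month, pausing between successful requests."""
    chunks = []
    for start, end in month_chunks(year):
        chunks.append(fetch_with_retry(fetch, start, end, sleep=sleep))
        sleep(polite_delay)
    return chunks
```

Passing `sleep` as a parameter keeps the delays out of tests while leaving the 2-second default in place for real runs.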
FIG. 02: COSMOS ATLAS — 3D VISUALIZATION ARCHITECTURE
COSMOS ATLAS: 3D PITCHER GALAXY (React + Three.js)
DATA FILES (JSON)
clusters.json: Archetype profiles + colors • Medoid PCA (x,y,z) positions • Emoji, velo, whiff, GB%
pitcher_seasons.json: Every pitcher-season 2015-26 • PCA x, y, z coordinates • Name, hand, cluster ID
hitter_vs_cluster.json: Batter vs archetype stats • wOBA, BA, SLG, K%, BB% • Min 10 PA threshold
batters.json: MLB batter directory • Name, ID, team, side • Autocomplete search
REACT APPLICATION (VITE)
App.jsx state: activeBatters[] • selectedStat • selectedYear • minPA • visibleClusters • showPitcherDots
RENDER COMPONENTS
GalaxyScene.jsx: Three.js WebGL Canvas • Pitcher dots at PCA (x,y,z) • Archetype nebula clusters (color-coded) • Batter matchup lines to each cluster
ControlBar.jsx: Year filter dropdown • Min PA slider • Stat selector (wOBA, K%...) • Cluster visibility toggles
StatPanel.jsx: Matchup stat overlay • Red → green gradient by selected stat • Up to 5 batters
BatterSearch.jsx: Autocomplete search • batters.json lookup • Add to activeBatters • Max 5 simultaneous
DELIVERY
cosmos.html: GitHub Pages • Static • No server • All data baked in

KEY DESIGN DECISIONS

SV Reclassification: Statcast's "SV" (slurve) classification is inconsistent across seasons. We built a per-pitcher mapping that examines career-average SV velocity and vertical break to reclassify each pitcher's SV as curveball (pfx_z < -0.50), slider (speed > 84 mph), or sweeper (everything else). This keeps pitch-mix features comparable from year to year, which stabilizes clustering across the 2015-2026 dataset.
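The per-pitcher mapping can be sketched as below. The thresholds come from the text above; the function names are hypothetical, and applying the vertical-break rule before the velocity rule is an assumption based on the order in which the rules are listed. `release_speed` and `pfx_z` are the Statcast column names for velocity and vertical movement.

```python
from collections import defaultdict
from statistics import mean

def build_sv_mapping(pitches, cu_break=-0.50, sl_velo=84.0):
    """Map each pitcher to the label their SV pitches should take.

    pitches: iterable of dicts with keys pitcher, pitch_type, release_speed, pfx_z.
    """
    by_pitcher = defaultdict(list)
    for p in pitches:
        if p["pitch_type"] == "SV":
            by_pitcher[p["pitcher"]].append(p)

    mapping = {}
    for pitcher, rows in by_pitcher.items():
        avg_velo = mean(r["release_speed"] for r in rows)
        avg_break = mean(r["pfx_z"] for r in rows)
        if avg_break < cu_break:      # big vertical drop: treat as curveball
            mapping[pitcher] = "CU"
        elif avg_velo > sl_velo:      # firm and flatter: treat as slider
            mapping[pitcher] = "SL"
        else:                         # everything else: sweeper
            mapping[pitcher] = "ST"
    return mapping

def reclassify(pitches, mapping):
    """Return new pitch records with every SV replaced by its pitcher's mapped label."""
    return [
        {**p, "pitch_type": mapping[p["pitcher"]]} if p["pitch_type"] == "SV" else p
        for p in pitches
    ]
```

Because the mapping is computed once per pitcher from career averages, all of a pitcher's SV pitches receive the same label in every season.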

Separate RHP/LHP Clustering: Rather than clustering all pitchers together, we split by handedness first. This prevents the dominant handedness signal from overwhelming the pitch-mix features. Each hand gets its own StandardScaler, K-Means model, and PCA projection. The X-axis offset (+5/-5) in PCA space creates the visual "galaxy" separation in the Atlas view.
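A minimal sketch of the per-hand procedure, using scikit-learn, is shown below. The function name, the seed, and `n_init` are assumptions, and the minimum K is enforced here by skipping smaller candidates, which may differ from how the pipeline applies it. The function would be called once for RHP with `x_offset=5.0` and once for LHP with `x_offset=-5.0`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_one_hand(features, x_offset, k_range=range(2, 16), min_k=8, seed=0):
    """Scale one hand's features, pick K by silhouette (never below min_k), project to 3D."""
    scaled = StandardScaler().fit_transform(features)

    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        if k < min_k or k >= len(scaled):
            continue  # below the enforced minimum, or too many clusters for the sample
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
        score = silhouette_score(scaled, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels

    # PCA output is centered at the origin, so the offset moves the whole galaxy.
    coords = PCA(n_components=3).fit_transform(scaled)
    coords[:, 0] += x_offset
    return best_k, best_labels, coords
```

Fitting the scaler and PCA inside the function is what gives each hand its own normalization and projection, as described above.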

Medoid over Centroid: Archetype representatives are chosen as the geometric medoid (the real pitcher that minimizes total distance to all cluster members), not the mathematical centroid. This means every archetype profile references an actual pitcher's stats, not a phantom average that no real pitcher matches.
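The difference between the two can be seen in a few lines. This is an illustrative sketch with hypothetical function names, using Euclidean distance on feature tuples (the distance metric is an assumption).

```python
from math import dist

def medoid_index(points):
    """Index of the member with the smallest total distance to all other members."""
    totals = [sum(dist(p, q) for q in points) for p in points]
    return min(range(len(points)), key=totals.__getitem__)

def centroid(points):
    """Coordinate-wise mean: usually not equal to any actual member."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))
```

For the toy cluster `[(0, 0), (1, 0), (10, 0)]`, the medoid is the second point, a real member, while the centroid sits at roughly (3.67, 0), which matches none of them.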

Zone Location Entropy: The 13-feature zone location layer captures not just where pitchers throw, but how predictable their patterns are. Shannon entropy across a 9-quadrant grid (3 lateral × 3 vertical) measures location unpredictability, and platoon shift features capture how much a pitcher adjusts against same-side vs opposite-side batters.
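As a sketch of the entropy feature: given the grid cell (0-8) of each pitch, Shannon entropy is computed from the share of pitches in each cell. The function name and the use of base-2 logarithms are assumptions; how pitches are binned into cells is not shown here.

```python
from collections import Counter
from math import log2

def zone_entropy(zones):
    """Shannon entropy in bits of a sequence of grid-cell ids (one id per pitch)."""
    counts = Counter(zones)
    total = sum(counts.values())
    # Cells with zero pitches contribute nothing, so only observed cells are summed.
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A pitcher who hits all nine cells equally often scores log2(9), about 3.17 bits, while one who throws every pitch to the same cell scores 0. Computing the value separately for same-side and opposite-side batters gives the split features.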

METHODOLOGY MLB ATLAS DATA ARCHITECTURE DEVLOG
FEB 16, 2026
🏀

NBA SIM: MULTI-LAYER PREDICTION ENGINE — FROM NBA API TO GAME PREDICTIONS

How we built a 4-phase pipeline that ingests NBA player tracking data, classifies coaching schemes via percentile-rank play type analysis, clusters players into position-specific archetypes using weighted K-Means, computes multi-level synergy scores, and generates spread/total predictions against live betting lines.

FIG. 03: NBA SIM — COMPLETE SYSTEM ARCHITECTURE
NBA SIM: MULTI-LAYER PREDICTION ENGINE (4 PHASES • PYTHON + SKLEARN)
PHASE 1: COLLECT (6 COLLECTORS)
nba_api: Teams, Rosters • Season Stats • Rate limited: 2s
GAME DATA: Scores, Box Scores • 27 stat columns/game • Per-game + advanced
LINEUPS: 2-man through 5-man • Net rating, possessions • Min poss thresholds
PLAY TYPES: SynergyPlayTypes API • 11 types × Off/Def • PPP, freq%, TO%, FG%
BOX SCORES: Player per-game • USG%, TS%, ORTG, PIE • 27 columns each
ODDS API: the-odds-api.com • Spreads + Totals • Multi-book consensus
Storage: SQLite • nba_sim.db • 17 TABLES (player_game_stats, lineup_stats (2-5 man), team/player_playtypes)
PHASE 2: ANALYZE (2 ENGINES)
COACHING SCHEME CLASSIFIER
OFFENSIVE: PnR-Heavy, ISO-Heavy, Motion, Run-and-Gun, Spot-Up, Post-Oriented + Pace (Fast/Mid/Slow)
DEFENSIVE: Switch-Everything, Drop-Coverage, Rim-Protect, Trans-Defense, Blitz • PPP inversion: low = good D
Method: freq/PPP pivot → percentile-rank across 30 teams → weighted scheme scoring
Quality tiers: Elite / Good / Average / Poor
PLAYER ARCHETYPE CLUSTERER
PER-POSITION K-MEANS: PG: Floor General, Scoring Guard • SG: Sharpshooter, Two-Way Wing • C: Rim Protector, Stretch 5 • 5 positions clustered independently
METHODOLOGY: Position-weighted features • StandardScaler → PCA (8D) • K=3-6 via silhouette • K=4 bias when Δsil < 0.05
Labels: Hungarian algorithm matches centroids to z-score direction vector templates • Optimal bipartite matching → no manual label assignment needed
PHASE 3: COMPOSITE VALUE SCORES
COMPOSITE VALUE SCORE ENGINE: SYNERGY + BASE + ARCHETYPE FIT
SOLO: Individual impact • w = 0.210 • Prior: 500 min
2-MAN: Pair synergy • w = 0.196 • Prior: 30 poss
3-MAN: Trio combos • w = 0.140 • Prior: 50 poss
4-MAN: Quad combos • w = 0.091 • Prior: 75 poss
5-MAN: Full lineup • w = 0.063 • Prior: 100 poss
WEIGHT BREAKDOWN: Synergy total: 70% • Base value: 25% • Archetype fit: 5% • Bayesian shrinkage priors
PHASE 4: PREDICT & DISPLAY
PREDICTION ENGINE: Feature matrix from value scores • Spread + Total predictions • Edge = predicted − market line
BACKTESTER: Train on season N-1 • Test on season N • Spread/total correct %
FRONTEND DASHBOARD: generate_frontend.py • Single-file HTML • Live odds • A/B/C grades • GitHub Pages
DELIVERY
nbasim.html: GitHub Pages • Static • All data baked in
Stack: Python • sklearn • nba_api • SQLite • scipy

THE 4-PHASE ARCHITECTURE

The NBA SIM operates as a 4-phase CLI pipeline (python main.py [collect|analyze|scores|predict|all]). Each phase builds on the previous, with all data persisted to a 17-table SQLite database.

P1 Collect — Six collectors run in sequence: PlayerCollector pulls teams, rosters, and season stats from nba_api. GameCollector fetches game results. LineupCollector pulls 2-through-5-man lineup combinations with net rating and possession counts (with minimum possession thresholds: 30 for 5-man, 50 for 4-man, 75 for 3-man, 100 for 2-man). PlayTypeCollector calls SynergyPlayTypes for all 11 play types in both offensive and defensive groupings. BoxScoreCollector ingests per-game player stats with 27 columns (points, rebounds, assists, plus advanced metrics like usage rate, true shooting, offensive/defensive rating, PIE). OddsCollector pulls live spreads and totals from The Odds API across multiple bookmakers.
P2 Analyze — Two parallel analysis engines. The Coaching Scheme Classifier builds per-team offensive and defensive profiles by pivoting play type frequencies and PPP values, computing percentile ranks across all 30 teams, then scoring each team against scheme templates (PnR-Heavy, ISO-Heavy, Motion, Run-and-Gun, Spot-Up Heavy, Post-Oriented for offense; Switch-Everything, Drop-Coverage, Rim-Protect, Trans-Defense, Blitz for defense). The Player Archetype Clusterer runs K-Means independently for each of the 5 position groups (PG, SG, SF, PF, C) using position-weighted features, StandardScaler normalization, PCA reduction to 8 components, silhouette-optimized K selection (range 3-6 with a K=4 bias when silhouette delta < 0.05), and Hungarian algorithm label assignment that optimally matches cluster centroids to archetype profile templates defined as z-score direction vectors.
P3 Value Scores — The Composite Value Score for each player is a weighted blend of seven components: five synergy levels (solo through 5-man), base value, and archetype fit. Solo impact (21% weight) measures individual on-court effect. 2-man (19.6%), 3-man (14%), 4-man (9.1%), and 5-man synergy (6.3%) capture how well a player performs in specific lineup combinations, with Bayesian shrinkage priors that pull small-sample estimates toward league average (prior strengths: 500 minutes for solo, 30-100 possessions for multi-man). Base value (25%) covers raw per-36 production. Archetype fit (5%) rewards players whose on-court tendencies match their team's coaching scheme. The synergy portion (70% total, solo included) is the core innovation.
P4 Predict — A FeatureEngineer builds training matrices from the value scores and team-level features. A GamePredictor trains models for spread and total predictions. A ModelEvaluator backtests by training on season N-1 and evaluating on season N, measuring spread/total accuracy. The generate_frontend.py script produces a self-contained HTML dashboard that fetches live odds, computes consensus lines across bookmakers, grades matchup edges (A/B/C), and displays today's games with full scheme and archetype context.
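The P3 weighting can be written out as a short sketch. The weights are the ones stated for the Composite Value Score; the component key names and the `composite_value` helper are hypothetical, and the sketch assumes every component has already been shrunk and normalized to a common scale.

```python
# Weights as stated for the Composite Value Score; key names are illustrative.
WEIGHTS = {
    "solo": 0.210,
    "two_man": 0.196,
    "three_man": 0.140,
    "four_man": 0.091,
    "five_man": 0.063,
    "base_value": 0.25,
    "archetype_fit": 0.05,
}
SYNERGY_KEYS = ("solo", "two_man", "three_man", "four_man", "five_man")

def composite_value(components):
    """Weighted sum of the seven components; raises if any component is missing."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in WEIGHTS.items())
```

The five synergy weights sum to 0.70 and all seven sum to 1.00, so a player who scores the same value on every component receives that value as the composite.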
FIG. 04: NBA SIM — DATABASE SCHEMA & DATA RELATIONSHIPS
DATABASE SCHEMA: 17 TABLES (SQLite • nba_sim.db)
REFERENCE TABLES
teams: team_id PK • abbreviation, name • conference, division
players: player_id PK • name, position • height, weight, age
roster_assignments: player+team+season PK • jersey_number • FK → teams, players
GAME DATA
games: game_id PK • date, home/away team • home/away score
player_game_stats: game+player PK • 27 cols: pts, ast, reb, USG%, TS%, ORTG, PIE
lineup_stats: lineup+season PK • 2-5 man combos • net rtg, possessions
lineup_players: lineup+season+player • Junction table • FK → lineup_stats
betting_lines: game+book+mkt PK • price, point • retrieved_at timestamp
PLAY TYPES & SEASON STATS
team_playtypes: team+season+type PK • freq%, PPP, eFG%, TO%, score_freq
player_playtypes: player+season+type PK • Off/Def grouping • freq%, PPP, percentile
player_season_stats: player+season PK • 30 cols: per-game + per36 • pts, ast, reb, TS%, USG%
team_season_stats: team+season PK • pace, off/def rtg • FG%, 3P%, FT%, rates
DERIVED & OUTPUT (ANALYSIS PRODUCTS)
coaching_profiles: team+season PK • off/def scheme labels • pace, top 3 playstyles
player_archetypes: player+season PK • archetype_label • confidence, feature vec
player_value_scores: player+season PK • composite_value float • solo + 2/3/4/5-man synergy
pair_synergy: player_a + player_b • net_rating, minutes • archetype pair labels
predictions: game+season PK • spread, total • edge, confidence
FLOW
collect → analyze → scores → predict • Each phase reads/writes the same SQLite DB • Dashed = derived tables (analysis output)

KEY DESIGN DECISIONS

Percentile-Rank Scheme Classification: Instead of using raw play type frequencies, we rank each team's values against all 30 teams to compute percentile scores (0-1). Because the scores are relative, they stay comparable when league-wide play style shifts from one season to the next. A team running 18% isolation isn't inherently "ISO-Heavy" unless that frequency ranks near the top of the league.
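One way to compute such a rank is sketched below. The function name and the tie handling (share of other teams with a strictly lower value) are assumptions. The `invert` flag illustrates the "low PPP allowed = good defense" inversion mentioned in the architecture figure.

```python
def percentile_ranks(values, invert=False):
    """Map team -> rank in [0, 1]: the share of other teams with a strictly lower value.

    values: dict of team -> raw metric (e.g. isolation frequency or PPP allowed).
    invert: set True for metrics where lower is better, such as defensive PPP.
    """
    n = len(values)
    if n < 2:
        return {team: 0.5 for team in values}  # a single team has no one to rank against
    ranks = {
        team: sum(other < v for other in values.values()) / (n - 1)
        for team, v in values.items()
    }
    if invert:
        ranks = {team: 1.0 - r for team, r in ranks.items()}
    return ranks
```

Multiplying every team's frequency by the same factor leaves the ranks unchanged, which is the property that makes the scheme labels robust to league-wide drift.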

Position-Weighted Clustering: Not all stats matter equally for every position. Centers are weighted toward blocks and rebounds; guards toward assists and three-point attempts. The POSITION_FEATURE_WEIGHTS dictionary applies multipliers before StandardScaler normalization, ensuring PCA captures position-relevant variance. The K=4 bias (accepting K=4 over K=3 when silhouette delta < 0.05) prevents oversimplification.
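The K=4 bias can be expressed as a small selection rule over precomputed silhouette scores. This is a sketch of one reading of that rule (prefer K=4 whenever a smaller K wins by less than 0.05); the function and parameter names are hypothetical.

```python
def choose_k(silhouette_by_k, preferred_k=4, tolerance=0.05):
    """Pick the K with the best silhouette, nudging up to preferred_k on near-ties.

    silhouette_by_k: dict of K -> silhouette score, e.g. for K in 3..6.
    """
    best_k = max(silhouette_by_k, key=silhouette_by_k.get)
    if best_k < preferred_k and preferred_k in silhouette_by_k:
        delta = silhouette_by_k[best_k] - silhouette_by_k[preferred_k]
        if delta < tolerance:
            return preferred_k  # K=3 barely wins: take the finer split instead
    return best_k
```

With scores `{3: 0.31, 4: 0.28, 5: 0.22, 6: 0.20}` the rule returns 4, because K=3 leads by only 0.03. If K=3 led by 0.12 it would be kept.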

Hungarian Algorithm for Label Assignment: Each archetype label (e.g., "Floor General", "Rim Protector") is defined as a z-score direction vector. After clustering, we build a cost matrix scoring how well each cluster centroid matches each label template, then use the Hungarian algorithm for optimal bipartite matching. The result is the one-to-one assignment with the best total match under that cost matrix, so no two clusters share a label and no manual labeling is needed.
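The matching problem itself can be illustrated with a brute-force stand-in. The sketch below tries every one-to-one assignment, which is only practical for the handful of clusters per position; it is not the Hungarian algorithm, but it returns the same optimum that a polynomial-time solver such as SciPy's `linear_sum_assignment` would. The dot-product score, the function names, and the two-feature templates are assumptions for illustration.

```python
from itertools import permutations

def match_score(centroid, template):
    """Dot product: high when the centroid's z-scores point the same way as the template."""
    return sum(c * t for c, t in zip(centroid, template))

def assign_labels(centroids, templates):
    """Return {cluster_index: label} with the highest total score, one label per cluster.

    centroids: list of z-score vectors, one per cluster.
    templates: dict of label -> direction vector (needs at least as many labels as clusters).
    """
    best_total, best_perm = float("-inf"), None
    for perm in permutations(templates, len(centroids)):
        total = sum(match_score(c, templates[name]) for c, name in zip(centroids, perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return dict(enumerate(best_perm))
```

In the example used in the test, both clusters score higher on one label than the other would if labeled greedily in isolation, yet the optimal matching still gives each cluster a distinct label by maximizing the combined score.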

Bayesian Shrinkage in Synergy Scores: Small-sample lineup data is unreliable. A 5-man lineup with 35 possessions and +20 net rating shouldn't dominate a player's value. We apply Bayesian priors that shrink estimates toward league average, with a prior strength that grows with lineup size (30 possessions for 2-man up to 100 for 5-man). The smaller the sample relative to the prior, the closer the estimate stays to league average.
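A common form of this shrinkage is a weighted average of the observed value and the league average, with the prior strength acting as a count of pseudo-possessions. The sketch below assumes that form; the function name is hypothetical and the league-average net rating is taken as 0. The prior strengths are the ones given for the multi-man levels.

```python
# Prior strengths in possessions, keyed by lineup size.
PRIOR_STRENGTH = {2: 30, 3: 50, 4: 75, 5: 100}

def shrink(observed, sample_size, prior_strength, league_average=0.0):
    """Pull an observed rating toward league average in proportion to the prior's weight."""
    total = sample_size + prior_strength
    return (sample_size * observed + prior_strength * league_average) / total
```

For the 5-man example above, `shrink(20.0, 35, PRIOR_STRENGTH[5])` gives 700 / 135, about +5.2, so the lineup still counts as positive but no longer dominates. The same +20 over 1,000 possessions would shrink only to about +18.2.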

METHODOLOGY NBA DATA ARCHITECTURE DEVLOG
FEB 16, 2026
🏀

NBA SIM TRACKING (1,000 $PP TO 25,000 $PP) PICKS & RATIONALE

Feb 19 slate — 7 picks sized by model confidence. Starting bankroll: 1,000 $PP. Target: 25,000 $PP. Game lines + player props with full rationale from the NBA SIM pipeline.

Post All-Star break opener. 11 games on the Feb 19 slate; the model flagged 3 game lines and 4 player props worth sizing. Unit sizing is based on confidence score, with 1U = 10 $PP: A-grade = 5U, B-grade = 3U, D-grade = 1U. Starting bankroll 1,000 $PP with a target of 25,000 $PP.

BANKROLL
1,000 $PP
TARGET
25,000 $PP
TOTAL RISKED
190 $PP

UNIT KEY: A (90-100) = 50 $PP  ·  B (60-89) = 30 $PP  ·  D (40-59) = 10 $PP
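The unit key amounts to a lookup from confidence score to grade and stake. The sketch below encodes the three bands as listed; the function name is hypothetical, and treating scores under 40 as "no bet" is an assumption, since the key does not cover them.

```python
# (minimum confidence, grade, stake in $PP), checked from the top band down.
UNIT_KEY = [(90, "A", 50), (60, "B", 30), (40, "D", 10)]

def size_pick(confidence):
    """Return (grade, stake) for a game-line confidence score on a 0-100 scale."""
    for floor, grade, stake in UNIT_KEY:
        if confidence >= floor:
            return grade, stake
    return None, 0  # assumption: below the lowest band, no position is taken
```

Applied to the three game lines on this slate (confidence 100, 65, and 46), the key yields stakes of 50, 30, and 10 $PP.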

▎ GAME LINES — 3 PICKS

BKN @ CLE — CLE -13.5

50 $PP 100 A

O/U 228.0. Brooklyn (DS #29, 15-38) at Cleveland (DS #3, 34-21). BKN Spot-Up Heavy w/ Drop-Coverage (Poor) vs CLE PnR-Heavy (Fast) w/ Drop-Coverage (Good). DS gap: CLE 372 vs BKN 256. Lineup data: CLE's Merrill/Tyson/Mitchell 3-man core is +29.2 NET RTG over 21 games — elite floor. Their Mobley/Allen/Mitchell 5-man is +25.6 NET RTG. Brooklyn has no trending combos that compete. Max unit.

PHX @ SAS — SAS -6.5

30 $PP 65 B

O/U 225.5. Phoenix (DS #19, 32-23) at San Antonio (DS #6, 37-16). SAS Trans-Defense (Elite) shuts down PHX's PnR-Heavy sets. Lineup data: Wembanyama's 2-man duos are +29.0 NET RTG over 32 games — the largest sample of any trending combo in the league. PHX has zero lineup combos tracking above +10. Team DS 354 vs 321.

ORL @ SAC — ORL -11.0

10 $PP 46 D

O/U 223.5. Orlando (DS #13, 27-23) vs Sacramento (DS #28, 12-44). SAC's Trans-Defense (Poor) vs ORL's Run-and-Gun. Lineup data: SAC has 3 DISASTERCLASS fade combos — Westbrook/Achiuwa/DeRozan/Murray 5-man is -30.7 NET RTG (8 GP), Achiuwa/DeRozan/Sabonis 3-man is -29.3 (11 GP), Achiuwa/Sabonis 2-man is -24.1 (12 GP). Sacramento bleeds points in every lineup combination that gets minutes. Min unit — large spread has juice risk.

▎ PLAYER PROPS — 4 PICKS (sort by: DS ranking)

N. JOKIĆ (DEN vs LAC) — OVER 28.5 PTS

30 $PP DS 99

🔮 Versatile Big · Avg 28.7 pts on 70% TS vs LAC's 114 DRTG. Highest DS in the league. LAC runs ISO-Heavy with Drop-Coverage (Avg) — Jokic's post game feasts on drop schemes. Proj: 28.7p / 10.5a / 11.8r. 📈 Trend: consistent production, floor is the line.

D. MITCHELL (CLE vs BKN) — OVER 27.2 PTS

30 $PP DS 89

⚡ Scoring Guard · Avg 29.0 pts on 62% TS vs BKN's 117 DRTG — worst defense in the league. Mitchell's scoring-guard archetype thrives in fast PnR vs poor drop coverage. Lineup data: Mitchell's combos with Merrill/Tyson (+29.2 NET, 21 GP) and with Mobley/Allen (+25.6 NET, 6 GP) are both elite — he's the engine. Stacks with CLE -13.5.

V. WEMBANYAMA (SAS vs PHX) — OVER 11.0 REB 📈

10 $PP DS 88

🏰 Rim Protector · Avg 11.1 reb, 29 mpg. TRENDING UP over last 5. Lineup data: Wemby's 2-man duos are +29.0 NET RTG over 32 games — biggest sample in the league's hot combos. PHX has no true center to contest boards — Williams (DS 59) is undersized. Wemby's rim-protector archetype dominates the glass against small-ball. Stacks with SAS -6.5. Min unit — rebound props are volatile.

K. LEONARD (LAC vs DEN) — OVER 28.0 PTS 🔥

30 $PP DS 84

🧠 Point Forward · Avg 27.9 pts on 62% TS vs DEN's 116 DRTG. HEATING UP over last 5 games. Kawhi's ISO archetype works against Denver's Rim-Protect (Avg) — pulls Jokic to the perimeter. Proj: 28.3p / 3.6a / 6.2r. Trend-based sizing.

Pick            Side         Conf     $PP    Result
BKN @ CLE       CLE -13.5    100 A    50
PHX @ SAS       SAS -6.5     65 B     30
ORL @ SAC       ORL -11.0    46 D     10
Jokić PTS       OVER 28.5    DS 99    30
Mitchell PTS    OVER 27.2    DS 89    30
Wemby REB       OVER 11.0    DS 88    10
Leonard PTS     OVER 28.0    DS 84    30
TOTAL RISKED                          190 $PP
BANKROLL
1,000
RISKED
190
PICKS
7
RECORD
STATUS
PENDING

All picks sourced from the NBA SIM pipeline — scheme detection, archetype clustering (K-Means on 16 features), Dynamic Score rankings, and lineup synergy. Lines via The Odds API. Full methodology at nbasim.

$PP TRACKING PICKS PLAYER PROPS FEB 19