13 picks across 12 settled. Auto-rendered from picks/nba.json.
Picks sourced from the NBA SIM pipeline. Lines via The Odds API. Full methodology at nbasim.
161 picks across 152 settled. Auto-rendered from picks/mlb.json.
Picks sourced from the MLB SIM pipeline. Lines via The Odds API. Full methodology at mlbsim.
Statcast telemetry → GMM clustering → 3D terrain city. How we classify 5,983 pitcher-seasons into 34 archetypes and generate hitter matchup projections.
The MLB system is an 8-step sequential pipeline built in Python, orchestrated by run_all.py. Each step reads the previous step's output and writes its own artifacts. The entire pipeline can be resumed from any step with --from N.
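A minimal sketch of how a --from N resume flag could be wired up with argparse. The step names and the helper functions here are hypothetical; the source only states that there are 8 sequential steps and that run_all.py accepts --from N.

```python
import argparse

# Hypothetical step names; the source only says there are 8 sequential steps.
STEPS = [f"step_{i}" for i in range(1, 9)]


def steps_to_run(start: int) -> list[str]:
    """Return the steps from `start` (1-based) through the final step."""
    if not 1 <= start <= len(STEPS):
        raise ValueError(f"--from must be between 1 and {len(STEPS)}")
    return STEPS[start - 1:]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Pipeline runner (sketch)")
    # "from" is a Python keyword, so the value is stored under dest="start".
    parser.add_argument("--from", dest="start", type=int, default=1,
                        help="1-based step number to resume from")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    for name in steps_to_run(args.start):
        print(f"running {name}")  # a real runner would call the step here


main(["--from", "6"])  # runs step_6, step_7, step_8
```

Because each step reads the previous step's artifacts from disk, resuming only needs the earlier outputs to already exist.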
pybaseball. Data is chunked by month (Mar-Oct) with retry logic and polite 2s delays. Each season yields ~700K pitches across 59 columns including release speed, spin rate, pitch movement (pfx_x/pfx_z), plate location, and batted ball outcomes. Saved as compressed Parquet files (~150MB/season).
is_sp flag used downstream for role-aware archetype naming.
GMM over K-Means: We use Gaussian Mixture Models instead of K-Means for clustering. GMM captures soft cluster boundaries and probabilistic membership—a pitcher can be 70% Ghost and 30% Swordfighter—which better reflects how real pitching styles blend. Model selection uses BIC (Bayesian Information Criterion) rather than silhouette score, with minimum K=8 enforced per hand.
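The BIC-based selection can be sketched in plain Python as follows. The function names are illustrative, and the parameter count assumes full covariance matrices, which the source does not specify. In practice scikit-learn's GaussianMixture exposes a bic method that returns the same quantity.

```python
import math


def gmm_param_count(k: int, d: int) -> int:
    """Free parameters of a full-covariance GMM with k components in d dims."""
    weights = k - 1                  # mixing weights sum to 1
    means = k * d
    covariances = k * d * (d + 1) // 2
    return weights + means + covariances


def bic(log_likelihood: float, k: int, d: int, n: int) -> float:
    """Bayesian Information Criterion; lower is better."""
    return gmm_param_count(k, d) * math.log(n) - 2.0 * log_likelihood


def select_k(bic_by_k: dict, min_k: int = 8) -> int:
    """Pick the K with the lowest BIC among candidates with K >= min_k."""
    candidates = {k: v for k, v in bic_by_k.items() if k >= min_k}
    if not candidates:
        raise ValueError("no candidate K meets the minimum")
    return min(candidates, key=candidates.get)
```

With the floor in place, a smaller K is never chosen even if its BIC is lower, which matches the "minimum K=8 enforced per hand" rule.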
SV Reclassification: Statcast's “SV” (slurve) classification is inconsistent across seasons. We built a per-pitcher mapping that examines career-average SV velocity and vertical break to reclassify each pitcher's SV as curveball (pfx_z < −0.50), slider (speed > 84 mph), or sweeper (everything else). This ensures clustering stability across the 2015–2026 dataset.
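The reclassification rule reads as a three-way decision. A minimal sketch follows, assuming the checks are applied in the order listed (vertical break first, then velocity); the function name is hypothetical.

```python
def reclassify_sv(avg_speed_mph: float, avg_pfx_z: float) -> str:
    """Map a pitcher's career-average SV profile to a stable pitch label."""
    if avg_pfx_z < -0.50:      # heavy downward break
        return "curveball"
    if avg_speed_mph > 84:     # firm velocity
        return "slider"
    return "sweeper"           # everything else
```

For example, reclassify_sv(86.0, 0.1) gives "slider", while reclassify_sv(81.0, 0.2) falls through to "sweeper".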
Separate RHP/LHP Clustering: Rather than clustering all pitchers together, we split by handedness first. This prevents the dominant handedness signal from overwhelming the pitch-mix features. Each hand gets its own StandardScaler, GMM model, and PCA projection. The X-axis offset (+5/−5) in PCA space creates the visual highway divider in the city view.
Medoid over Centroid: Archetype representatives are chosen as the geometric medoid (the real pitcher that minimizes total distance to all cluster members), not the mathematical centroid. This means every archetype profile references an actual pitcher's stats, not a phantom average that no real pitcher matches.
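A small sketch of medoid selection, assuming Euclidean distance in the feature space (the source does not name the metric). The function returns the index of a real member, never a synthetic average.

```python
import math


def medoid_index(points):
    """Index of the point with the smallest total distance to all others."""
    def total_distance(i):
        return sum(math.dist(points[i], q) for q in points)
    return min(range(len(points)), key=total_distance)
```

For the points (0, 0), (1, 0), (10, 0), the centroid sits at roughly (3.67, 0), which matches no member, while the medoid is the actual point (1, 0).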
Zone Location Entropy: The 13-feature zone location layer captures not just where pitchers throw, but how predictable their patterns are. Shannon entropy across a 9-cell grid (3 lateral × 3 vertical) measures location unpredictability, and platoon shift features capture how much a pitcher adjusts against same-side vs. opposite-side batters.
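Shannon entropy over the grid can be sketched as below. It assumes each pitch has already been binned to a (column, row) cell and that entropy is measured in bits; the binning edges and the log base are not given in the source.

```python
import math
from collections import Counter


def zone_entropy(cells) -> float:
    """Shannon entropy (bits) of pitch locations over grid cells."""
    counts = Counter(cells)
    n = len(cells)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A pitcher who hits all nine cells equally often scores log2(9), about 3.17 bits, while one who always throws to the same cell scores 0.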
LA-Style Terrain Heightfield: The flat ground plane is replaced with a 128×128 subdivided mesh whose vertices are displaced by a Gaussian kernel density function. Each of the 34 cluster centroids emits a Gaussian “hill” with amplitude proportional to log(pitcher_count). Dense clusters like RHP Ghost (542 pitchers) form prominent hilltops; sparse ones like Knuckleball Wizard sit in valleys. The result is an organic, LA-style rolling topography.
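The height function behind the mesh displacement can be sketched in Python (the renderer itself is not shown in the source). The kernel width sigma is an assumed parameter, and the cluster tuples are illustrative.

```python
import math


def terrain_height(x, y, clusters, sigma=1.0):
    """Sum of Gaussian hills; clusters is a list of (cx, cy, pitcher_count)."""
    height = 0.0
    for cx, cy, count in clusters:
        amplitude = math.log(count)           # amplitude ~ log(pitcher_count)
        dist_sq = (x - cx) ** 2 + (y - cy) ** 2
        height += amplitude * math.exp(-dist_sq / (2.0 * sigma ** 2))
    return height
```

Evaluating this at every vertex of the 128×128 grid yields the heightfield. Note that under this formula a single-pitcher cluster has log(1) = 0 amplitude, so it contributes no hill at all.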
Voronoi Neighborhood Tessellation: Each archetype occupies a Voronoi cell computed via half-plane bisector clipping (Sutherland-Hodgman). These polygons are projected onto the terrain surface as ShapeGeometry meshes, with vertices displaced to follow the heightfield. A 3% inset creates natural “street” gaps between districts.
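A sketch of the half-plane clipping idea in Python: start from a bounding polygon and clip it against the perpendicular bisector between the site and each other site, one Sutherland-Hodgman pass per bisector. Function names and the bounding polygon are assumptions; the inset and terrain projection steps are omitted.

```python
def clip_half_plane(poly, a, b, c):
    """Keep the part of a convex polygon where a*x + b*y <= c."""
    out = []
    n = len(poly)
    for i in range(n):
        p, q = poly[i], poly[(i + 1) % n]
        dp = a * p[0] + b * p[1] - c
        dq = a * q[0] + b * q[1] - c
        if dp <= 0:
            out.append(p)                      # p is on the kept side
        if (dp < 0 < dq) or (dq < 0 < dp):     # edge crosses the boundary
            t = dp / (dp - dq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def voronoi_cell(site, others, bounds):
    """Clip `bounds` down to the region closer to `site` than to any other."""
    sx, sy = site
    cell = list(bounds)
    for ox, oy in others:
        # |x - s|^2 <= |x - o|^2  is equivalent to  2(o - s).x <= |o|^2 - |s|^2
        a, b = 2 * (ox - sx), 2 * (oy - sy)
        c = ox * ox + oy * oy - sx * sx - sy * sy
        cell = clip_half_plane(cell, a, b, c)
    return cell
```

With two sites at (1, 2) and (3, 2) inside a 4×4 square, the first site's cell is the left half of the square, split at x = 2.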
Buildings Confined to Voronoi Zones: Rather than placing buildings at individual pitcher PCA coordinates (which causes cross-cluster color mixing), buildings are packed into a grid of slots inside each Voronoi polygon using point-in-polygon tests. Slots fill center-outward, pitchers distribute round-robin. Each building's height maps to average fastball velocity; bases sit on the terrain surface. This guarantees clean, single-color districts.
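The slot-packing step can be sketched as a ray-casting point-in-polygon test plus a grid scan. The slot spacing and the use of the vertex average as the "center" are assumptions for illustration.

```python
import math


def point_in_polygon(x, y, poly):
    """Ray casting: count crossings of a ray heading right from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):               # edge straddles the ray height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def building_slots(poly, spacing):
    """Grid slot centers inside poly, ordered nearest-to-center first."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)   # rough polygon center
    nx = int((max(xs) - min(xs)) / spacing)
    ny = int((max(ys) - min(ys)) / spacing)
    slots = []
    for i in range(nx):
        for j in range(ny):
            px = min(xs) + (i + 0.5) * spacing
            py = min(ys) + (j + 0.5) * spacing
            if point_in_polygon(px, py, poly):
                slots.append((px, py))
    slots.sort(key=lambda s: math.dist(s, (cx, cy)))
    return slots
```

Assigning pitchers to slots in this order fills each district from the middle outward, so no building can land in a neighboring cell.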
Terrain-Following Highway: The RHP/LHP divider at X=0 is built as discrete segments that follow the terrain contour, with emissive orange dashes for the center line. All scene elements—neighborhoods, buildings, labels, batter route arcs—are terrain-aware via bilinear-interpolated height queries.
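A bilinear height query can be sketched as follows, assuming the heightfield is stored as a row-major grid sampled at integer coordinates and that the query point lies within the grid.

```python
def bilinear_height(grid, x, y):
    """Interpolate grid[row][col] heights at fractional (x=col, y=row)."""
    x0 = min(int(x), len(grid[0]) - 2)   # clamp so x0 + 1 stays in range
    y0 = min(int(y), len(grid) - 2)
    tx, ty = x - x0, y - y0
    top = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
    bottom = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
    return top * (1 - ty) + bottom * ty
```

Any object that needs to sit on the surface (a building base, a label, a highway segment) can call this with its own coordinates to get a matching height.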
Scheme detection → player archetypes → lineup synergy → DSI spread model. The full pipeline from NBA API to game predictions.
The NBA SIM operates as a 4-phase CLI pipeline (python main.py [collect|analyze|scores|predict|all]). Each phase builds on the previous, with all data persisted to a 17-table SQLite database.
PlayerCollector pulls teams, rosters, and season stats from nba_api. GameCollector fetches game results. LineupCollector pulls 2-through-5-man lineup combinations with net rating and possession counts (with minimum possession thresholds: 30 for 5-man, 50 for 4-man, 75 for 3-man, 100 for 2-man). PlayTypeCollector calls SynergyPlayTypes for all 11 play types in both offensive and defensive groupings. BoxScoreCollector ingests per-game player stats with 27 columns (points, rebounds, assists, plus advanced metrics like usage rate, true shooting, offensive/defensive rating, PIE). OddsCollector pulls live spreads and totals from The Odds API across multiple bookmakers.
FeatureEngineer builds training matrices from the value scores and team-level features. A GamePredictor trains models for spread and total predictions. A ModelEvaluator backtests by training on season N-1 and evaluating on season N, measuring spread/total accuracy. The generate_frontend.py script produces a self-contained HTML dashboard that fetches live odds, computes consensus lines across bookmakers, grades matchup edges (A/B/C), and displays today's games with full scheme and archetype context.
Percentile-Rank Scheme Classification: Instead of using raw play type frequencies, we rank each team's values against all 30 teams to compute percentile scores (0-1). This ensures meaningful differentiation regardless of season-level shifts in play style trends. A team running 18% isolation isn't inherently "ISO-Heavy" unless they're in the top percentile of the league.
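One way to compute a 0-1 percentile score is the fraction of other teams with a strictly lower value. This exact definition is an assumption; the source only says values are ranked against all 30 teams.

```python
def percentile_ranks(values: dict) -> dict:
    """Map each team to the share of other teams with a strictly lower value."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two teams to rank")
    return {
        team: sum(other < val for other in values.values()) / (n - 1)
        for team, val in values.items()
    }
```

Under this definition the league leader scores 1.0 and the last-place team 0.0, regardless of whether the raw frequencies are high or low that season.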
Position-Weighted Clustering: Not all stats matter equally for every position. Centers are weighted toward blocks and rebounds; guards toward assists and three-point attempts. The POSITION_FEATURE_WEIGHTS dictionary applies multipliers before StandardScaler normalization, ensuring PCA captures position-relevant variance. The K=4 bias (accepting K=4 over K=3 when silhouette delta < 0.05) prevents oversimplification.
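The K=4 bias can be sketched as a tie-break on silhouette scores. This reads "silhouette delta" as silhouette(K=3) minus silhouette(K=4), which is an interpretation rather than something the source spells out.

```python
def choose_k(silhouette_by_k: dict, tolerance: float = 0.05) -> int:
    """Pick the best-scoring K, but prefer K=4 over K=3 on near-ties."""
    best = max(silhouette_by_k, key=silhouette_by_k.get)
    if best == 3 and 4 in silhouette_by_k:
        if silhouette_by_k[3] - silhouette_by_k[4] < tolerance:
            return 4
    return best
```

So scores of 0.40 for K=3 and 0.37 for K=4 resolve to K=4, while 0.40 versus 0.30 stays at K=3.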
Hungarian Algorithm for Label Assignment: Each archetype label (e.g., "Floor General", "Rim Protector") is defined as a z-score direction vector. After clustering, we build a cost matrix scoring how well each cluster centroid matches each label template, then use the Hungarian algorithm for optimal bipartite matching. This guarantees the most appropriate label assignment without manual intervention.
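The matching step can be illustrated with a small sketch. To stay dependency-free it swaps in a brute-force search over permutations, which finds the same optimum as the Hungarian algorithm for a handful of clusters; SciPy's linear_sum_assignment is the usual choice at scale. The cost definition (negative dot product between centroid z-scores and the label's direction vector) is an assumption.

```python
from itertools import permutations


def cost_matrix(centroids, templates):
    """cost[i][j] is low when centroid i points along label template j."""
    return [
        [-sum(c * t for c, t in zip(centroid, template))
         for template in templates]
        for centroid in centroids
    ]


def assign_labels(cost):
    """Return p where cluster i gets label p[i], minimizing total cost."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)
```

The point of optimal matching over a greedy pass shows up with a cost matrix like [[1, 2], [1, 9]]: greedy gives cluster 0 its cheapest label and forces cluster 1 into a cost of 9, while the optimal assignment is [1, 0] with a total of 3.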
Bayesian Shrinkage in Synergy Scores: Small-sample lineup data is unreliable. A 5-man lineup with 35 possessions and +20 net rating shouldn't dominate a player's value. We apply Bayesian priors that shrink estimates toward league average, with prior strength set per lineup size (100 possessions for 5-man, 30 for 2-man). This balances signal extraction with noise reduction.
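A common way to express this kind of shrinkage is a pseudo-count prior, sketched below. The exact formula is an assumption; only the 5-man and 2-man prior strengths are given in the source, so the other lineup sizes are left out.

```python
# Prior strength in possessions; only these two values appear in the source.
PRIOR_POSSESSIONS = {5: 100, 2: 30}


def shrunk_net_rating(observed, possessions, lineup_size, league_avg=0.0):
    """Blend the observed net rating with the league average by sample size."""
    prior = PRIOR_POSSESSIONS[lineup_size]
    return (possessions * observed + prior * league_avg) / (possessions + prior)
```

For the example in the text, a 5-man lineup at +20 over 35 possessions shrinks to 35 × 20 / 135, roughly +5.2, so it no longer dominates the estimate.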
Every player on the dashboard receives a Dynamic Score (40-99), a single-number composite rating that blends offensive and defensive production into one sortable metric. The formula weights offense at 75% and defense at 25%, reflecting how offense-heavy the modern NBA is.
Offensive sub-score combines scoring (pts × 1.2), playmaking (ast × 1.8), efficiency (TS% × 40), and usage (USG% × 15). Defensive sub-score combines stocks (STL × 8.0 + BLK × 6.0) with defensive rating impact (max(0, (115 − DRtg) × 2.5)). Both sub-scores are clamped to 0-99. The final blend adds shared components: rebounding (reb × 0.8), net rating impact (NRtg × 0.8), and minutes load (mpg × 0.3), then clamps the result to the 40-99 range. This floor prevents garbage-time players from showing misleadingly low scores.
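The formula can be written out as a short function. This sketch assumes TS% and USG% are passed as fractions (0.60, not 60) and that the 75/25 blend is applied before the shared components are added; neither detail is stated explicitly.

```python
def clamp(value, low, high):
    return max(low, min(high, value))


def dynamic_score(pts, ast, ts, usg, stl, blk, drtg, reb, nrtg, mpg):
    """Composite 40-99 player rating; ts and usg are fractions (assumed)."""
    offense = clamp(pts * 1.2 + ast * 1.8 + ts * 40 + usg * 15, 0, 99)
    defense = clamp(stl * 8.0 + blk * 6.0 + max(0.0, (115 - drtg) * 2.5), 0, 99)
    blended = 0.75 * offense + 0.25 * defense
    shared = reb * 0.8 + nrtg * 0.8 + mpg * 0.3
    return clamp(blended + shared, 40, 99)
```

A player averaging 25 points, 6 assists, 5 rebounds, 1.2 steals and 0.5 blocks in 34 minutes, with 60% true shooting, 28% usage, a 110 defensive rating and a +4 net rating, comes out at about 75.4 under these assumptions.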
The Dynamic Score Index (DSI) is a team-level aggregate that drives the spread prediction engine. For each team, DSI sums the Dynamic Scores of all available starters and rotation players, adjusted for injury absences and usage decay. The spread is computed as a 50/50 blend of DSI-based power rating and adjusted net rating:
Spread = -((DSI_power × 0.50 + NRtg_power × 0.50) + HCA)
Where HCA (home court advantage) = 3.0 points. The model also applies a 3.0 point back-to-back penalty for teams playing consecutive days, and a usage decay factor (0.995 per 1% excess usage for offensive archetypes, 0.985 for defensive archetypes) that taxes players with unsustainably high usage rates. A stocks penalty (0.8 per lost stock) accounts for missing defensive playmakers. The DSI spread is then compared against the market consensus line to generate edge values.
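The spread formula and its adjustments can be sketched as below. Several details are assumptions: the power terms are treated as home-minus-away differentials, the back-to-back penalty is applied to the projected margin, usage decay compounds per percentage point, and the edge is taken as market line minus model line.

```python
HCA = 3.0            # home court advantage, in points
B2B_PENALTY = 3.0    # back-to-back penalty, in points
USAGE_DECAY = {"offensive": 0.995, "defensive": 0.985}


def usage_decay(excess_usage_pct, archetype_kind):
    """Multiplier applied to a player's score for each 1% of excess usage."""
    return USAGE_DECAY[archetype_kind] ** excess_usage_pct


def model_spread(dsi_power, nrtg_power, home_b2b=False, away_b2b=False):
    """Home-team spread; negative means the home team is favored."""
    margin = dsi_power * 0.50 + nrtg_power * 0.50 + HCA
    if home_b2b:
        margin -= B2B_PENALTY
    if away_b2b:
        margin += B2B_PENALTY
    return -margin


def edge(market_spread, spread):
    """Points of disagreement between the market line and the model."""
    return market_spread - spread
```

With a DSI differential of 4.0 and a net rating differential of 2.0, the model line is -6.0; against a market consensus of -4.5 that is a 1.5-point edge toward the home side.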
The Trends tab on the dashboard surfaces two layers of daily-refreshing intelligence, both powered by an automated GitHub Actions pipeline that runs every morning at 8 AM PST.
Trending Players: Compares each player's PRA (Points + Rebounds + Assists) over the last 14 days against their PRA from the prior 14 days (28-day total window). Players must have at least 2 games with 15+ minutes in each window to qualify. The top 4 risers and top 4 fallers are surfaced with direction badges: Hot, Trending Up, Cooling Down, Trending Down, or Steady.
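The window comparison can be sketched as follows, assuming PRA is compared as a per-game average over qualifying games (the source does not say whether averages or totals are used). The thresholds that map the change to a badge are not given, so they are left out.

```python
def average_pra(games, min_minutes=15):
    """games: list of (minutes, pts, reb, ast). Returns (avg PRA, game count)."""
    qualifying = [p + r + a for m, p, r, a in games if m >= min_minutes]
    if not qualifying:
        return 0.0, 0
    return sum(qualifying) / len(qualifying), len(qualifying)


def pra_trend(recent_games, prior_games, min_games=2):
    """Change in average PRA between windows, or None if not enough games."""
    recent_avg, recent_n = average_pra(recent_games)
    prior_avg, prior_n = average_pra(prior_games)
    if recent_n < min_games or prior_n < min_games:
        return None
    return recent_avg - prior_avg
```

Sorting qualified players by this value and taking both ends of the list gives the risers and fallers.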
Hot & Cold Lineup Combos: Queries the lineup_stats table for 5-man, 3-man, and 2-man combinations that have played at least 5 games and 8 minutes together. Hot combos are ranked by highest net rating, with badges for elite performance: HEATING UP (net > +15, 10+ GP), ELITE FLOOR (net > +10), MORE MINUTES (15+ min, 15+ GP). Cold combos surface the worst-performing lineups with severity badges: DISASTERCLASS (net < −15), COOKED (net < −10), or FADE. Each combo card displays every player's Dynamic Score and archetype for full context.
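The badge rules can be sketched as simple threshold checks. This assumes the rules are evaluated in the order listed, so the first match wins, and that a hot combo matching none of them gets no badge.

```python
def hot_badge(net_rating, games_played, minutes):
    """Badge for a hot combo, or None when no rule applies."""
    if net_rating > 15 and games_played >= 10:
        return "HEATING UP"
    if net_rating > 10:
        return "ELITE FLOOR"
    if minutes >= 15 and games_played >= 15:
        return "MORE MINUTES"
    return None


def cold_badge(net_rating):
    """Severity badge for a cold combo."""
    if net_rating < -15:
        return "DISASTERCLASS"
    if net_rating < -10:
        return "COOKED"
    return "FADE"
```

A lineup at +18 over only 6 games misses the games-played bar for HEATING UP and falls through to ELITE FLOOR.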
The daily pipeline uses incremental boxscore collection (only missing games, not the full season), refreshes lineup stats from the NBA.com API with browser-spoofed headers to avoid rate limiting, regenerates the static HTML, and auto-syncs to the live dashboard.