The Numbers
Reality check
What the bot claims vs. what's actually happening.
| Metric | Value | Notes |
| --- | --- | --- |
| Claimed win probability | 70%+ | `MIN_WIN_PROBABILITY` |
| Actual win rate | 55.6% | 10W / 8L settled |
| Trades in last 8 days | 0 | SD filter blocking all |
| Unsettled trades | 6 | $9.69 at risk, 9 days old |
| Unit tests | 0 | On code that spends money |
| Days running on 2 machines | 10 | Dual-instance, same API key |
| Net P&L (local DB) | +$3.71 | May not match Kalshi |
| Settlement crons running | 0 | pm2 not installed, no timer |
Priority Actions
What needs fixing
No automated settlement cron exists — the pm2 config references it but pm2 isn't installed, and systemd has no timer for it. The updateSettlement function has a SQL bind-parameter mismatch (6 values passed to 7 placeholders) meaning it silently fails or corrupts data when called. 6 trades from March 11-12 remain unsettled with $9.69 at risk.
Found by: Risk Analyst, Engineer
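The bind-parameter mismatch is the kind of bug a cheap runtime guard catches before anything touches the database. A minimal sketch; the function name and SQL shape are illustrative, not the bot's actual code:

```javascript
// Hypothetical guard: count the "?" placeholders in a statement and refuse
// to run it if the bind array doesn't match. Cheap insurance against
// silent misbinding.
function assertBindCount(sql, values) {
  const placeholders = (sql.match(/\?/g) ?? []).length;
  if (placeholders !== values.length) {
    throw new Error(
      `bind mismatch: ${placeholders} placeholders, ${values.length} values`
    );
  }
  return true;
}

// The buggy shape from the review: 7 placeholders, 6 bound values.
const updateSql =
  "UPDATE trades SET settled = ?, result = ?, payout = ?, pnl = ?, " +
  "settled_at = ?, settle_price = ? WHERE id = ?";
```

Calling `assertBindCount(updateSql, sixValues)` throws immediately instead of letting the statement fail or misbind silently.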
The bot ran on two machines simultaneously for 10 days (Cheesegrater via systemd + iMac via PM2), each with separate trades.db files, same API key. Every trade may have been placed twice. The local P&L of +$3.71 is likely wrong. Must pull actual order history from Kalshi API and reconcile.
Found by: Risk Analyst
MAX_ENSEMBLE_SD: 3.0 blocks everything. Actual SD across stations is ~7°F, which is normal spring weather uncertainty. The bot is effectively turned off. Raise to 5-6°F, make it per-city, or replace the binary cutoff with a sliding confidence scale.
Found by: Quant Trader, Meteorologist
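One way the binary cutoff could be replaced, sketched with assumed knee points (the 3°F and 8°F values below are illustrative, not tuned):

```javascript
// Sliding confidence scale instead of a hard SD cutoff: full weight below
// fullConfSd, zero weight above zeroConfSd, linear ramp in between.
// Knee points are illustrative assumptions.
function sdConfidence(sdF, fullConfSd = 3.0, zeroConfSd = 8.0) {
  if (sdF <= fullConfSd) return 1.0;
  if (sdF >= zeroConfSd) return 0.0;
  return (zeroConfSd - sdF) / (zeroConfSd - fullConfSd);
}

// Shrink the win probability toward 50/50 as ensemble spread grows,
// rather than rejecting the trade outright.
function adjustedWinProb(rawProb, sdF) {
  return 0.5 + (rawProb - 0.5) * sdConfidence(sdF);
}
```

A high-spread day then produces a weaker signal that usually fails the edge threshold on its own, instead of the bot being switched off entirely.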
Raw ensemble member counting is known to be overconfident. The bot requires a 70% win probability but achieves 55.6%, a gap of roughly 14 points. Every operational weather center applies post-processing (MOS, EMOS, Platt scaling) to correct this. Adding the NBM (National Blend of Models) as a model source and calibrating against backtest data would be the highest-ROI improvement.
Found by: Meteorologist, Quant Trader
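Once coefficients are fitted, Platt scaling is a one-line transform. A sketch; `A` and `B` here are placeholders that would come from fitting against the bot's settled-trade history, not real calibrated values:

```javascript
// Platt scaling: pass the raw probability through a fitted logistic.
// A < 1 shrinks the logit toward 0, pulling overconfident probabilities
// back toward 0.5; B corrects a constant bias. A and B are placeholders
// to be fitted on settled outcomes.
function logit(p) {
  return Math.log(p / (1 - p));
}

function plattCalibrate(pRaw, A = 1.0, B = 0.0) {
  return 1 / (1 + Math.exp(-(A * logit(pRaw) + B)));
}
```

With A = 1, B = 0 this is the identity; an overconfident model typically fits A < 1, so a claimed 70% shrinks toward the observed win rate.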
The rule engine, probability math, edge calculations, bracket logic, and settlement evaluation have no unit tests. The recent strike_type inversion bug (which caused wrong-direction bets) was caught by observation, not by tests. For a system that autonomously spends real dollars, the probability math needs exhaustive test coverage.
Found by: Engineer
The bot uses KMDW (Midway) for Chicago, but Kalshi may settle against KORD (O'Hare). These stations can differ by 2-4°F, especially for overnight lows. Needs verification against Kalshi's contract specifications.
Found by: Meteorologist
Every trade gets $2 regardless of edge strength. A 180% edge DEN trade got the same capital as a 14% edge CHI trade. Quarter-Kelly sizing proportional to edge would dramatically improve returns while staying conservative.
Found by: Quant Trader
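Quarter-Kelly for a binary contract is a few lines. A sketch, assuming `price` is the cost per $1-payout contract and `winProb` is already calibrated; the cap and names are illustrative:

```javascript
// Quarter-Kelly stake for a binary contract costing `price` per $1 payout.
// Full Kelly: f = (p*b - q) / b, where b = (1 - price) / price is the net
// odds per dollar risked. Quartering keeps sizing conservative while still
// scaling with edge. maxStake is an illustrative cap.
function quarterKellyStake(winProb, price, bankroll, maxStake = 10) {
  const b = (1 - price) / price;
  const f = (winProb * b - (1 - winProb)) / b;
  return Math.min(Math.max(0, f / 4) * bankroll, maxStake);
}
```

On a $100 bankroll at a 50c price, a 70% contract sizes to $10 while a marginal 55% contract sizes to $2.50, instead of both getting a flat $2.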
Requires 5 contracts available at the ask even when the bot only wants to buy 2-3. Valid trades with 3-4 contracts available get skipped. Should compare against actual intended order size, not the max.
Found by: Quant Trader
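The fix is a one-line comparison against the intended order size. A sketch with illustrative names (the fixed minimum of 5 is taken from the review):

```javascript
const MIN_DEPTH = 5; // current behavior: fixed floor regardless of order size

// Current check: skips valid trades when only 3-4 contracts are offered.
function passesDepthCheckOld(availableAtAsk) {
  return availableAtAsk >= MIN_DEPTH;
}

// Proposed check: require only as much depth as the bot intends to buy.
function passesDepthCheck(intendedQty, availableAtAsk) {
  return availableAtAsk >= intendedQty;
}
```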
The bot doesn't call getBalance() before trading. When the account is empty, it hammers the Kalshi API with guaranteed-to-fail orders every 5 minutes until the window closes. Happened 12+ times on Feb 28 and Mar 2.
Found by: Risk Analyst
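A minimal pre-trade guard, sketched with assumed names; `balanceCents` would come from a single Kalshi balance call at the top of the trading window:

```javascript
// Decide once per cycle whether an order is even affordable. Skipping the
// rest of the window (rather than retrying) stops the 5-minute hammering
// of guaranteed-to-fail orders when the account is empty.
function decidePreTrade(balanceCents, orderCostCents) {
  if (balanceCents < orderCostCents) {
    return { action: "skip_window", reason: "insufficient_balance" };
  }
  return { action: "place_order" };
}
```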
The .env file with API keys, Discord tokens, and webhook URLs has -rw-r--r-- permissions — should be 600. Additionally, log files have no timestamps on individual lines, making post-mortems difficult.
Found by: Engineer
Detailed Reviews
Grades by reviewer
Quant Trader: The probability engine and risk architecture are genuinely well-built, but the bot is choking itself out of trades. The SD filter at 3.0°F is functionally a kill switch, and flat $2 sizing means the trades that do get through generate negligible returns.
| Area | Grade | Notes |
| --- | --- | --- |
| Edge Detection | B- | Sound framework, but uses strict > instead of >= for thresholds. No model bias correction. |
| Position Sizing | D | Flat $2 is strategically incoherent. A 180% edge trade gets same capital as 14% edge. |
| Risk Management | B | Good layers (per-city, per-day, dedup) but missing drawdown breaker and correlation control. |
| Market Selection | B+ | Tiered by backtest. Consider dropping Denver (structural high SD). |
| Execution | B- | Limit orders, fill checking. But depth check is over-strict and no cancel/replace logic. |
| SD Filter | D | Good concept, catastrophic calibration. Needs 5-6°F or per-model approach. |
Engineer: Clean ESM module structure with good separation of concerns. The 9-step pipeline's checks accumulator is an excellent pattern. But zero test coverage on real-money code is the single biggest engineering risk.
| Area | Grade | Notes |
| --- | --- | --- |
| Code Quality | B+ | Clean, well-documented, consistent naming. Minor formatting issues. |
| Error Handling | B- | Good try/catch coverage but no retry logic on METAR or forecast fetches. |
| State Management | B+ | SQLite WAL mode, atomic file writes. DB may not checkpoint on shutdown. |
| Operational Maturity | B- | Discord alerts are good. No log timestamps, no structured logging, no metrics. |
| Security | C | .env is world-readable. PEM file in project root. One git add . from disaster. |
| Testing | F | Zero tests. The src/test/ directory exists but is empty. |
| Deployment | B | systemd service is solid. But 2 unpushed commits, 10 uncommitted files. |
| Tech Debt | C- | ~15 dead Python v1 files, deprecated exports, one-off scripts cluttering root. |
Risk Analyst: The settlement checker has never run automatically, updateSettlement has a bind-parameter bug, and the true financial position is unknown due to the 10-day dual-instance period. The local database is not the source of truth.
| Area | Grade | Notes |
| --- | --- | --- |
| Financial Risk | B- | Guardrails enforce caps. But no balance check, no drawdown breaker. |
| Double-Betting | C+ | DB dedup works per-machine. Zero protection across machines (proven failure). |
| Data Integrity | C | Open-Meteo model name changes can break all forecasts. IEM is unreliable. |
| API Failures | C+ | Graceful degradation per-station. But no retry, no circuit breaker. |
| Settlement | D | Broken cron, bind-parameter bug, wrong bracket evaluation logic. |
| Operational | B | systemd restart works. No log rotation, no watchdog, unbounded log growth. |
| Edge Cases | C+ | DST handled. Fall-back overlap hour untested. No market holiday logic. |
| Dual-Instance Damage | D- | 10 days of potential duplicate orders. True P&L unknown. |
Meteorologist: The architecture is sound but raw ensemble counting is overconfident; the 55.6% actual win rate vs. the 70% claimed is the smoking gun. Adding NBM and applying calibration (Platt scaling or isotonic regression) would be the highest-ROI improvement.
| Area | Grade | Notes |
| --- | --- | --- |
| Model Selection | B+ | Good global ensemble mix. Missing NBM (the best post-processed US product). |
| Model Weights | C+ | Static, not empirically calibrated. GEM at 15% is too high. Should vary by lead time. |
| Probability Math | C | Raw member counting is underdispersive. Needs KDE or calibration layer. |
| SD Filter | B- | Smart concept, needs per-city and per-season thresholds. |
| Station Selection | B | Mostly correct. KMDW for Chicago may be wrong — Kalshi may use KORD. |
| METAR Usage | B- | IEM is unreliable. 60-min staleness too generous. Needs fallback source. |
| Forecast Timing | B+ | Well-designed windows. Missing intraday hedging opportunity. |
| Seasonal Adaptation | D | Everything is fixed year-round. Weights, SD, and bias all vary by season. |