4-Agent Audit — March 20, 2026

The honest report card

Four independent reviewers graded the Kalshi weather bot from different angles: trading strategy, software engineering, risk analysis, and meteorological science.

Reviewer | Grade / risk level | Summary
Quant Trader | C+ | Good engine, bad calibration
Engineer | B− | Clean code, zero tests
Risk Analyst | HIGH | Settlement broken
Meteorologist | C+ | Uncalibrated probabilities
The Numbers

Reality check

What the bot claims vs. what's actually happening.

Metric | Value | Note
Claimed win probability | 70%+ | MIN_WIN_PROBABILITY
Actual win rate | 55.6% | 10W / 8L settled
Trades in 8 days | 0 | SD filter blocking all
Unsettled trades | 6 | $9.69 at risk, 9 days old
Unit tests | 0 | On code that spends money
Days on 2 machines | 10 | Dual-instance, same API key
Net P&L (local DB) | +$3.71 | May not match Kalshi
Settlement crons running | 0 | pm2 not installed, no timer
Priority Actions

What needs fixing

Critical: Settlement pipeline is completely broken
No automated settlement cron exists — the pm2 config references it but pm2 isn't installed, and systemd has no timer for it. The updateSettlement function has a SQL bind-parameter mismatch (6 values passed to 7 placeholders) meaning it silently fails or corrupts data when called. 6 trades from March 11-12 remain unsettled with $9.69 at risk.
Found by: Risk Analyst, Engineer
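A mismatch of this kind can be caught before the statement ever reaches SQLite. The following is a minimal sketch, assuming the query uses positional `?` placeholders; `countPlaceholders` and `assertBindCount` are hypothetical helpers, and the SQL text and column names are illustrative, not the bot's actual statement.

```javascript
// Hypothetical guard: count positional "?" placeholders, skipping quoted literals.
function countPlaceholders(sql) {
  let count = 0;
  let quote = null; // the quote character while inside a string literal
  for (const ch of sql) {
    if (quote) {
      if (ch === quote) quote = null;
    } else if (ch === "'" || ch === '"') {
      quote = ch;
    } else if (ch === "?") {
      count += 1;
    }
  }
  return count;
}

// Throw loudly instead of letting the driver fail silently.
function assertBindCount(sql, params) {
  const expected = countPlaceholders(sql);
  if (expected !== params.length) {
    throw new Error(`bind mismatch: ${expected} placeholders, ${params.length} values`);
  }
}

// Illustrative statement with 7 placeholders (column names are assumptions).
const sql =
  "UPDATE trades SET status=?, outcome=?, pnl=?, settled_at=?, actual_temp=?, notes=? WHERE id=?";

try {
  assertBindCount(sql, [1, 2, 3, 4, 5, 6]); // 6 values for 7 placeholders
} catch (err) {
  console.log(err.message); // bind mismatch: 7 placeholders, 6 values
}
```

Calling a guard like this at the top of updateSettlement would turn the silent failure into an immediate, visible error.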
Critical: Reconcile with Kalshi; local DB is unreliable
The bot ran on two machines simultaneously for 10 days (Cheesegrater via systemd + iMac via PM2), each with separate trades.db files, same API key. Every trade may have been placed twice. The local P&L of +$3.71 is likely wrong. Must pull actual order history from Kalshi API and reconcile.
Found by: Risk Analyst
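The reconciliation itself is mostly a join on order ID. The sketch below assumes the exchange order history and both local databases have already been loaded into arrays. The record shapes (`orderId`, `ticker`, `contracts`, `filled`) and the `reconcile` function are hypothetical and would need to be mapped onto the real Kalshi response and the trades.db schema.

```javascript
// Hypothetical shapes: local rows are { orderId, ticker, contracts };
// exchange orders are { orderId, ticker, filled }. Real field names may differ.
function reconcile(localDbs, exchangeOrders) {
  const localById = new Map(); // orderId -> local row
  const sourcesByTicker = new Map(); // ticker -> Set of machines that traded it
  for (const [source, rows] of Object.entries(localDbs)) {
    for (const row of rows) {
      localById.set(row.orderId, row);
      if (!sourcesByTicker.has(row.ticker)) sourcesByTicker.set(row.ticker, new Set());
      sourcesByTicker.get(row.ticker).add(source);
    }
  }

  const report = {
    matched: [],
    quantityMismatch: [],
    missingLocally: [],
    missingOnExchange: [],
    doubleBets: [],
  };
  const seen = new Set();
  for (const order of exchangeOrders) {
    seen.add(order.orderId);
    const local = localById.get(order.orderId);
    if (!local) report.missingLocally.push(order.orderId);
    else if (local.contracts !== order.filled) report.quantityMismatch.push(order.orderId);
    else report.matched.push(order.orderId);
  }
  for (const id of localById.keys()) {
    if (!seen.has(id)) report.missingOnExchange.push(id);
  }
  // A ticker recorded by more than one machine is a candidate duplicate bet.
  for (const [ticker, sources] of sourcesByTicker) {
    if (sources.size > 1) report.doubleBets.push(ticker);
  }
  return report;
}
```

Anything in `missingLocally`, `quantityMismatch`, or `doubleBets` is exposure the local P&L figure does not account for correctly.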
Critical: SD filter is a kill switch, no trades for 8 days
MAX_ENSEMBLE_SD: 3.0 blocks everything. Actual SD across stations is ~7°F, which is normal spring weather uncertainty. The bot is effectively turned off. Raise to 5-6°F, make it per-city, or replace the binary cutoff with a sliding confidence scale.
Found by: Quant Trader, Meteorologist
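One way to express the "sliding confidence scale" option is a linear ramp that scales the stake down as ensemble spread grows, instead of rejecting the trade outright. This is only a sketch: `sdConfidence`, `scaledStake`, and the 3°F and 8°F breakpoints are illustrative assumptions, not tuned values.

```javascript
// Hypothetical replacement for the binary MAX_ENSEMBLE_SD cutoff.
// fullBelow and zeroAbove are illustrative thresholds in °F.
function sdConfidence(sd, fullBelow = 3.0, zeroAbove = 8.0) {
  if (sd <= fullBelow) return 1; // tight ensemble: full confidence
  if (sd >= zeroAbove) return 0; // very wide ensemble: no trade
  return (zeroAbove - sd) / (zeroAbove - fullBelow); // linear ramp in between
}

// Scale a base stake (dollars) by the confidence factor.
function scaledStake(baseStake, sd) {
  return baseStake * sdConfidence(sd);
}

console.log(sdConfidence(2.5)); // 1
console.log(sdConfidence(5.5)); // 0.5
console.log(sdConfidence(9)); // 0
```

With these breakpoints, the ~7°F spread described above would still trade, but at a fraction of the base stake rather than not at all.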
High: Probability estimates are uncalibrated
Raw ensemble member counting is known to be overconfident. The bot requires a win probability of at least 70% but has achieved 55.6% (10 wins, 8 losses), a gap of roughly 14 percentage points. Operational weather centers routinely apply statistical post-processing (MOS, EMOS, Platt scaling) to correct this. Adding the NBM (National Blend of Models) as a model source and calibrating against backtest data would be the highest-ROI improvement.
Found by: Meteorologist, Quant Trader
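To make the calibration step concrete, here is a minimal sketch of a Platt-style fit: a two-parameter logistic regression on the logit of the raw probability, trained by plain gradient descent on log loss. The function name `fitPlatt`, the learning rate, and the iteration count are assumptions, and the training data below is synthetic, chosen only to show an overconfident forecaster being pulled back toward its observed hit rates.

```javascript
const logit = (p) => Math.log(p / (1 - p));
const sigmoid = (z) => 1 / (1 + Math.exp(-z));
const clamp = (p) => Math.min(Math.max(p, 1e-6), 1 - 1e-6); // avoid infinite logits

// Fit calibrated = sigmoid(a * logit(raw) + b) by gradient descent on log loss.
function fitPlatt(rawProbs, outcomes, { lr = 0.5, iters = 5000 } = {}) {
  const xs = rawProbs.map((p) => logit(clamp(p)));
  let a = 1;
  let b = 0; // a = 1, b = 0 is the identity map (no calibration)
  for (let i = 0; i < iters; i++) {
    let gradA = 0;
    let gradB = 0;
    for (let j = 0; j < xs.length; j++) {
      const err = sigmoid(a * xs[j] + b) - outcomes[j];
      gradA += err * xs[j];
      gradB += err;
    }
    a -= (lr * gradA) / xs.length;
    b -= (lr * gradB) / xs.length;
  }
  return (p) => sigmoid(a * logit(clamp(p)) + b);
}

// Synthetic backtest: forecasts of 0.9 won 6 of 10, forecasts of 0.7 won 5 of 10.
const raw = [...Array(10).fill(0.9), ...Array(10).fill(0.7)];
const won = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0];
const calibrate = fitPlatt(raw, won);
console.log(calibrate(0.9).toFixed(2), calibrate(0.7).toFixed(2));
```

On real data the fit should be done on a held-out backtest period, and with only 18 settled trades the estimate would be very noisy, so a larger backtest sample is the natural input.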
High: Zero automated tests on money-spending code
The rule engine, probability math, edge calculations, bracket logic, and settlement evaluation have no unit tests. The recent strike_type inversion bug (which caused wrong-direction bets) was caught by observation, not by tests. For a system that autonomously spends real dollars, the probability math needs exhaustive test coverage.
Found by: Engineer
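A direction bug like the strike_type inversion is exactly what a small table-driven test catches. The sketch below is illustrative only: `resolvesYes`, the strike-type labels, the `floor`/`cap` field names, and the boundary conventions are all assumptions and must be checked against the bot's real contract parser and Kalshi's contract rules.

```javascript
// Hypothetical settlement predicate. Labels and boundary rules are assumptions.
function resolvesYes(contract, observed) {
  switch (contract.strikeType) {
    case "greater":
      return observed > contract.floor;
    case "less":
      return observed < contract.cap;
    case "between":
      return observed >= contract.floor && observed <= contract.cap;
    default:
      throw new Error(`unknown strikeType: ${contract.strikeType}`);
  }
}

// Each case pins down one direction so an inversion fails immediately.
const cases = [
  { contract: { strikeType: "greater", floor: 70 }, observed: 71, expected: true },
  { contract: { strikeType: "greater", floor: 70 }, observed: 69, expected: false },
  { contract: { strikeType: "less", cap: 70 }, observed: 69, expected: true },
  { contract: { strikeType: "less", cap: 70 }, observed: 71, expected: false },
  { contract: { strikeType: "between", floor: 70, cap: 71 }, observed: 70, expected: true },
  { contract: { strikeType: "between", floor: 70, cap: 71 }, observed: 72, expected: false },
];

const failures = cases.filter((c) => resolvesYes(c.contract, c.observed) !== c.expected);
console.log(failures.length === 0 ? "all cases pass" : failures);
```

The same table style extends naturally to the probability and edge functions: fixed inputs, hand-computed expected outputs, one row per behavior.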
High: Chicago may be using the wrong airport
The bot uses KMDW (Midway) for Chicago, but Kalshi may settle against KORD (O'Hare). These stations can differ by 2-4°F, especially for overnight lows. Needs verification against Kalshi's contract specifications.
Found by: Meteorologist
Medium: Flat $2 sizing wastes edge
Every trade gets $2 regardless of edge strength. A 180% edge DEN trade got the same capital as a 14% edge CHI trade. Quarter-Kelly sizing proportional to edge would dramatically improve returns while staying conservative.
Found by: Quant Trader
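For a binary contract bought at price c (in dollars) that pays $1 on a win, the full Kelly fraction of bankroll is (p − c) / (1 − c), where p is the estimated win probability. A quarter-Kelly sizing function could look like the sketch below. `quarterKellyContracts` and its parameters are hypothetical, fees are ignored, and the result is only as good as the probability estimate fed into it.

```javascript
// Hypothetical sizing helper. p: estimated win probability, price: ask in dollars
// (0 < price < 1), bankroll: dollars available, maxContracts: per-order cap.
function quarterKellyContracts(p, price, bankroll, maxContracts) {
  if (p <= price) return 0; // no positive edge, no trade
  const kelly = (p - price) / (1 - price); // full Kelly fraction of bankroll
  const stake = 0.25 * kelly * bankroll; // quarter-Kelly stake in dollars
  const contracts = Math.floor(stake / price);
  // Floor at 1 contract when there is an edge, cap at the per-order maximum.
  return Math.min(Math.max(contracts, 1), maxContracts);
}

console.log(quarterKellyContracts(0.6, 0.4, 50, 20)); // 10
console.log(quarterKellyContracts(0.6, 0.4, 50, 5)); // 5 (capped)
console.log(quarterKellyContracts(0.52, 0.5, 20, 5)); // 1 (floored)
console.log(quarterKellyContracts(0.4, 0.5, 20, 5)); // 0 (no edge)
```

Because uncalibrated probabilities inflate p, Kelly-style sizing is safest applied after the calibration fix, not before it.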
Medium: Depth check is too strict
Requires 5 contracts available at the ask even when the bot only wants to buy 2-3. Valid trades with 3-4 contracts available get skipped. Should compare against actual intended order size, not the max.
Found by: Quant Trader
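The difference between the two checks is small in code but decides whether thin-but-sufficient books are tradable. A sketch, with hypothetical function names and the fixed requirement of 5 taken from the description above:

```javascript
const FIXED_DEPTH_REQUIREMENT = 5; // the fixed requirement described above

// Behavior as described: always demand 5 contracts at the ask.
function passesFixedDepth(availableAtAsk) {
  return availableAtAsk >= FIXED_DEPTH_REQUIREMENT;
}

// Proposed: only require enough depth to fill the intended order.
function passesSizedDepth(availableAtAsk, intendedContracts) {
  return intendedContracts > 0 && availableAtAsk >= intendedContracts;
}

// 3 contracts available, bot wants 2: rejected today, accepted by the sized check.
console.log(passesFixedDepth(3)); // false
console.log(passesSizedDepth(3, 2)); // true
```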
Medium: No balance check before placing orders
The bot doesn't call getBalance() before trading. When the account is empty, it hammers the Kalshi API with guaranteed-to-fail orders every 5 minutes until the window closes. Happened 12+ times on Feb 28 and Mar 2.
Found by: Risk Analyst
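A pre-trade gate can be a pure function that takes the balance as input, which keeps it trivially testable. In this sketch the balance is assumed to be a plain dollar amount obtained from getBalance() (its actual return shape is not documented here), and `preTradeCheck` and the $5 default minimum are hypothetical.

```javascript
// Hypothetical pre-trade gate. balance and orderCost are in dollars.
function preTradeCheck(balance, orderCost, minBalance = 5) {
  if (!Number.isFinite(balance)) return { ok: false, reason: "balance unavailable" };
  if (balance < minBalance) return { ok: false, reason: "below minimum balance" };
  if (balance < orderCost) return { ok: false, reason: "insufficient funds for order" };
  return { ok: true, reason: null };
}

console.log(preTradeCheck(0, 2)); // { ok: false, reason: 'below minimum balance' }
console.log(preTradeCheck(25, 2)); // { ok: true, reason: null }
```

Running the check once at window start, and skipping the whole window on failure, avoids sending orders that are guaranteed to be rejected.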
Medium: .env is world-readable + no log timestamps
The .env file with API keys, Discord tokens, and webhook URLs has -rw-r--r-- permissions — should be 600. Additionally, log files have no timestamps on individual lines, making post-mortems difficult.
Found by: Engineer
Detailed Reviews

Grades by reviewer

Quant Trader
Trading strategy & edge detection
C+
The probability engine and risk architecture are genuinely well-built, but the bot is choking itself out of trades. The SD filter at 3.0°F is functionally a kill switch, and flat $2 sizing means the trades that do get through generate negligible returns.
Area | Grade | Notes
Edge Detection | B- | Sound framework, but uses strict > instead of >= for thresholds. No model bias correction.
Position Sizing | D | Flat $2 is strategically incoherent. A 180% edge trade gets same capital as 14% edge.
Risk Management | B | Good layers (per-city, per-day, dedup) but missing drawdown breaker and correlation control.
Market Selection | B+ | Tiered by backtest. Consider dropping Denver (structural high SD).
Execution | B- | Limit orders, fill checking. But depth check is over-strict and no cancel/replace logic.
SD Filter | D | Good concept, catastrophic calibration. Needs 5-6°F or per-model approach.
Software Engineer
Code quality & operational maturity
B−
Clean ESM module structure with good separation of concerns. The 9-step pipeline's checks accumulator is an excellent pattern. But zero test coverage on real-money code is the single biggest engineering risk.
Area | Grade | Notes
Code Quality | B+ | Clean, well-documented, consistent naming. Minor formatting issues.
Error Handling | B- | Good try/catch coverage but no retry logic on METAR or forecast fetches.
State Management | B+ | SQLite WAL mode, atomic file writes. DB may not checkpoint on shutdown.
Operational Maturity | B- | Discord alerts are good. No log timestamps, no structured logging, no metrics.
Security | C | .env is world-readable. PEM file in project root. One "git add ." away from disaster.
Testing | F | Zero tests. The src/test/ directory exists but is empty.
Deployment | B | systemd service is solid. But 2 unpushed commits, 10 uncommitted files.
Tech Debt | C- | ~15 dead Python v1 files, deprecated exports, one-off scripts cluttering root.
Risk Analyst
Failure modes & worst-case scenarios
HIGH
The settlement checker has never run automatically, updateSettlement has a bind-parameter bug, and the true financial position is unknown due to the 10-day dual-instance period. The local database is not the source of truth.
Area | Grade | Notes
Financial Risk | B- | Guardrails enforce caps. But no balance check, no drawdown breaker.
Double-Betting | C+ | DB dedup works per-machine. Zero protection across machines (proven failure).
Data Integrity | C | Open-Meteo model name changes can break all forecasts. IEM is unreliable.
API Failures | C+ | Graceful degradation per-station. But no retry, no circuit breaker.
Settlement | D | Broken cron, bind-parameter bug, wrong bracket evaluation logic.
Operational | B | systemd restart works. No log rotation, no watchdog, unbounded log growth.
Edge Cases | C+ | DST handled. Fall-back overlap hour untested. No market holiday logic.
Dual-Instance Damage | D- | 10 days of potential duplicate orders. True P&L unknown.
Meteorologist
Weather science & forecast quality
C+
The architecture is sound, but raw ensemble counting is overconfident: a 55.6% actual win rate against a claimed 70%+ is the smoking gun. Adding NBM and applying calibration (Platt scaling or isotonic regression) would be the highest-ROI improvement.
Area | Grade | Notes
Model Selection | B+ | Good global ensemble mix. Missing NBM (the best post-processed US product).
Model Weights | C+ | Static, not empirically calibrated. GEM at 15% is too high. Should vary by lead time.
Probability Math | C | Raw ensembles are underdispersive, so member counting is overconfident. Needs KDE or calibration layer.
SD Filter | B- | Smart concept, needs per-city and per-season thresholds.
Station Selection | B | Mostly correct. KMDW for Chicago may be wrong; Kalshi may use KORD.
METAR Usage | B- | IEM is unreliable. 60-min staleness too generous. Needs fallback source.
Forecast Timing | B+ | Well-designed windows. Missing intraday hedging opportunity.
Seasonal Adaptation | D | Everything is fixed year-round. Weights, SD, and bias all vary by season.
Next Steps

Prioritized action plan

Fix now

  1. Fix and automate settlement
     Add systemd timer. Fix updateSettlement bind-parameter bug. Run settlement manually to clear 6 pending trades.
  2. Reconcile with Kalshi API
     Pull actual order history from Kalshi. Compare against both trades.db files. Determine true P&L and exposure.
  3. Raise MAX_ENSEMBLE_SD to 5-6°F
     Or make it per-city (3°F coastal, 5°F inland). Unblocks trading immediately.

This week

  4. Verify Chicago settlement station
     Check Kalshi contract specs for KMDW vs KORD. Fix if wrong.
  5. chmod 600 the .env file
     API keys, Discord tokens, webhook URLs are world-readable. One command.
  6. Add balance check before trading
     Call getBalance() at window start. Skip trading if below threshold.
  7. Write unit tests for probability math and edge calculation
     Cover computeWeightedProbability, calculateBuyEdge, parseContract, and bracket logic.

Next sprint

  8. Implement probability calibration
     Platt scaling or isotonic regression against backtest data. Investigate adding NBM as a model source.
  9. Switch from flat $2 to quarter-Kelly sizing
     Size proportional to edge. Floor at 1 contract, cap at MAX_CONTRACTS_PER_ORDER.
  10. Fix depth check to compare against actual order size
      Currently requires 5 contracts at ask even when buying 2. Skips valid trades.
  11. Add seasonal weight and threshold adaptation
      Model weights, SD thresholds, and bias corrections should vary by season.
  12. Clean up tech debt
      Remove the ~15 dead Python v1 files, deprecated exports, and one-off scripts. Push to GitHub.