The Numbers
Reality check
What the bot claims vs. what's actually happening.
| Metric | Value | Notes |
| --- | --- | --- |
| Claimed win probability | 70%+ | `MIN_WIN_PROBABILITY` |
| Actual win rate | 55.6% | 10W / 8L settled |
| Trades in last 8 days | 0 | SD filter blocking all |
| Unsettled trades | 6 | $9.69 at risk, 9 days old |
| Unit tests | 0 | On code that spends money |
| Days running on 2 machines | 10 | Dual-instance, same API key |
| Net P&L (local DB) | +$3.71 | May not match Kalshi |
| Settlement crons running | 0 | pm2 not installed, no timer |
Priority Actions
What needs fixing
No automated settlement cron exists — the pm2 config references it but pm2 isn't installed, and systemd has no timer for it. The updateSettlement function has a SQL bind-parameter mismatch (6 values passed to 7 placeholders) meaning it silently fails or corrupts data when called. 6 trades from March 11-12 remain unsettled with $9.69 at risk.
Found by: Risk Analyst, Engineer
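The bind-parameter mismatch is the kind of bug a cheap runtime guard catches before anything touches the database. A minimal sketch; the function name and SQL shape are illustrative, not the bot's actual code:

```javascript
// Hypothetical guard: count the "?" placeholders in a statement and refuse
// to run it if the bind array doesn't match. Cheap insurance against
// silent misbinding.
function assertBindCount(sql, values) {
  const placeholders = (sql.match(/\?/g) ?? []).length;
  if (placeholders !== values.length) {
    throw new Error(
      `bind mismatch: ${placeholders} placeholders, ${values.length} values`
    );
  }
  return true;
}

// The buggy shape from the review: 7 placeholders, 6 bound values.
const updateSql =
  "UPDATE trades SET settled = ?, result = ?, payout = ?, pnl = ?, " +
  "settled_at = ?, settle_price = ? WHERE id = ?";
```

Calling `assertBindCount(updateSql, sixValues)` throws immediately instead of letting the statement fail or misbind silently.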
The bot ran on two machines simultaneously for 10 days (Cheesegrater via systemd + iMac via PM2), each with separate trades.db files, same API key. Every trade may have been placed twice. The local P&L of +$3.71 is likely wrong. Must pull actual order history from Kalshi API and reconcile.
Found by: Risk Analyst
MAX_ENSEMBLE_SD: 3.0 blocks everything. Actual SD across stations is ~7°F, which is normal spring weather uncertainty. The bot is effectively turned off. Raise to 5-6°F, make it per-city, or replace the binary cutoff with a sliding confidence scale.
Found by: Quant Trader, Meteorologist
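One way the binary cutoff could be replaced, sketched with assumed knee points (the 3°F and 8°F values below are illustrative, not tuned):

```javascript
// Sliding confidence scale instead of a hard SD cutoff: full weight below
// fullConfSd, zero weight above zeroConfSd, linear ramp in between.
// Knee points are illustrative assumptions.
function sdConfidence(sdF, fullConfSd = 3.0, zeroConfSd = 8.0) {
  if (sdF <= fullConfSd) return 1.0;
  if (sdF >= zeroConfSd) return 0.0;
  return (zeroConfSd - sdF) / (zeroConfSd - fullConfSd);
}

// Shrink the win probability toward 50/50 as ensemble spread grows,
// rather than rejecting the trade outright.
function adjustedWinProb(rawProb, sdF) {
  return 0.5 + (rawProb - 0.5) * sdConfidence(sdF);
}
```

A high-spread day then produces a weaker signal that usually fails the edge threshold on its own, instead of the bot being switched off entirely.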
Raw ensemble member counting is known to be overconfident. The bot requires a 70% win probability but achieves 55.6%, a gap of roughly 14 points. Every operational weather center applies post-processing (MOS, EMOS, Platt scaling) to correct this. Adding the NBM (National Blend of Models) as a model source and calibrating against backtest data would be the highest-ROI improvement.
Found by: Meteorologist, Quant Trader
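Once coefficients are fitted, Platt scaling is a one-line transform. A sketch; `A` and `B` here are placeholders that would come from fitting against the bot's settled-trade history, not real calibrated values:

```javascript
// Platt scaling: pass the raw probability through a fitted logistic.
// A < 1 shrinks the logit toward 0, pulling overconfident probabilities
// back toward 0.5; B corrects a constant bias. A and B are placeholders
// to be fitted on settled outcomes.
function logit(p) {
  return Math.log(p / (1 - p));
}

function plattCalibrate(pRaw, A = 1.0, B = 0.0) {
  return 1 / (1 + Math.exp(-(A * logit(pRaw) + B)));
}
```

With A = 1, B = 0 this is the identity; an overconfident model typically fits A < 1, so a claimed 70% shrinks toward the observed win rate.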
The rule engine, probability math, edge calculations, bracket logic, and settlement evaluation have no unit tests. The recent strike_type inversion bug (which caused wrong-direction bets) was caught by observation, not by tests. For a system that autonomously spends real dollars, the probability math needs exhaustive test coverage.
Found by: Engineer
The bot uses KMDW (Midway) for Chicago, but Kalshi may settle against KORD (O'Hare). These stations can differ by 2-4°F, especially for overnight lows. Needs verification against Kalshi's contract specifications.
Found by: Meteorologist
Every trade gets $2 regardless of edge strength. A 180% edge DEN trade got the same capital as a 14% edge CHI trade. Quarter-Kelly sizing proportional to edge would dramatically improve returns while staying conservative.
Found by: Quant Trader
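Quarter-Kelly for a binary contract is a few lines. A sketch, assuming `price` is the cost per $1-payout contract and `winProb` is already calibrated; the cap and names are illustrative:

```javascript
// Quarter-Kelly stake for a binary contract costing `price` per $1 payout.
// Full Kelly: f = (p*b - q) / b, where b = (1 - price) / price is the net
// odds per dollar risked. Quartering keeps sizing conservative while still
// scaling with edge. maxStake is an illustrative cap.
function quarterKellyStake(winProb, price, bankroll, maxStake = 10) {
  const b = (1 - price) / price;
  const f = (winProb * b - (1 - winProb)) / b;
  return Math.min(Math.max(0, f / 4) * bankroll, maxStake);
}
```

On a $100 bankroll at a 50c price, a 70% contract sizes to $10 while a marginal 55% contract sizes to $2.50, instead of both getting a flat $2.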
Requires 5 contracts available at the ask even when the bot only wants to buy 2-3. Valid trades with 3-4 contracts available get skipped. Should compare against actual intended order size, not the max.
Found by: Quant Trader
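The fix is a one-line comparison against the intended order size. A sketch with illustrative names (the fixed minimum of 5 is taken from the review):

```javascript
const MIN_DEPTH = 5; // current behavior: fixed floor regardless of order size

// Current check: skips valid trades when only 3-4 contracts are offered.
function passesDepthCheckOld(availableAtAsk) {
  return availableAtAsk >= MIN_DEPTH;
}

// Proposed check: require only as much depth as the bot intends to buy.
function passesDepthCheck(intendedQty, availableAtAsk) {
  return availableAtAsk >= intendedQty;
}
```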
The bot doesn't call getBalance() before trading. When the account is empty, it hammers the Kalshi API with guaranteed-to-fail orders every 5 minutes until the window closes. Happened 12+ times on Feb 28 and Mar 2.
Found by: Risk Analyst
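A minimal pre-trade guard, sketched with assumed names; `balanceCents` would come from a single Kalshi balance call at the top of the trading window:

```javascript
// Decide once per cycle whether an order is even affordable. Skipping the
// rest of the window (rather than retrying) stops the 5-minute hammering
// of guaranteed-to-fail orders when the account is empty.
function decidePreTrade(balanceCents, orderCostCents) {
  if (balanceCents < orderCostCents) {
    return { action: "skip_window", reason: "insufficient_balance" };
  }
  return { action: "place_order" };
}
```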
The .env file with API keys, Discord tokens, and webhook URLs has -rw-r--r-- permissions — should be 600. Additionally, log files have no timestamps on individual lines, making post-mortems difficult.
Found by: Engineer
Detailed Reviews
Grades by reviewer
Quant Trader: The probability engine and risk architecture are genuinely well-built, but the bot is choking itself out of trades. The SD filter at 3.0°F is functionally a kill switch, and flat $2 sizing means the trades that do get through generate negligible returns.
| Area | Grade | Notes |
| --- | --- | --- |
| Edge Detection | B- | Sound framework, but uses strict > instead of >= for thresholds. No model bias correction. |
| Position Sizing | D | Flat $2 is strategically incoherent. A 180% edge trade gets same capital as 14% edge. |
| Risk Management | B | Good layers (per-city, per-day, dedup) but missing drawdown breaker and correlation control. |
| Market Selection | B+ | Tiered by backtest. Consider dropping Denver (structural high SD). |
| Execution | B- | Limit orders, fill checking. But depth check is over-strict and no cancel/replace logic. |
| SD Filter | D | Good concept, catastrophic calibration. Needs 5-6°F or per-model approach. |
Engineer: Clean ESM module structure with good separation of concerns. The 9-step pipeline's checks accumulator is an excellent pattern. But zero test coverage on real-money code is the single biggest engineering risk.
| Area | Grade | Notes |
| --- | --- | --- |
| Code Quality | B+ | Clean, well-documented, consistent naming. Minor formatting issues. |
| Error Handling | B- | Good try/catch coverage but no retry logic on METAR or forecast fetches. |
| State Management | B+ | SQLite WAL mode, atomic file writes. DB may not checkpoint on shutdown. |
| Operational Maturity | B- | Discord alerts are good. No log timestamps, no structured logging, no metrics. |
| Security | C | .env is world-readable. PEM file in project root. One git add . from disaster. |
| Testing | F | Zero tests. The src/test/ directory exists but is empty. |
| Deployment | B | systemd service is solid. But 2 unpushed commits, 10 uncommitted files. |
| Tech Debt | C- | ~15 dead Python v1 files, deprecated exports, one-off scripts cluttering root. |
Risk Analyst: The settlement checker has never run automatically, updateSettlement has a bind-parameter bug, and the true financial position is unknown due to the 10-day dual-instance period. The local database is not the source of truth.
| Area | Grade | Notes |
| --- | --- | --- |
| Financial Risk | B- | Guardrails enforce caps. But no balance check, no drawdown breaker. |
| Double-Betting | C+ | DB dedup works per-machine. Zero protection across machines (proven failure). |
| Data Integrity | C | Open-Meteo model name changes can break all forecasts. IEM is unreliable. |
| API Failures | C+ | Graceful degradation per-station. But no retry, no circuit breaker. |
| Settlement | D | Broken cron, bind-parameter bug, wrong bracket evaluation logic. |
| Operational | B | systemd restart works. No log rotation, no watchdog, unbounded log growth. |
| Edge Cases | C+ | DST handled. Fall-back overlap hour untested. No market holiday logic. |
| Dual-Instance Damage | D- | 10 days of potential duplicate orders. True P&L unknown. |
Meteorologist: The architecture is sound but raw ensemble counting is overconfident; the 55.6% actual win rate vs. the 70% claimed is the smoking gun. Adding NBM and applying calibration (Platt scaling or isotonic regression) would be the highest-ROI improvement.
| Area | Grade | Notes |
| --- | --- | --- |
| Model Selection | B+ | Good global ensemble mix. Missing NBM (the best post-processed US product). |
| Model Weights | C+ | Static, not empirically calibrated. GEM at 15% is too high. Should vary by lead time. |
| Probability Math | C | Raw member counting is underdispersive. Needs KDE or calibration layer. |
| SD Filter | B- | Smart concept, needs per-city and per-season thresholds. |
| Station Selection | B | Mostly correct. KMDW for Chicago may be wrong — Kalshi may use KORD. |
| METAR Usage | B- | IEM is unreliable. 60-min staleness too generous. Needs fallback source. |
| Forecast Timing | B+ | Well-designed windows. Missing intraday hedging opportunity. |
| Seasonal Adaptation | D | Everything is fixed year-round. Weights, SD, and bias all vary by season. |