Lessons from the Lab
What Baseline vs Iris Shows, and What It Does Not
May 19, 2026
The Baseline vs Iris comparison in Verdify Lab is intentionally framed as operational evidence, not a controlled A/B test.
That wording is important. The offline and online windows had different weather, humidity, solar load, and operating conditions. A serious proof system should make those caveats visible instead of hiding them behind a clean chart.
What the comparison shows
The comparison is still useful because it makes operational questions inspectable:
- Was the planner available?
- Were plans produced consistently?
- Did climate compliance, stress hours, water use, energy use, cost, and planner score move?
- What confounders might explain part of the movement?
- What should be tested next?
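One way to keep those questions inspectable is to store the caveats in the same record as the metric movement. The following is a minimal sketch, not Verdify's implementation: `WindowSummary`, `Comparison`, `compare_windows`, the metric names, and every number are hypothetical and only illustrate the shape of such a record.

```python
from dataclasses import dataclass, field


@dataclass
class WindowSummary:
    """Aggregated metrics for one observation window (hypothetical shape)."""
    label: str
    metrics: dict[str, float]


@dataclass
class Comparison:
    """Metric movement plus the caveats a reader needs to interpret it."""
    deltas: dict[str, float]
    confounders: list[str] = field(default_factory=list)
    evidence_type: str = "operational evidence, not a controlled A/B test"


def compare_windows(baseline: WindowSummary,
                    candidate: WindowSummary,
                    confounders: list[str]) -> Comparison:
    """Report candidate-minus-baseline for every metric both windows share."""
    shared = baseline.metrics.keys() & candidate.metrics.keys()
    deltas = {
        name: candidate.metrics[name] - baseline.metrics[name]
        for name in sorted(shared)
    }
    return Comparison(deltas=deltas, confounders=list(confounders))


# Illustrative numbers only; they are not measurements from Verdify Lab.
baseline = WindowSummary("Baseline", {"stress_hours": 12.0, "energy_kwh": 300.0})
iris = WindowSummary("Iris", {"stress_hours": 9.0, "energy_kwh": 310.0})

result = compare_windows(
    baseline,
    iris,
    confounders=["weather", "humidity", "solar load", "operating conditions"],
)

for name, delta in result.deltas.items():
    print(f"{name}: {delta:+.1f}")
print("Evidence type:", result.evidence_type)
print("Possible confounders:", ", ".join(result.confounders))
```

Because the confounders and the evidence type are part of the returned record, any chart or table built from it can show the caveats next to the deltas.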
That is a more useful standard than a vague success story. The point is not to claim magic causality. The point is to make the evidence strong enough to discuss.
Translate this to your workflow
Your AI workflow needs the same discipline. A before/after comparison can be useful, but only if the scorecard names the caveats: workload mix, reviewer staffing, customer seasonality, data quality, policy changes, and process changes.
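A scorecard can enforce this with a simple checklist. The sketch below assumes a scorecard stores its caveats as a mapping from category to a short note; `REQUIRED_CAVEATS` and `missing_caveats` are hypothetical names, and the example note is invented for illustration.

```python
# Caveat categories named in the paragraph above.
REQUIRED_CAVEATS = (
    "workload mix",
    "reviewer staffing",
    "customer seasonality",
    "data quality",
    "policy changes",
    "process changes",
)


def missing_caveats(scorecard_caveats: dict[str, str]) -> list[str]:
    """Return the required categories that have no note, or only a blank one."""
    return [
        name
        for name in REQUIRED_CAVEATS
        if not scorecard_caveats.get(name, "").strip()
    ]


# Hypothetical draft scorecard: one caveat addressed, one left blank.
draft = {
    "workload mix": "The after window had a different mix of request types.",
    "data quality": "   ",
}

gaps = missing_caveats(draft)
if gaps:
    print("Not ready to publish. Unaddressed caveats:", ", ".join(gaps))
```

A note such as "checked, no change observed" still counts as addressed. The check only asks that each category was considered, not that each one turned out to matter.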
Verdify uses this evidence posture in the AI Operations Scorecard and the Scorecard Template.