Machine learning teams frequently deploy models based on point estimates like AUROC or accuracy, but these numbers often hide critical uncertainty. Without confidence intervals, a minor metric difference could vanish with a new dataset, leading to unreliable systems in production. A new open-source tool called reliably-metrics is changing this by delivering statistically rigorous evaluation out of the box.
The Hidden Risk in Today’s ML Evaluations
Most model comparisons rely on bare floats like AUROC = 0.8512, presented with misleading precision. In reality, metrics are estimates derived from finite test sets, and their true values fluctuate with every new batch of data. When two models differ by just 0.004 in AUROC, teams rarely know whether this gap reflects a real improvement or random noise.
Consider evaluating two models on 500 test samples:
- Model A: AUROC = 0.847
- Model B: AUROC = 0.851
The 0.004 gap looks like a win for Model B, but without uncertainty quantification there is no way to tell whether it is meaningful. It could disappear entirely with a different test split, yet many ML pipelines proceed to production based solely on the point estimate.
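To make the scale of that uncertainty concrete, here is a minimal sketch of a percentile bootstrap interval for AUROC on a synthetic 500-sample test set. It does not use reliably-metrics; the helper names `auroc` and `bootstrap_ci` and the synthetic data are assumptions made purely for illustration.

```python
import random


def auroc(y_true, y_score):
    """AUROC via the rank-sum (Mann-Whitney U) identity, using average ranks for ties."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based average rank of the tie group
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(y_true, y_score, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if 0 < sum(labels) < n:  # skip resamples that contain only one class
            stats.append(auroc(labels, [y_score[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


# Synthetic test set of 500 samples (illustrative data, not from the article).
rng = random.Random(1)
y_true = [rng.randint(0, 1) for _ in range(500)]
y_score = [rng.gauss(1.5 if y else 0.0, 1.0) for y in y_true]

point = auroc(y_true, y_score)
low, high = bootstrap_ci(y_true, y_score)
print(f"AUROC={point:.4f} [{low:.4f}, {high:.4f}]  width={high - low:.4f}")
```

On data of this size the interval is typically several hundredths wide, which is roughly an order of magnitude larger than a 0.004 gap between two models.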
Meet reliably-metrics: Evaluation With Built-in Statistical Rigor
The reliably-metrics library eliminates guesswork by providing confidence intervals for every metric automatically. Installation is straightforward:
pip install reliably-metrics
Once installed, teams can evaluate models with minimal code:
import reliably as rb
report = rb.evaluate(y_true, y_prob)
print(report.summary())
The output includes metric values paired with 95% confidence intervals:
Report(task=binary, n=500)
ECE=0.0412 [0.0287, 0.0541]
smECE=0.0389 [0.0261, 0.0523]
Brier=0.1834 [0.1612, 0.2063]
NLL=0.4821 [0.4503, 0.5148]
AUROC=0.8234 [0.7941, 0.8509]
No manual bootstrap sampling, no statistical toolkits, and no extra complexity: just reliable numbers teams can trust.
Deciding Between Models: Statistical Significance Testing
Traditional model comparisons often default to choosing the higher metric value without considering uncertainty. The reliably-metrics library introduces statistical significance testing to provide clear guidance:
result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true
)
print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
Sample output might reveal:
Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.094
Significant: False
The confidence interval crossing zero and the p-value exceeding 0.05 indicate the improvement isn’t statistically significant. This data-driven insight prevents premature deployments of models that may not offer real advantages.
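One common way to obtain an interval for a metric difference is a paired bootstrap, in which both models are scored on the same resampled rows. The sketch below is an assumption about how such a comparison could work, not the internals of rb.compare; it uses the Brier score (lower is better) to stay short, and `paired_bootstrap_delta` and the synthetic predictions are hypothetical.

```python
import random


def brier(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def paired_bootstrap_delta(y_true, prob_a, prob_b, metric, n_boot=1000, alpha=0.05, seed=0):
    """Return (delta, ci_low, ci_high, significant) for metric(B) - metric(A)."""
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both models
        labels = [y_true[i] for i in idx]
        score_a = metric(labels, [prob_a[i] for i in idx])
        score_b = metric(labels, [prob_b[i] for i in idx])
        deltas.append(score_b - score_a)
    deltas.sort()
    low = deltas[int(alpha / 2 * n_boot)]
    high = deltas[int((1 - alpha / 2) * n_boot) - 1]
    delta = metric(y_true, prob_b) - metric(y_true, prob_a)
    significant = not (low <= 0.0 <= high)  # True when the interval excludes zero
    return delta, low, high, significant


def clip(p):
    """Keep synthetic probabilities inside (0, 1)."""
    return min(max(p, 0.01), 0.99)


# Two synthetic models of similar quality (illustrative data only).
rng = random.Random(7)
y_true = [rng.randint(0, 1) for _ in range(500)]
prob_a = [clip(0.5 + (0.25 if y else -0.25) + rng.gauss(0, 0.25)) for y in y_true]
prob_b = [clip(0.5 + (0.27 if y else -0.27) + rng.gauss(0, 0.25)) for y in y_true]

delta, low, high, significant = paired_bootstrap_delta(y_true, prob_a, prob_b, brier)
print(f"Delta: {delta:+.4f}")
print(f"95% CI: [{low:.4f}, {high:.4f}]")
print(f"Significant: {significant}")
```

Pairing matters because both models see the same hard and easy examples in each resample, so shared noise cancels in the difference and the interval is tighter than one built from two independent bootstraps.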
Calibration Matters: Detecting and Fixing Misaligned Probabilities
A model’s predictive confidence should reflect reality. If a model predicts 90% probability, it should be correct roughly 90% of the time. Yet many production systems fail this basic test, leading to overconfident or underconfident decisions.
Diagnosing calibration issues
report_before = rb.evaluate(y_true, y_prob)
print(report_before["ECE"])
Output might show:
ECE=0.0821 [0.0612, 0.1034]
This Expected Calibration Error (ECE) of 0.0821 suggests the model’s confidence levels are misaligned with actual outcomes.
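For readers who want to see what the number measures, the standard binned ECE groups predictions into equal-width probability bins and averages the gap between predicted and observed positive rates, weighted by bin size: ECE = Σ_b (n_b / n) · |observed_b − predicted_b|. The sketch below is a plain-Python illustration of that textbook definition; the library's exact binning scheme is not described in this article, so `expected_calibration_error` and its `n_bins` default are assumptions.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned ECE for binary labels and positive-class probabilities."""
    counts = [0] * n_bins
    label_sums = [0.0] * n_bins
    prob_sums = [0.0] * n_bins
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        label_sums[b] += y
        prob_sums[b] += p
    n = len(y_true)
    ece = 0.0
    for count, label_sum, prob_sum in zip(counts, label_sums, prob_sums):
        if count:
            observed = label_sum / count   # empirical positive rate in the bin
            predicted = prob_sum / count   # mean predicted probability in the bin
            ece += (count / n) * abs(observed - predicted)
    return ece


# Well calibrated: 75% predicted and 3 of 4 samples are positive.
print(expected_calibration_error([1, 1, 1, 0], [0.75] * 4))
# Overconfident: 90% predicted but only half of the samples are positive.
print(expected_calibration_error([1, 0], [0.9, 0.9]))
```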
Recalibrating probabilities
The library supports multiple recalibration methods:
- Temperature scaling
- Isotonic regression
- Platt scaling
- Beta calibration
- Histogram binning
- Vector scaling
- Matrix scaling
cal = rb.recalibrate(
    y_true,
    y_prob,
    method="temperature"
)
y_prob_cal = cal.predict(y_prob_test)
Verifying improvements
report_after = rb.evaluate(y_true_test, y_prob_cal)
print(report_after["ECE"])
Output may show:
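ECE=0.0241 [0.0143, 0.0352]
The recalibration reduced ECE from 0.0821 to 0.0241, improving alignment between predicted and actual probabilities. For a like-for-like comparison, the baseline ECE should be measured on the same held-out test split (y_true_test, y_prob_test) that is used after recalibration.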
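As an illustration of the simplest method in the list, temperature scaling for a binary model divides the logit of each predicted probability by a single scalar T and picks the T that minimizes negative log-likelihood on a calibration set. The sketch below is an assumption-laden toy version: the grid search, the helper names, and the synthetic overconfident model are all hypothetical and are not taken from reliably-metrics.

```python
import math
import random


def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)


def apply_temperature(y_prob, temperature):
    """Rescale probabilities by dividing their logits by the temperature."""
    return [sigmoid(logit(p) / temperature) for p in y_prob]


def fit_temperature(y_true, y_prob):
    """Pick the temperature on a coarse grid (0.05 to 10.0) that minimizes NLL."""
    grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: nll(y_true, apply_temperature(y_prob, t)))


# Synthetic overconfident model: the true logit is z, but the model reports 2 * z.
rng = random.Random(3)
z = [rng.gauss(0.0, 1.5) for _ in range(2000)]
y_cal = [1 if rng.random() < sigmoid(v) else 0 for v in z]
p_cal = [sigmoid(2.0 * v) for v in z]

temperature = fit_temperature(y_cal, p_cal)
p_fixed = apply_temperature(p_cal, temperature)
print(f"T={temperature:.2f}  NLL before={nll(y_cal, p_cal):.4f}  after={nll(y_cal, p_fixed):.4f}")
```

A fitted temperature above 1 softens overconfident predictions, and one below 1 sharpens underconfident ones. In this toy the fit and the evaluation share one dataset for brevity; in practice the temperature would be fitted on a calibration split and checked on a separate test split.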
Visualizing Uncertainty With Reliability Diagrams
Standard calibration plots show a line without context, forcing teams to interpret gaps subjectively. reliably-metrics enhances this with confidence bands that distinguish real calibration errors from random variations.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
report.reliability_diagram(
    y_true,
    y_prob,
    ax=ax,
    band=True
)
plt.savefig("calibration.png", dpi=150)
The shaded regions represent bootstrap confidence bands around the calibration curve, providing clearer insights into where the model’s confidence matches reality.
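The band itself can be understood as a per-bin bootstrap: resample the test set, recompute the observed positive rate in each probability bin, and take percentiles across resamples. The following sketch shows that idea without any plotting; `reliability_bands` is a hypothetical helper and the article does not say the library computes its bands this way.

```python
import random


def reliability_bands(y_true, y_prob, n_bins=10, n_boot=300, alpha=0.05, seed=0):
    """Per-bin (low, observed, high) positive rates; None for bins with no data."""
    rng = random.Random(seed)
    n = len(y_true)
    bin_of = [min(int(p * n_bins), n_bins - 1) for p in y_prob]

    def bin_rates(indices):
        pos = [0] * n_bins
        cnt = [0] * n_bins
        for i in indices:
            cnt[bin_of[i]] += 1
            pos[bin_of[i]] += y_true[i]
        return [pos[b] / cnt[b] if cnt[b] else None for b in range(n_bins)]

    point = bin_rates(range(n))
    samples = [[] for _ in range(n_bins)]
    for _ in range(n_boot):
        rates = bin_rates([rng.randrange(n) for _ in range(n)])
        for b, rate in enumerate(rates):
            if rate is not None:  # a bin can be empty in a given resample
                samples[b].append(rate)

    bands = []
    for b in range(n_bins):
        s = sorted(samples[b])
        if point[b] is None or not s:
            bands.append(None)
            continue
        low = s[int(alpha / 2 * len(s))]
        high = s[min(int((1 - alpha / 2) * len(s)), len(s) - 1)]
        bands.append((low, point[b], high))
    return bands


# Synthetic, roughly calibrated predictions (illustrative data only).
rng = random.Random(5)
y_prob = [rng.random() for _ in range(1000)]
y_true = [1 if rng.random() < p else 0 for p in y_prob]

for b, band in enumerate(reliability_bands(y_true, y_prob)):
    if band is not None:
        low, observed, high = band
        print(f"bin {b}: observed={observed:.3f} band=[{low:.3f}, {high:.3f}]")
```

A bin whose band straddles the diagonal is consistent with good calibration at that confidence level, while a band that sits entirely above or below it points to a systematic gap.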
Generating Shareable Reports in Seconds
Collaboration often requires translating technical evaluations into accessible formats for stakeholders. reliably-metrics streamlines this process:
report.to_html(path="model_report.html")
The generated HTML report includes:
- Metric values and confidence intervals
- Calibration analysis
- Reliability diagrams with uncertainty bands
- Statistical comparison results
- Visualizations
This eliminates the need for manual report compilation or Jupyter notebook exports.
Design Principles Behind Reliably-Metrics
The library prioritizes performance, reproducibility, and accessibility through several key design choices:
1. Modular Dependency Management
Core functionality installs with minimal dependencies:
pip install reliably-metrics
Optional features are available through extras:
- Visualization support:
pip install reliably-metrics[viz]
- HTML reporting:
pip install reliably-metrics[report]
- All features:
pip install reliably-metrics[all]
Heavy dependencies load only when needed, reducing installation complexity.
2. Vectorized Bootstrap for Speed
Traditional bootstrap implementations use Python loops, slowing down computations. reliably-metrics generates all bootstrap indices upfront and performs calculations using vectorized NumPy operations, resulting in faster execution and better scalability.
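The general pattern can be sketched in a few lines of NumPy: draw one (n_boot, n) matrix of indices, gather the per-sample losses with fancy indexing, and reduce along the sample axis. This is an illustration of the technique under stated assumptions, not the library's source; `vectorized_bootstrap_brier` is a hypothetical name, and the Brier score is used because it is a simple mean of per-sample terms.

```python
import numpy as np


def vectorized_bootstrap_brier(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Brier score with a percentile bootstrap interval, with no Python loop."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = y_true.shape[0]

    per_sample = (y_prob - y_true) ** 2            # shape (n,)
    idx = rng.integers(0, n, size=(n_boot, n))     # all resamples drawn at once
    boot = per_sample[idx].mean(axis=1)            # shape (n_boot,)
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(per_sample.mean()), float(low), float(high)


# Synthetic, roughly calibrated predictions (illustrative data only).
data_rng = np.random.default_rng(0)
y_prob = data_rng.random(500)
y_true = (data_rng.random(500) < y_prob).astype(int)

point, low, high = vectorized_bootstrap_brier(y_true, y_prob)
print(f"Brier={point:.4f} [{low:.4f}, {high:.4f}]")
```

The trade-off is memory: the index matrix holds n_boot × n integers, so very large test sets may need the resamples processed in chunks. Rank-based metrics such as AUROC also need more work than a mean over per-sample terms.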
3. Deterministic Operations for Reproducibility
Every stochastic operation supports explicit seeding to ensure consistent results:
report = rb.evaluate(y_true, y_prob, seed=42)
This guarantees the same input data and seed will always produce identical outputs, critical for reproducibility in research and production.
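A typical way to get this behavior is to give each call its own random generator built from the seed, rather than relying on global random state. The sketch below shows the pattern with the standard library; `bootstrap_mean_ci` is a hypothetical function and says nothing about how rb.evaluate is implemented.

```python
import random


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=None):
    """Percentile bootstrap interval for the mean, driven by a private generator."""
    rng = random.Random(seed)  # local generator: unaffected by global random state
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]


# Illustrative data only.
data_rng = random.Random(0)
values = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

first = bootstrap_mean_ci(values, seed=42)
random.seed(999)  # disturbing the global generator does not change the result
second = bootstrap_mean_ci(values, seed=42)
print(first == second)
```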
4. Statistically Verified Confidence Intervals
The library’s test suite validates statistical rigor by generating synthetic datasets with known ground-truth metrics. Continuous integration checks that nominal 95% confidence intervals cover the true value approximately 95% of the time, ensuring reliability beyond theoretical claims.
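A coverage check of this kind can be sketched as a small simulation: fix a known true value, draw many synthetic test sets, build an interval on each, and count how often the interval contains the truth. The example below does this for plain accuracy with a percentile bootstrap; the function names, the true accuracy of 0.8, and the trial counts are assumptions for illustration and are not taken from the library's test suite.

```python
import numpy as np


def percentile_ci(sample, rng, n_boot=500, alpha=0.05):
    """Percentile bootstrap interval for the mean of a 1-D sample."""
    n = sample.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = sample[idx].mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])


def empirical_coverage(true_accuracy=0.8, n=200, n_trials=400, seed=0):
    """Fraction of simulated test sets whose interval contains the true accuracy."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # 1.0 where the simulated model is correct, 0.0 otherwise.
        correct = (rng.random(n) < true_accuracy).astype(float)
        low, high = percentile_ci(correct, rng)
        hits += int(low <= true_accuracy <= high)
    return hits / n_trials


coverage = empirical_coverage()
print(f"Empirical coverage of nominal 95% intervals: {coverage:.3f}")
```

Because the coverage estimate is itself noisy, an automated check along these lines would normally assert a tolerance band around 0.95 rather than exact equality.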
Extending to Advanced Use Cases: Disentanglement Metrics
For researchers working with generative models, VAEs, or self-supervised learning, reliably-metrics includes disentanglement evaluation metrics with confidence intervals:
from reliably.repr import disentanglement
results = disentanglement(
    z,
    factors,
    metrics=(
        "mig",
        "sap",
        "dci",
        "factorvae",
        "irs"
    )
)
print(results["mig"])
Output might show:
MIG=0.312 [0.271, 0.354]
Supported metrics include MIG, SAP, DCI, FactorVAE Score, and IRS, all reported with bootstrap confidence intervals for rigorous evaluation.
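As background on the first of these metrics, the Mutual Information Gap (MIG) averages, over ground-truth factors, the difference between the two largest mutual-information values any latent dimension has with that factor, normalized by the factor's entropy. The sketch below is a plain-Python point estimate under simplifying assumptions (discrete factors, equal-width binning of latents, at least two latent dimensions, no constant factors); the helper names are hypothetical and it does not reproduce the library's implementation or its confidence intervals.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """MI between two discrete sequences: H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def discretize(column, n_bins=20):
    """Equal-width binning of one continuous latent dimension."""
    low, high = min(column), max(column)
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 for a constant column
    return [min(int((v - low) / width), n_bins - 1) for v in column]


def mig(latents, factors, n_bins=20):
    """Mutual Information Gap; latents and factors are lists of per-sample rows."""
    latent_cols = [discretize(list(col), n_bins) for col in zip(*latents)]
    gaps = []
    for factor_col in zip(*factors):
        factor_col = list(factor_col)
        mi = sorted((mutual_information(z, factor_col) for z in latent_cols), reverse=True)
        gaps.append((mi[0] - mi[1]) / entropy(factor_col))
    return sum(gaps) / len(gaps)


# Toy data: two independent factors with four values each, on a full grid.
factors = [(a, b) for a in range(4) for b in range(4)] * 10
disentangled = [(float(a), float(b)) for a, b in factors]  # one latent per factor
entangled = [(float(a), float(a)) for a, b in factors]     # both latents copy factor 0

print(f"MIG disentangled: {mig(disentangled, factors):.3f}")
print(f"MIG entangled:    {mig(entangled, factors):.3f}")
```

Wrapping a function like this in a bootstrap over samples is one way to attach an interval to the score, in the same spirit as the intervals reported above.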
The Future of ML Evaluation: Moving Beyond Point Estimates
The days of shipping models based solely on AUROC or accuracy values are numbered. As ML systems grow in complexity and stakes, so does the need for statistically rigorous evaluation. Tools like reliably-metrics provide the missing layer of rigor that teams require to make informed deployment decisions, reducing the risk of deploying unreliable models into production environments.