GBIM recently overhauled its monitoring stack to align technical telemetry with measurable business outcomes. The team focused on three core pillars: custom Prometheus metrics tied to critical workflows, reliable end-to-end correlation IDs for request tracing, and automated k6 smoke tests that feed performance data directly into Grafana dashboards. While frontend analytics now tracks user activities via GA4, the primary evidence for CPL 6 compliance rests on Prometheus, Grafana, k6, and structured request logs.
The Observability Gap Before Implementation
Before these changes, GBIM’s monitoring stack—built on Prometheus and Grafana—lacked the depth required to answer key operational questions. The team identified several critical gaps:
- HTTP metrics dominated dashboards, offering little visibility into business outcomes like successful registrations or account activations.
- k6 tests existed, but no consistent pipeline ran them and pushed their results to Prometheus via remote write.
- Tracking failed requests between frontend and backend proved difficult due to inconsistent correlation ID propagation, validation, and response inclusion.
- User actions like registrations, account verifications, and application status updates lacked explicit monitoring signals.
Teams struggled to answer basic questions: How many user registrations fail due to validation errors? Do account activation failures correlate with token expiration or rate limiting? Without clear signals, debugging operational issues became a reactive process.
What Was Actually Delivered: Scope Clarity
The team documented precise boundaries to ensure claims about observability improvements aligned with actual implementation. Completed deliverables included:
- Backend exposing a `/api/metrics` endpoint with custom Prometheus metrics prefixed with `gbm_*`.
- Prometheus and Grafana configured to scrape backend metrics and k6 telemetry.
- A k6 smoke test script packaged as a Kubernetes Job, configured to send metrics to Prometheus.
- Frontend sending `X-Correlation-ID` headers, backend validating or generating UUIDs, and returning the same ID in responses.
- Backend logs enriched with `corr_id` fields for structured tracing.
- Frontend analytics helper for GA4, restricted to approved hosts and environments.
The team avoided overpromising by explicitly excluding unvalidated claims, such as undocumented log aggregation or untested GA4 event flows.
Three Pillars of the New Observability Stack
The solution centered on three interconnected components, each addressing a specific observability need.
1. Custom Prometheus Metrics for Business-Critical Workflows
The backend introduced custom Prometheus metrics in monitoring/metrics.py to track business outcomes. Metrics were incremented for authentication flows, account activations, admin verifications, and application status updates. Key metrics included:
- `gbm_auth_register_total{role,outcome}`
- `gbm_auth_activation_total{outcome}`
- `gbm_auth_reactivation_total{outcome}`
- `gbm_auth_email_send_duration_seconds{event,outcome}`
- `gbm_admin_account_verification_total{action,outcome}`
- `gbm_pengajuan_admin_status_update_total{status,outcome}`
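To make the `{role,outcome}` label scheme concrete, here is a minimal, stdlib-only sketch of a labeled counter rendered in the Prometheus text exposition format. It is not the code in monitoring/metrics.py, which presumably uses a Prometheus client library. The `LabeledCounter` class and the label values (`department_head`, `success`, `validation_error`) are assumptions for illustration; only the metric name and label names come from the list above.

```python
class LabeledCounter:
    """Hypothetical stand-in for a Prometheus counter with fixed label names."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}  # maps a tuple of label values to a running count

    def _key(self, labels):
        # Reject calls that omit a label or pass an unknown one.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        return tuple(labels[name] for name in self.label_names)

    def inc(self, **labels):
        key = self._key(labels)
        self._values[key] = self._values.get(key, 0) + 1

    def value(self, **labels):
        return self._values.get(self._key(labels), 0)

    def expose(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, count in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {count}")
        return "\n".join(lines)


# Metric name and label names are from the article; label values are assumed.
REGISTER_TOTAL = LabeledCounter(
    "gbm_auth_register_total",
    "Registration attempts by role and outcome",
    ["role", "outcome"],
)

REGISTER_TOTAL.inc(role="department_head", outcome="success")
REGISTER_TOTAL.inc(role="department_head", outcome="validation_error")
REGISTER_TOTAL.inc(role="department_head", outcome="validation_error")

print(REGISTER_TOTAL.expose())
```

Because each outcome is its own label value, a dashboard can compare successful and failed registrations by grouping on `outcome` rather than inferring failures from HTTP status codes.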
Additional domains like document uploads and service applications received specialized metrics. These metrics enabled dashboards in Grafana to answer business questions:
- How many registrations succeed versus fail validation?
- Do account activations frequently fail due to invalid tokens, expiration, or rate limits?
- Which steps in admin verification result in failures—list views, detail checks, or status updates?
- Why do application status updates fail—validation errors, missing data, or downstream service issues?
2. End-to-End Correlation IDs for Request Tracing
To trace requests from frontend to backend, the team implemented a correlation ID system. Frontend code in lib/api.ts added the X-Correlation-ID header to all API requests, including token refresh calls. The backend’s CorrelationIdMiddleware handled the following logic:
- Read the `X-Correlation-ID` header from incoming requests.
- Accept valid UUID values or generate new ones for missing or invalid IDs.
- Store the correlation ID in the request context.
- Return the same ID in the `X-Correlation-ID` response header.
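The middleware steps above can be sketched in framework-neutral Python. This is an illustration, not the actual CorrelationIdMiddleware, whose code is not shown here. The function names, the plain-dict headers, and the use of a `ContextVar` as the "request context" are all assumptions, as is the choice to accept only canonical hyphenated UUID strings.

```python
import uuid
from contextvars import ContextVar

HEADER_NAME = "X-Correlation-ID"

# Hypothetical stand-in for the per-request context.
current_corr_id = ContextVar("corr_id", default=None)


def resolve_correlation_id(raw_value):
    """Return raw_value if it is a canonical UUID string, else a fresh UUID4."""
    if raw_value:
        candidate = raw_value.strip()
        try:
            parsed = uuid.UUID(candidate)
        except ValueError:
            parsed = None
        # uuid.UUID also accepts braces and un-hyphenated hex; this sketch
        # only trusts the canonical form so the echoed value is predictable.
        if parsed is not None and str(parsed) == candidate.lower():
            return candidate
    return str(uuid.uuid4())


def handle_request(request_headers):
    """Resolve the ID, store it in context, and echo it in the response headers."""
    corr_id = resolve_correlation_id(request_headers.get(HEADER_NAME))
    current_corr_id.set(corr_id)
    return {HEADER_NAME: corr_id}


# A valid incoming ID is returned unchanged; an invalid one is replaced.
print(handle_request({HEADER_NAME: "123e4567-e89b-12d3-a456-426614174000"}))
print(handle_request({HEADER_NAME: "not-a-uuid"}))
```

Replacing invalid values instead of rejecting the request keeps tracing from ever blocking real traffic, at the cost that a malformed client ID will not match what the client sent.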
Backend logs were enriched using CorrelationIdFilter, embedding the corr_id field in every log line. When errors occurred, frontend developers could use the response correlation ID to locate the exact backend logs for debugging.
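A logging filter of this kind can be sketched with the standard `logging` module. The class name matches the article, but the body is an assumption about how such a filter might work. The `ContextVar`, the logger name `gbim.sketch`, the log format, and the sample message are hypothetical. Output goes to an in-memory stream only so the result can be inspected.

```python
import io
import logging
from contextvars import ContextVar

# Hypothetical per-request storage; "-" marks log lines emitted outside a request.
corr_id_var = ContextVar("corr_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every record as `corr_id`."""

    def filter(self, record):
        record.corr_id = corr_id_var.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s corr_id=%(corr_id)s %(message)s")
)

logger = logging.getLogger("gbim.sketch")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Simulate a request whose ID was already resolved by the middleware.
corr_id_var.set("123e4567-e89b-12d3-a456-426614174000")
logger.info("activation failed: token expired")

print(stream.getvalue(), end="")
```

With the field on every line, searching the logs for the ID returned in the `X-Correlation-ID` response header narrows the output to a single request.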
3. Automated k6 Smoke Tests with Prometheus Integration
The team leveraged k6 to generate synthetic telemetry that mirrored real user interactions. The implementation included:
- A k6 script (`k6/monitoring-smoke.js`) defining smoke tests for critical endpoints.
- A Kubernetes Job (`k8s/job/k6-monitoring-smoke.yaml`) to execute tests on a schedule.
- Configuration to send metrics to Prometheus using the `experimental-prometheus-rw` output mode.
- Remote write endpoint at `…`
- Metric tagging with `testid=monitoring-smoke` for easy filtering in Grafana.
Prometheus was configured with --web.enable-remote-write-receiver to accept streaming metrics. The smoke tests targeted key endpoints:
- `/api/monitoring/health/`
- `/api/metrics`
- `/api/auth/activation/?token=...` (testing invalid and rate-limited tokens)
- Optional registration flow for department heads if `ENABLE_REGISTER_FLOW=true`
Each test included X-Correlation-ID headers and X-Forwarded-Proto: https to ensure synthetic requests followed real-world patterns, including SSL termination and reverse proxy configurations.
Frontend Analytics as a Supplementary Signal
While backend metrics formed the core of observability, the frontend added GA4 event tracking for user activities. The lib/analytics.ts helper sent events only when:
- `NEXT_PUBLIC_GA_MEASUREMENT_ID` was defined.
- `NEXT_PUBLIC_APP_ENV` was set to `staging` or `production`.
- The runtime host matched an allowlist (e.g., `gbim-staging.ppl.cs.ui.ac.id`).
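The three conditions combine into a single predicate. The real helper lives in lib/analytics.ts and is written in TypeScript; the sketch below states the same logic in Python only to make the gating explicit. The function name, the `env` dictionary, and the single-entry allowlist are assumptions, and the real allowlist may contain more hosts.

```python
ALLOWED_ENVS = {"staging", "production"}

# Only the host named in the article; the actual allowlist is assumed, not known.
ALLOWED_HOSTS = {"gbim-staging.ppl.cs.ui.ac.id"}


def should_send_analytics(env, host):
    """Return True only when all three gating conditions hold."""
    measurement_id = env.get("NEXT_PUBLIC_GA_MEASUREMENT_ID")
    app_env = env.get("NEXT_PUBLIC_APP_ENV")
    return (
        bool(measurement_id)
        and app_env in ALLOWED_ENVS
        and host in ALLOWED_HOSTS
    )


# "G-EXAMPLE" is a placeholder measurement ID, not a real one.
staging_env = {
    "NEXT_PUBLIC_GA_MEASUREMENT_ID": "G-EXAMPLE",
    "NEXT_PUBLIC_APP_ENV": "staging",
}

print(should_send_analytics(staging_env, "gbim-staging.ppl.cs.ui.ac.id"))
print(should_send_analytics(staging_env, "localhost"))
```

Requiring all three conditions means a local build that happens to carry a measurement ID still sends nothing, because its host is not on the allowlist.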
Instrumented events included:
- Registration attempts (`register_submitted`, `register_success`, `register_failed`)
- Account activation statuses (`activation_verified`, `activation_expired`, `activation_used`)
- Admin verification actions (`admin_verification_list_viewed`, `admin_verification_detail_clicked`)
- Application status updates (`pengajuan_admin_status_updated`)
A critical detail: Next.js reads NEXT_PUBLIC_* environment variables at build time, so updating these variables required rebuilding and redeploying the frontend. Without this step, GA4 tags would not appear in the staging bundle, even if the analytics code was present.
Looking Ahead: From Monitoring to Actionable Insights
The GBIM team’s observability improvements mark a shift from reactive troubleshooting to proactive insights. By integrating business metrics, end-to-end tracing, and synthetic testing, teams can now correlate technical failures with user-facing outcomes. Future work may include expanding metric granularity, refining alerting rules, and integrating observability into CI/CD pipelines to catch regressions before deployment. With the foundation in place, the focus turns to using these signals to drive continuous improvement.