Enhancing GBIM Observability with Business Metrics and End-to-End Tracing

GBIM recently overhauled its monitoring stack to align technical telemetry with measurable business outcomes. The team focused on three core pillars: custom Prometheus metrics tied to critical workflows, reliable end-to-end correlation IDs for request tracing, and automated k6 smoke tests that feed performance data directly into Grafana dashboards. While frontend analytics now tracks user activities via GA4, the primary evidence for CPL 6 compliance rests on Prometheus, Grafana, k6, and structured request logs.

The Observability Gap Before Implementation

Before these changes, GBIM’s monitoring stack, built on Prometheus and Grafana, lacked the depth required to answer key operational questions. The team identified several critical gaps:

HTTP metrics dominated dashboards, offering little visibility into business outcomes like successful registrations or account activations.
k6 tests existed but lacked a consistent pipeline to execute and push results to Prometheus via remote write.
Tracking failed requests between frontend and backend proved difficult due to inconsistent correlation ID propagation, validation, and response inclusion.
User actions like registrations, account verifications, and application status updates lacked explicit monitoring signals.

Teams struggled to answer basic questions: How many user registrations fail due to validation errors? Do account activation failures correlate with token expiration or rate limiting? Without clear signals, debugging operational issues became a reactive process.

What Was Actually Delivered: Scope Clarity

The team documented precise boundaries to ensure claims about observability improvements aligned with actual implementation. Completed deliverables included:

Backend exposing a /api/metrics endpoint that serves custom Prometheus metrics carrying the gbm_ prefix.
Prometheus and Grafana configured to scrape backend metrics and k6 telemetry.
A k6 smoke test script packaged as a Kubernetes Job, configured to send metrics to Prometheus.
Frontend sending X-Correlation-ID headers, backend validating or generating UUIDs, and returning the same ID in responses.
Backend logs enriched with corr_id fields for structured tracing.
Frontend analytics helper for GA4, restricted to approved hosts and environments.

The team avoided overpromising by explicitly excluding unvalidated claims, such as undocumented log aggregation or untested GA4 event flows.

Three Pillars of the New Observability Stack

The solution centered on three interconnected components, each addressing a specific observability need.

1. Custom Prometheus Metrics for Business-Critical Workflows

The backend defined custom Prometheus metrics in monitoring/metrics.py to track business outcomes across authentication flows, account activations, admin verifications, and application status updates. Each metric carries an outcome label so that successes and failures can be separated in queries. Key metrics included:

gbm_auth_register_total{role,outcome}
gbm_auth_activation_total{outcome}
gbm_auth_reactivation_total{outcome}
gbm_auth_email_send_duration_seconds{event,outcome}
gbm_admin_account_verification_total{action,outcome}
gbm_pengajuan_admin_status_update_total{status,outcome}
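To make the name{labels} notation above concrete, here is a minimal, dependency-free sketch of a labeled counter and the text it would expose to a Prometheus scrape. It is not the project's monitoring/metrics.py: the LabeledCounter class, the record_registration helper, and the label values ("applicant", "success", "validation_error") are assumptions for illustration. Only the metric name and its label names come from the list above.

```python
from collections import defaultdict
from threading import Lock


class LabeledCounter:
    """Tiny stand-in for a Prometheus counter with labels (illustration only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)
        self._lock = Lock()

    def inc(self, **labels):
        # Require exactly the declared labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[name] for name in self.label_names)
        with self._lock:
            self._values[key] += 1

    def render(self):
        # Emit samples in the Prometheus text exposition format.
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


# Metric and label names are from the article; the label values are hypothetical.
REGISTER_TOTAL = LabeledCounter("gbm_auth_register_total", ["role", "outcome"])


def record_registration(role, outcome):
    """Hypothetical helper a registration view might call once per attempt."""
    REGISTER_TOTAL.inc(role=role, outcome=outcome)


record_registration("applicant", "success")
record_registration("applicant", "validation_error")
record_registration("applicant", "success")
print(REGISTER_TOTAL.render())
```

In a real backend, the Counter type from the official prometheus_client package would most likely take the place of this hand-rolled class. The point of the sketch is the data model: one series per combination of label values, with outcome as the dimension a dashboard splits on.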

Additional domains like document uploads and service applications received specialized metrics. These metrics enabled dashboards in Grafana to answer business questions:

How many registrations succeed versus fail validation?
Do account activations frequently fail due to invalid tokens, expiration, or rate limits?
Which steps in admin verification fail most often: list views, detail checks, or status updates?
Why do application status updates fail—validation errors, missing data, or downstream service issues?

2. End-to-End Correlation IDs for Request Tracing

To trace requests from frontend to backend, the team implemented a correlation ID system. Frontend code in lib/api.ts added the X-Correlation-ID header to all API requests, including token refresh calls. The backend’s CorrelationIdMiddleware handled the following logic:

Read the X-Correlation-ID header from incoming requests.
Accept valid UUID values or generate new ones for missing or invalid IDs.
Store the correlation ID in the request context.
Return the same ID in the X-Correlation-ID response header.
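The four steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the project's CorrelationIdMiddleware: the request is modeled as a plain dict of headers, and the names resolve_correlation_id, correlation_middleware, and corr_id_var are assumptions.

```python
import contextvars
import uuid

HEADER = "X-Correlation-ID"

# Request-scoped storage; "-" marks code running outside any request.
corr_id_var = contextvars.ContextVar("corr_id", default="-")


def resolve_correlation_id(raw):
    """Return the incoming value if it parses as a UUID, else a fresh UUID4."""
    candidate = (raw or "").strip()
    try:
        # uuid.UUID is lenient (it also accepts braces or missing hyphens);
        # a stricter format check may be preferable in production.
        uuid.UUID(candidate)
        return candidate
    except ValueError:
        return str(uuid.uuid4())


def correlation_middleware(get_response):
    """Wrap a handler that maps request headers to response headers."""

    def middleware(request_headers):
        corr_id = resolve_correlation_id(request_headers.get(HEADER))
        token = corr_id_var.set(corr_id)  # store in the request context
        try:
            response_headers = dict(get_response(request_headers))
        finally:
            corr_id_var.reset(token)
        response_headers[HEADER] = corr_id  # return the same ID to the caller
        return response_headers

    return middleware


def demo_view(request_headers):
    # A handler can read the ID without it being passed explicitly.
    return {"X-Seen-By-View": corr_id_var.get()}


app = correlation_middleware(demo_view)
print(app({HEADER: "123e4567-e89b-12d3-a456-426614174000"}))
print(app({HEADER: "not-a-uuid"}))
```

A real framework exposes headers case-insensitively, which the plain dict here does not model. Validating the incoming value before reuse keeps arbitrary client-supplied strings out of log lines.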

Backend logs were enriched using CorrelationIdFilter, embedding the corr_id field in every log line. When errors occurred, frontend developers could use the response correlation ID to locate the exact backend logs for debugging.
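As a rough illustration of how such a filter can work, the sketch below uses only the standard logging and contextvars modules. The class shares its name with the project's CorrelationIdFilter but is not its source. The corr_id_var variable, the logger name, and the log format are assumptions.

```python
import contextvars
import io
import logging

# Assumed request-scoped variable that a middleware would set per request.
corr_id_var = contextvars.ContextVar("corr_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every record as `corr_id`."""

    def filter(self, record):
        record.corr_id = corr_id_var.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # captured in memory so the output is easy to inspect
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s corr_id=%(corr_id)s %(message)s")
)

logger = logging.getLogger("gbim.sketch")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Simulate one request, then a log call outside any request.
token = corr_id_var.set("123e4567-e89b-12d3-a456-426614174000")
logger.info("activation failed: token expired")
corr_id_var.reset(token)
logger.info("outside a request")

print(stream.getvalue(), end="")
```

Attaching the filter to the handler rather than to a single logger means records from any logger that reaches the handler pick up the field. With the ID from a failed response in hand, a search for corr_id=<value> narrows the logs to that one request.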

3. Automated k6 Smoke Tests with Prometheus Integration

The team leveraged k6 to generate synthetic telemetry that mirrored real user interactions. The implementation included:

A k6 script (k6/monitoring-smoke.js) defining smoke tests for critical endpoints.
A Kubernetes Job (k8s/job/k6-monitoring-smoke.yaml) to execute tests on a schedule.
Configuration to send metrics to Prometheus using the experimental-prometheus-rw output mode.
A remote write endpoint on the Prometheus server to receive the k6 metrics.
Metric tagging with testid=monitoring-smoke for easy filtering in Grafana.
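The pieces above come together in how k6 is invoked. The sketch below assembles the command line and environment in Python as one possible way to wire them. The build_k6_invocation helper is hypothetical, and the URL is a placeholder because the article does not give the real endpoint. The -o experimental-prometheus-rw output, the --tag flag, and the K6_PROMETHEUS_RW_SERVER_URL variable are standard k6 options.

```python
import os


def build_k6_invocation(remote_write_url,
                        script="k6/monitoring-smoke.js",
                        test_id="monitoring-smoke"):
    """Return (argv, env) for running the smoke test with remote write output."""
    argv = [
        "k6", "run",
        "-o", "experimental-prometheus-rw",  # stream metrics via remote write
        "--tag", f"testid={test_id}",        # lets Grafana filter on testid
        script,
    ]
    # k6 reads the remote write target from this environment variable.
    env = dict(os.environ, K6_PROMETHEUS_RW_SERVER_URL=remote_write_url)
    return argv, env


# Placeholder URL: the real endpoint is not stated in the article.
argv, env = build_k6_invocation("http://prometheus.example:9090/api/v1/write")
print(" ".join(argv))
print(env["K6_PROMETHEUS_RW_SERVER_URL"])

# To actually run it (requires k6 on PATH):
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

In the Kubernetes Job, the same values would typically appear as the container's args and env entries rather than being built in Python.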

Prometheus was configured with --web.enable-remote-write-receiver to accept streaming metrics. The smoke tests targeted key endpoints:

/api/monitoring/health/
/api/metrics
/api/auth/activation/?token=... (testing invalid and rate-limited tokens)
Optional registration flow for department heads if ENABLE_REGISTER_FLOW=true

Each test sent an X-Correlation-ID header and X-Forwarded-Proto: https, so that synthetic requests resembled real traffic arriving through the TLS-terminating reverse proxy and could be traced in backend logs like any other request.
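The actual checks live in the k6 script, which is JavaScript. As a language-neutral illustration of what one such check does, here is a standard-library Python sketch. The function names, the default expected status, and the rule that the response must echo the correlation ID are assumptions, not a port of k6/monitoring-smoke.js.

```python
import urllib.error
import urllib.request
import uuid


def smoke_headers():
    """Headers every synthetic request carries."""
    return {
        "X-Correlation-ID": str(uuid.uuid4()),
        "X-Forwarded-Proto": "https",  # mimic traffic behind the TLS proxy
    }


def build_request(base_url, path, headers):
    return urllib.request.Request(base_url.rstrip("/") + path, headers=headers)


def evaluate(status, echoed_id, sent_id, expected_status):
    """A check passes when the status matches and the ID is echoed back."""
    return status == expected_status and echoed_id == sent_id


def check(base_url, path, expected_status=200, timeout=5.0):
    """Send one request and report whether it passed (performs network I/O)."""
    headers = smoke_headers()
    request = build_request(base_url, path, headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            echoed = response.headers.get("X-Correlation-ID")
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a status and headers worth checking.
        status = err.code
        echoed = err.headers.get("X-Correlation-ID")
    passed = evaluate(status, echoed, headers["X-Correlation-ID"], expected_status)
    return {"path": path, "status": status, "passed": passed}


# Build (but do not send) a request for the health endpoint named above.
# The base URL is a placeholder.
demo = build_request("https://gbim.example/", "/api/monitoring/health/", smoke_headers())
print(demo.full_url)
```

Treating HTTPError as a result rather than a failure matters for endpoints such as the activation check, where an error status is the expected outcome of sending an invalid token.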

Frontend Analytics as a Supplementary Signal

While backend metrics formed the core of observability, the frontend added GA4 event tracking for user activities. The lib/analytics.ts helper sent events only when:

NEXT_PUBLIC_GA_MEASUREMENT_ID was defined.
NEXT_PUBLIC_APP_ENV was set to staging or production.
The runtime host matched an allowlist (e.g., gbim-staging.ppl.cs.ui.ac.id).
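The three conditions amount to a single predicate. The real helper is TypeScript (lib/analytics.ts). The sketch below restates the gating logic in Python purely as an illustration: the analytics_enabled function, the shape of its inputs, and the measurement ID are assumptions, and the allowlist contains only the one host the article names.

```python
ALLOWED_ENVS = {"staging", "production"}
# Only the host named in the article; the real allowlist may contain more.
ALLOWED_HOSTS = {"gbim-staging.ppl.cs.ui.ac.id"}


def analytics_enabled(env, host):
    """Return True only when all three gating conditions hold."""
    measurement_id = env.get("NEXT_PUBLIC_GA_MEASUREMENT_ID", "").strip()
    app_env = env.get("NEXT_PUBLIC_APP_ENV", "")
    return (
        bool(measurement_id)
        and app_env in ALLOWED_ENVS
        and host.lower() in ALLOWED_HOSTS
    )


staging_env = {
    "NEXT_PUBLIC_GA_MEASUREMENT_ID": "G-EXAMPLE",  # placeholder ID
    "NEXT_PUBLIC_APP_ENV": "staging",
}
print(analytics_enabled(staging_env, "gbim-staging.ppl.cs.ui.ac.id"))
print(analytics_enabled(staging_env, "localhost"))
```

Requiring all three conditions keeps local development and preview hosts from sending events, even when a measurement ID happens to be present in the environment.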

Instrumented events included:

Registration attempts (register_submitted, register_success, register_failed)
Account activation statuses (activation_verified, activation_expired, activation_used)
Admin verification actions (admin_verification_list_viewed, admin_verification_detail_clicked)
Application status updates (pengajuan_admin_status_updated)

A critical detail: Next.js reads NEXT_PUBLIC_* environment variables at build time, so updating these variables required rebuilding and redeploying the frontend. Without this step, GA4 tags would not appear in the staging bundle, even if the analytics code was present.

Looking Ahead: From Monitoring to Actionable Insights

The GBIM team’s observability work shifts operations from reactive troubleshooting toward proactive insight. With business metrics, end-to-end tracing, and synthetic testing in place, teams can correlate technical failures with user-facing outcomes. Future work may include finer-grained metrics, refined alerting rules, and observability checks in CI/CD pipelines that catch regressions before deployment. With the foundation set, the focus turns to using these signals to drive continuous improvement.
