iToverDose/Software· 11 JUNE 2026 · 16:07

Track Video Engagement with HyperLogLog in PostgreSQL Heatmaps

Discover how TopVideoHub replaced million-row event tables with compact HyperLogLog sketches to generate accurate video heatmaps without the storage or query overhead. Learn the ingestion pipeline and SQL optimizations that make it work at scale.

DEV Community · 4 min read

A single viral video can generate half a million watch sessions in a matter of hours, but most of those views are concentrated on just a few seconds. Product teams want to know which moments hold attention and how many distinct viewers each moment attracts. Traditional approaches inflate storage with millions of rows and slow down every analytics query with expensive COUNT(DISTINCT) operations. At TopVideoHub, a mid-size Asia-Pacific video aggregator, engineers found a better way by trading exact counts for fast, compact approximations using HyperLogLog inside PostgreSQL.

From event floods to sketch-based analytics

An early design used a watch_events table with one row per (user, video, second). A 9-minute clip watched in full by one visitor ballooned into 540 rows. Extend that across thousands of videos and a few hundred thousand daily sessions, and the table grew by hundreds of millions of rows per day. The table could answer detailed questions, but it became a storage and performance liability. Every time a product manager opened the analytics dashboard, the system ran COUNT(DISTINCT user_id) GROUP BY second, scanning entire partitions to compute exact counts for questions that only needed approximate answers.

Recognizing that “41,000 unique viewers saw the hook at 0:08” is just as actionable as “41,287,” TopVideoHub adopted HyperLogLog, a probabilistic data structure that estimates set cardinality using a fixed memory footprint. The decision to use HyperLogLog was driven by three requirements: constant storage per bucket, the ability to merge sketches from different regions, and tunable accuracy. The team chose the postgresql-hll extension from Citus/Aggregate Knowledge, a battle-tested implementation already embedded in many PostgreSQL deployments.

How HyperLogLog powers fast, scalable heatmaps

HyperLogLog estimates cardinality by hashing each element, using part of the hash to pick one of many registers, and recording in that register the longest run of leading zeros seen in the rest of the hash. A bias-corrected harmonic mean across the registers then yields the estimate. For video heatmaps, the key properties are:

  • Bounded size: A sketch with log2m=12 (4096 registers) consumes at most about 2.5 KB, assuming 5-bit registers, whether it represents one viewer or fifty million. The extension can store small sets in more compact forms, so sparse buckets take less.
  • Unionable: The union of two HyperLogLog sketches produces a new sketch representing the combined set, enabling region-level or global heatmaps without rescanning raw events.
  • Tunable error: At log2m=12, the standard error is around 1.6% (roughly 1.04/√4096), imperceptible in heatmap visualizations.
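The three properties above can be seen in a toy implementation. The sketch below is an educational HyperLogLog in pure Python, not the postgresql-hll algorithm or its storage format; the class name `ToyHLL`, the choice of BLAKE2b as the hash, and the small-range correction are illustrative assumptions.

```python
import hashlib
import math


class ToyHLL:
    """Educational HyperLogLog. Not compatible with postgresql-hll sketches."""

    def __init__(self, log2m: int = 12) -> None:
        self.log2m = log2m
        self.m = 1 << log2m                  # number of registers
        self.registers = bytearray(self.m)   # fixed size, whatever the input volume

    def add(self, item: str) -> None:
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        width = 64 - self.log2m
        index = h >> width                   # top log2m bits pick a register
        rest = h & ((1 << width) - 1)        # remaining bits determine the rank
        rank = width - rest.bit_length() + 1 # leading zeros in `rest`, plus one
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other: "ToyHLL") -> "ToyHLL":
        if self.log2m != other.log2m:
            raise ValueError("sketches must share log2m to be merged")
        merged = ToyHLL(self.log2m)
        # The union of two sketches is the register-wise maximum.
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small sets: fall back to linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    japan, korea = ToyHLL(), ToyHLL()
    for i in range(30_000):
        japan.add(f"viewer-{i}")
    for i in range(20_000, 50_000):          # 10,000 viewers overlap with `japan`
        korea.add(f"viewer-{i}")
    both = japan.union(korea)
    print(round(japan.cardinality()), round(both.cardinality()))
    print(f"theoretical standard error: {1.04 / math.sqrt(japan.m):.2%}")
```

The merged sketch estimates the 50,000 distinct viewers across both sets without double-counting the overlap, and it occupies the same 4096 registers as either input.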

The team modeled the heatmap as a separate analytics store in PostgreSQL, storing one sketch per (video_id, region, bucket_sec) tuple. The schema uses the hll type to keep the serialized sketch directly in each row, minimizing storage while enabling fast reads.

CREATE TABLE video_heatmap (
  video_id    TEXT NOT NULL,
  region      TEXT NOT NULL,
  bucket_sec  INTEGER NOT NULL,
  viewers     HLL NOT NULL,
  PRIMARY KEY (video_id, region, bucket_sec)
);

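-- Optional covering index for index-only scans. It repeats the primary-key
-- columns and copies every sketch into the index, so it costs extra storage.
CREATE INDEX video_heatmap_lookup
  ON video_heatmap (video_id, region, bucket_sec) INCLUDE (viewers);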
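To put the schema in perspective, here is a rough sizing sketch. It assumes dense sketches with 5-bit registers (consistent with the 2.5 KB figure above), ignores per-row and header overhead, and uses a hypothetical count of 10 regions; the raw-event figure assumes every session watches the whole clip, which is a worst case.

```python
def dense_sketch_bytes(log2m: int, regwidth: int = 5) -> int:
    """Bytes for a fully populated sketch: 2**log2m registers of regwidth bits."""
    return (1 << log2m) * regwidth // 8


def heatmap_upper_bound_bytes(duration_sec: int, regions: int, log2m: int = 12) -> int:
    """Worst-case sketch storage for one video: one dense sketch per (region, second)."""
    return duration_sec * regions * dense_sketch_bytes(log2m)


def raw_event_rows(sessions: int, duration_sec: int) -> int:
    """Rows in a one-row-per-(user, video, second) table if every session watches fully."""
    return sessions * duration_sec


if __name__ == "__main__":
    print(dense_sketch_bytes(12))              # 2560
    print(heatmap_upper_bound_bytes(540, 10))  # 13824000
    print(raw_event_rows(500_000, 540))        # 270000000
```

Under these assumptions a 9-minute video tops out near 13.8 MB of sketches however many people watch it, while the event-table design adds rows with every session.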

PostgreSQL’s hll extension provides both aggregate (hll_add_agg, hll_union_agg) and scalar functions (hll_add, hll_cardinality, || for union) to manipulate sketches directly within SQL.

Ingesting watch progress without the row explosion

Rather than writing a row per heartbeat, the pipeline updates sketches directly. The player sends anonymous heartbeats every few seconds, each carrying the range of seconds the viewer watched since the last beat. The ingestion pipeline hashes each viewer identifier to a 64-bit integer, then folds the range into the HyperLogLog sketches using a batched upsert.

import hashlib
import os
import psycopg  # psycopg 3; the caller opens the connection, e.g. psycopg.connect(DB)

DB = os.environ["ANALYTICS_DSN"]
VIEWER_SALT = os.environ["VIEWER_SALT"].encode()

def viewer_hash(viewer_id: str) -> int:
    """Map a viewer ID to a salted, signed 64-bit integer that fits a BIGINT."""
    digest = hashlib.blake2b(
        viewer_id.encode() + VIEWER_SALT,
        digest_size=8
    ).digest()
    return int.from_bytes(digest, "big", signed=True)

def record_watch(
    conn,
    video_id: str,
    region: str,
    start_sec: int,
    end_sec: int,
    viewer_id: str
) -> None:
    """Add one viewer to every per-second sketch in [start_sec, end_sec]."""
    h = viewer_hash(viewer_id)
    # hll_empty() with no arguments uses the extension's default parameters.
    # The caller commits the transaction (or uses an autocommit connection).
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO video_heatmap (video_id, region, bucket_sec, viewers)
            SELECT %(vid)s, %(region)s, g.b,
                   hll_add(hll_empty(), hll_hash_bigint(%(h)s))
            FROM generate_series(%(start)s, %(end)s) AS g(b)
            ON CONFLICT (video_id, region, bucket_sec)
            DO UPDATE SET viewers = video_heatmap.viewers || EXCLUDED.viewers;
            """,
            {"vid": video_id, "region": region, "h": h, "start": start_sec, "end": end_sec},
        )

The worker consumes heartbeats from a queue, batches them, and applies the sketch update in a single statement per viewer range. By expanding the watched range with generate_series, each batch efficiently writes the viewer’s presence across every second they watched without ever storing individual events.
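One way the batching step might look is sketched below. The article does not show the worker, so the `Heartbeat` tuple and the `coalesce` helper are hypothetical: they merge overlapping or adjacent ranges for the same viewer so that each merged range needs only one upsert statement.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Heartbeat(NamedTuple):
    video_id: str
    region: str
    viewer_id: str
    start_sec: int
    end_sec: int  # inclusive, matching generate_series(start, end)


def coalesce(beats: Iterable[Heartbeat]) -> list[Heartbeat]:
    """Merge overlapping or adjacent ranges per (video, region, viewer)."""
    grouped: dict[tuple[str, str, str], list[tuple[int, int]]] = defaultdict(list)
    for b in beats:
        grouped[(b.video_id, b.region, b.viewer_id)].append((b.start_sec, b.end_sec))

    merged: list[Heartbeat] = []
    for (video_id, region, viewer_id), ranges in grouped.items():
        ranges.sort()
        cur_start, cur_end = ranges[0]
        for start, end in ranges[1:]:
            if start <= cur_end + 1:          # overlaps or touches the current range
                cur_end = max(cur_end, end)
            else:                             # gap (e.g. the viewer skipped ahead)
                merged.append(Heartbeat(video_id, region, viewer_id, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append(Heartbeat(video_id, region, viewer_id, cur_start, cur_end))
    return merged


if __name__ == "__main__":
    batch = [
        Heartbeat("v1", "JP", "a", 0, 4),
        Heartbeat("v1", "JP", "a", 5, 9),
        Heartbeat("v1", "JP", "a", 20, 24),
        Heartbeat("v1", "JP", "b", 3, 6),
    ]
    for beat in coalesce(batch):
        print(beat)  # three ranges: a 0-9, a 20-24, b 3-6
```

Each merged range can then be passed to record_watch. Because adding the same viewer to a sketch twice does not change it, replayed or duplicated heartbeats are harmless.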

Behind the scenes, TopVideoHub’s main application runs on PHP 8.4 behind LiteSpeed and Cloudflare, while search uses SQLite FTS5. The analytics workload is delegated to a dedicated PostgreSQL instance, where HyperLogLog ties storage growth to the number of buckets rather than the number of views and keeps queries sub-millisecond.

Building a global heatmap in real time

With sketches stored per (video, region, second), generating a regional heatmap is as simple as running hll_cardinality(viewers) per bucket and stacking the results. Want a global view? Union the sketches across regions. Need a coarser view? Union adjacent buckets to go from per-second to per-10-second resolution. Read cost scales with the number of sketches, not the volume of events, making it practical to run global heatmaps across millions of daily sessions without provisioning extra compute or storage.
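A minimal sketch of such a query, assuming the video_heatmap table above and a psycopg-style connection object. The function name `fetch_global_heatmap` and the `width_sec` parameter are illustrative, not from the original pipeline; hll_union_agg and hll_cardinality are the extension functions named earlier.

```python
# Union sketches across all regions and across each window of `width` seconds,
# then estimate distinct viewers per window.
GLOBAL_HEATMAP_SQL = """
SELECT (bucket_sec / %(width)s) * %(width)s AS bucket_start,
       hll_cardinality(hll_union_agg(viewers)) AS approx_viewers
FROM video_heatmap
WHERE video_id = %(vid)s
GROUP BY 1
ORDER BY 1
"""


def fetch_global_heatmap(conn, video_id: str, width_sec: int = 10) -> list[tuple[int, int]]:
    """Return (window start in seconds, approximate distinct viewers) pairs."""
    if width_sec < 1:
        raise ValueError("width_sec must be a positive number of seconds")
    with conn.cursor() as cur:
        cur.execute(GLOBAL_HEATMAP_SQL, {"vid": video_id, "width": width_sec})
        # hll_cardinality returns a floating-point estimate; round it for display.
        return [(int(start), round(viewers)) for start, viewers in cur.fetchall()]
```

Passing width_sec=1 keeps per-second resolution, and adding a region filter to the WHERE clause would give a single-region heatmap. A viewer who watched several seconds in the same window is counted once for that window, since the union deduplicates.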

For product teams, the shift from exact counts to approximate but fast insights unlocked real-time experimentation and rapid iteration on video curation. The architecture demonstrates how embracing probabilistic data structures can turn an intractable storage problem into a maintainable, performant one.
