43 Bytes per Series: How I Compressed OTLP labels with Packed KeySets

Jan 18, 2026


When you ingest OTLP metrics, every datapoint carries a labelset (metric name plus label pairs). The LabelSetStore is the hot path that maps a canonical labelset to a SeriesRef and deduplicates the series. The storage layout you choose here determines both memory footprint and ingestion CPU.

In the previous article we discussed SymbolTable.

In this one we build different versions of LabelSetStore. This work is inspired by https://habr.com/ru/companies/flant/articles/878282/ (in Russian).

This article walks through three LabelSetStore implementations in Chronoxide, plus two sealed snapshots of the KeySet store for maximum density:

NaiveLabelSetStore – intentionally inefficient, uses owned strings per series.
FlatInternedLabelSetStore – interns keys/values and stores label pairs in a flat arena.
KeySetDictEncodedLabelSetStore – groups by keyset and dictionary-encodes values.
FixedWidthPackedKeySetLabelSetStore – read-only, byte-aligned packing (0/1/2/4 bytes per key).
BitPackedKeySetLabelSetStore – read-only, bit-packed storage for maximum compression.

TL;DR

Naive is easy to reason about, but it explodes memory and allocator pressure.
FlatInterned is faster and far more memory efficient with almost no fragmentation.
KeySetDictEncoded minimizes memory by sharing keys and dictionary-encoding values.
FixedWidthPackedKeySet and BitPackedKeySet (sealed, read-only snapshots produced at report time) win on memory: ~67/52 bytes per series (Allocated/Used) for fixed-width, and ~58/43 for bit-packed (vs ~233/210 for FlatInterned).

Baseline: NaiveLabelSetStore

The naive store keeps each labelset as its own vector of owned strings:

Vec<Vec<OwnedKeyValue>>
OwnedKeyValue { key: String, value: String }

Each series allocates:

a separate Vec header (ptr/len/cap),
per-label heap allocations for String keys and values,
and hash map bookkeeping for series lookup.

This is correct but hostile to memory: millions of small allocations amplify allocator overhead and internal fragmentation.

Memory blowup in practice

On a 400k-message ingest run, RSS rises to ~35 GiB in under a minute:

This uses the same capture as the 11M-message workload below; the chart shows the first ~400k messages.

This is not a pathological input. The naive layout stores owned strings per series, and its encode path allocates OwnedKeyValue (new String key/value) before it can even check whether the labelset already exists. That means every intern attempt allocates, even on cache hits, which amplifies allocator churn and keeps RSS high.
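To make the "allocate before lookup" problem concrete, here is a minimal sketch of such an encode path. Only OwnedKeyValue comes from the layout above; NaiveStore, its fields, and the HashMap index are assumptions made for illustration, not Chronoxide's actual code.

```rust
use std::collections::HashMap;

/// One owned label pair, as in the naive layout described above.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct OwnedKeyValue {
    key: String,
    value: String,
}

/// Hypothetical naive store: one Vec of owned pairs per series.
#[derive(Default)]
struct NaiveStore {
    series: Vec<Vec<OwnedKeyValue>>,
    index: HashMap<Vec<OwnedKeyValue>, u32>,
}

impl NaiveStore {
    /// Returns the series ref for `labels`, inserting it if it is new.
    fn intern(&mut self, labels: &[(&str, &str)]) -> u32 {
        // The owned copy is built *before* the lookup, so even a hit pays
        // for one Vec plus two Strings per label.
        let owned: Vec<OwnedKeyValue> = labels
            .iter()
            .map(|(k, v)| OwnedKeyValue {
                key: k.to_string(),
                value: v.to_string(),
            })
            .collect();
        if let Some(&id) = self.index.get(&owned) {
            return id; // hit: `owned` is dropped, its allocations were wasted
        }
        let id = self.series.len() as u32;
        self.series.push(owned.clone());
        self.index.insert(owned, id);
        id
    }
}
```

Calling intern twice with the same labels returns the same ref, but the second call still builds and drops a full owned copy of the labelset.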

Criterion results

Intern unique: 15.042 ms
Visit 50k: 331.69 us

Naive allocation profile (100k series)

TrackingAllocator output:

Alloc Calls: 1,100,017
Req Current: 38,573,968B
Usable Current: 55,782,208B
Internal Fragmentation: 30.85%
Estimate Used Bytes: 37,200,128B
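
The fragmentation figure is consistent with the usual definition: bytes the allocator handed out beyond what was requested, as a share of usable bytes. The sketch below assumes that is how TrackingAllocator derives it; the function name is hypothetical.

```rust
/// Internal fragmentation as a fraction of usable bytes: space the
/// allocator handed out but the program never asked for.
fn internal_fragmentation(requested: u64, usable: u64) -> f64 {
    (usable - requested) as f64 / usable as f64
}
```

With the numbers above, (55,782,208 - 38,573,968) / 55,782,208 comes out at about 30.85%. The call count tells the same story: 1,100,017 allocations over 100k series is roughly 11 allocations per series.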

Why we need a better layout

Naive storage is correct but too expensive. We need to:

avoid per-series Vec allocations,
intern repeated strings instead of storing String for every series,
and ideally reuse keysets to compress labelsets further.

Improved: FlatInternedLabelSetStore

FlatInternedLabelSetStore fixes the two big problems:

It interns keys and values using SymbolTable, so repeated strings are stored once.
It stores all label pairs in a single flat Vec<InternedKeyValue>, with per-series offsets (SeriesLoc) pointing to slices inside the flat array.

This removes the per-series Vec and per-string heap allocations, and it preserves fast labelset reads by slicing the flat array.

It is a drop-in performance win without changing query semantics. See the comparison section for the numbers.
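
A minimal sketch of the flat layout, under a few assumptions: the Symbols type is a toy stand-in for SymbolTable, ids are plain u32, and the hash-to-candidates index and scratch buffer are illustrative choices rather than the store's real internals. InternedKeyValue and SeriesLoc follow the names used above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct InternedKeyValue {
    key: u32,   // symbol id of the label key
    value: u32, // symbol id of the label value
}

/// Where one series' pairs live inside the flat arena.
#[derive(Clone, Copy)]
struct SeriesLoc {
    offset: u32,
    len: u32,
}

/// Toy stand-in for the SymbolTable from the previous article.
#[derive(Default)]
struct Symbols {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Symbols {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }
}

#[derive(Default)]
struct FlatStore {
    symbols: Symbols,
    pairs: Vec<InternedKeyValue>,   // every series, back to back
    series: Vec<SeriesLoc>,         // indexed by series ref
    index: HashMap<u64, Vec<u32>>,  // labelset hash -> candidate series refs
    scratch: Vec<InternedKeyValue>, // reused across calls
}

impl FlatStore {
    fn slice(&self, series_ref: u32) -> &[InternedKeyValue] {
        let loc = self.series[series_ref as usize];
        &self.pairs[loc.offset as usize..(loc.offset + loc.len) as usize]
    }

    fn intern(&mut self, labels: &[(&str, &str)]) -> u32 {
        // Encode into a reused buffer, so a hit allocates nothing new.
        let mut scratch = std::mem::take(&mut self.scratch);
        scratch.clear();
        for &(k, v) in labels {
            let (key, value) = (self.symbols.intern(k), self.symbols.intern(v));
            scratch.push(InternedKeyValue { key, value });
        }
        let mut hasher = DefaultHasher::new();
        scratch.hash(&mut hasher);
        let hash = hasher.finish();
        let hit = self.index.get(&hash).and_then(|refs| {
            refs.iter().copied().find(|&r| self.slice(r) == scratch.as_slice())
        });
        let series_ref = match hit {
            Some(existing) => existing,
            None => {
                let new_ref = self.series.len() as u32;
                let loc = SeriesLoc { offset: self.pairs.len() as u32, len: scratch.len() as u32 };
                self.series.push(loc);
                self.pairs.extend_from_slice(&scratch);
                self.index.entry(hash).or_default().push(new_ref);
                new_ref
            }
        };
        self.scratch = scratch;
        series_ref
    }

    fn visit(&self, series_ref: u32, mut f: impl FnMut(&str, &str)) {
        for kv in self.slice(series_ref) {
            let strings = &self.symbols.strings;
            f(strings[kv.key as usize].as_str(), strings[kv.value as usize].as_str());
        }
    }
}
```

Reading a labelset is a single slice of the arena followed by two symbol resolves per label, which is the access pattern the benchmarks below reward.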

Maximum compression: KeySetDictEncodedLabelSetStore

KeySetDictEncodedLabelSetStore takes the memory optimization further by separating keys from values and dictionary-encoding values per key (global across keysets):

A keyset is a sorted list of label keys in canonical order (sorted by key string).
For each key, we keep a value dictionary (global across keysets): SymbolId -> ValueCode.
Each series row stores only ValueCode entries, one per key in the keyset.

This means:

keys are stored once per keyset,
values are stored once per key dictionary,
series rows are dense arrays of compact codes.

Note on normalization: the encode path uses normalize_label_key/normalize_label_value, which return a Cow and only allocate a truncated copy when the label exceeds limits; in-range labels are borrowed, so normalization is allocation-free in the common case (interning still allocates for new symbols).

It is highly effective when you have many series that share the same keyset and repeated values. It produces the smallest memory footprint among the mutable stores in the experiment.
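
The encode side of that layout can be sketched as follows. This is a simplified illustration, not the real store: ids are bare u32 aliases, the caller passes already-interned pairs in canonical key order, series deduplication is omitted, and all struct and field names other than SymbolId and ValueCode are assumptions.

```rust
use std::collections::HashMap;

type SymbolId = u32;
type ValueCode = u32;

/// Per-key dictionary: value symbol <-> compact code.
#[derive(Default)]
struct ValueDict {
    codes: HashMap<SymbolId, ValueCode>,
    values: Vec<SymbolId>, // code -> value symbol
}

impl ValueDict {
    fn encode(&mut self, value: SymbolId) -> ValueCode {
        let next = self.values.len() as ValueCode;
        let values = &mut self.values;
        *self.codes.entry(value).or_insert_with(|| {
            values.push(value);
            next
        })
    }
}

#[derive(Default)]
struct KeySetStore {
    keysets: Vec<Vec<SymbolId>>,               // KeySetId -> keys
    keyset_ids: HashMap<Vec<SymbolId>, u32>,   // keys -> KeySetId
    value_dicts: HashMap<SymbolId, ValueDict>, // per key, shared by all keysets
    rows: Vec<Vec<ValueCode>>,                 // per keyset, row-major codes
    series: Vec<(u32, u32)>,                   // series ref -> (KeySetId, row)
}

impl KeySetStore {
    /// `pairs` are (key, value) symbols already in canonical key order.
    fn insert(&mut self, pairs: &[(SymbolId, SymbolId)]) -> u32 {
        let keys: Vec<SymbolId> = pairs.iter().map(|p| p.0).collect();
        let keyset = match self.keyset_ids.get(&keys) {
            Some(&id) => id,
            None => {
                let id = self.keysets.len() as u32;
                self.keysets.push(keys.clone());
                self.keyset_ids.insert(keys, id);
                self.rows.push(Vec::new());
                id
            }
        };
        // Row index = codes already stored / codes per row.
        let row = (self.rows[keyset as usize].len() / pairs.len().max(1)) as u32;
        for &(key, value) in pairs {
            let code = self.value_dicts.entry(key).or_default().encode(value);
            self.rows[keyset as usize].push(code);
        }
        let series_ref = self.series.len() as u32;
        self.series.push((keyset, row));
        series_ref
    }
}
```

Three series that share five keys end up as one keyset and fifteen small codes; the strings themselves live only in the symbol table.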

To minimize memory further, this store supports two sealed read-only layouts:

FixedWidthPackedKeySetLabelSetStore stores ValueCode entries in byte-aligned widths (0/1/2/4 bytes), chosen per key based on the max code observed for that key in each keyset block. Rows remain directly indexable, so this is a good balance of speed and memory.
BitPackedKeySetLabelSetStore stores ValueCode entries in a bitstream using the exact number of bits per key. This removes the last few bytes of overhead at the cost of bit-level unpacking on reads.

Both are snapshots of the mutable KeySet store (the vectors are shrunk to fit), so they are immutable and efficient for scan-heavy workloads. In a typical TSDB architecture, these sealed stores are created during block compaction or when flushing a completed block to disk/storage. They are not intended for the active ingestion path, but for immutable historical blocks.
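
The fixed-width layout can be sketched in a few functions. The assumptions here: a key whose largest code is 0 needs no bytes at all, codes are stored little-endian, and the function names are made up for the example.

```rust
/// Bytes needed to store codes up to `max_code` (0, 1, 2 or 4).
fn byte_width(max_code: u32) -> usize {
    match max_code {
        0 => 0,
        1..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 4,
    }
}

/// Picks one width per key column from the largest code seen in it.
fn widths_for(rows: &[Vec<u32>], key_count: usize) -> Vec<usize> {
    (0..key_count)
        .map(|col| byte_width(rows.iter().map(|r| r[col]).max().unwrap_or(0)))
        .collect()
}

/// Packs rows of codes into one byte buffer, column widths fixed per key.
fn pack_fixed(rows: &[Vec<u32>], widths: &[usize]) -> Vec<u8> {
    let mut out = Vec::new();
    for row in rows {
        for (&code, &w) in row.iter().zip(widths) {
            out.extend_from_slice(&code.to_le_bytes()[..w]);
        }
    }
    out
}

/// Reads the code at (`row`, `col`) without touching other rows.
fn read_fixed(buf: &[u8], widths: &[usize], row: usize, col: usize) -> u32 {
    let stride: usize = widths.iter().sum();
    let start = row * stride + widths[..col].iter().sum::<usize>();
    let mut le = [0u8; 4];
    le[..widths[col]].copy_from_slice(&buf[start..start + widths[col]]);
    u32::from_le_bytes(le)
}
```

Because every row in a block has the same stride, a row offset is a single multiplication, which is why this layout stays directly indexable.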

This is the game changer. On the 11M-message workload, the unpacked KeySet store uses ~118.75 bytes per series (Used). Fixed-width packing drops that to ~52.06 bytes, and bit-packing pushes it to ~43.07 bytes per series.
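
Bit-packing is the same idea at bit granularity. The sketch below assumes codes are packed LSB-first into 64-bit words and that each key gets just enough bits for its largest code; BitWriter and read_bits are hypothetical names, not the store's API.

```rust
/// Bits needed to store codes up to `max_code` (0 when only code 0 exists).
fn bit_width(max_code: u32) -> u32 {
    32 - max_code.leading_zeros()
}

#[derive(Default)]
struct BitWriter {
    words: Vec<u64>,
    len_bits: usize,
}

impl BitWriter {
    /// Appends the low `width` bits of `value` (width <= 32).
    fn push(&mut self, value: u32, width: u32) {
        debug_assert!(width == 32 || value >> width == 0);
        if width == 0 {
            return;
        }
        let (word, shift) = (self.len_bits / 64, (self.len_bits % 64) as u32);
        if word == self.words.len() {
            self.words.push(0);
        }
        self.words[word] |= (value as u64) << shift;
        if shift + width > 64 {
            // Value straddles a word boundary: spill the high bits.
            self.words.push((value as u64) >> (64 - shift));
        }
        self.len_bits += width as usize;
    }
}

/// Reads `width` bits starting at absolute bit position `bit_pos`.
fn read_bits(words: &[u64], bit_pos: usize, width: u32) -> u32 {
    if width == 0 {
        return 0;
    }
    let (word, shift) = (bit_pos / 64, (bit_pos % 64) as u32);
    let mut v = words[word] >> shift;
    if shift + width > 64 {
        v |= words[word + 1] << (64 - shift);
    }
    (v & ((1u64 << width) - 1)) as u32
}
```

A row with per-key widths of 0, 1, 17 and 30 bits occupies 48 bits instead of the 16 bytes four u32 codes would take; the price is the shift-and-mask work in read_bits on every access.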

Visualization

To see how the “Keyset -> Dictionary -> Row” structure looks in practice, here is a dump of a small store with 3 series.

Notice how namespace and pod (which have higher cardinality) reuse values via codes 0 and 1, while __name__ is stored just once in the keyset and has a single-entry dictionary.

KeySetLabelSetStore
  series=3 keysets=1 value_dicts=5 sum_per_key_cardinality=7 symbols=12
  estimate_size_bytes=2300 estimate_used_bytes=1474
Symbols (first 200):
  SymbolId(0) "__name__"
  SymbolId(1) "pod_cpu_usage_seconds_total"
  SymbolId(2) "cluster"
  SymbolId(3) "prod"
  SymbolId(4) "container"
  SymbolId(5) "web"
  SymbolId(6) "namespace"
  SymbolId(7) "payments"
  SymbolId(8) "pod"
  SymbolId(9) "backend-123"
  SymbolId(10) "backend-1231"
  SymbolId(11) "payments2"
KeySets (first 200):
  KeySetId(0): [SymbolId(0)="__name__", SymbolId(2)="cluster", SymbolId(4)="container", SymbolId(6)="namespace", SymbolId(8)="pod"]
Value Dictionaries (first 200):
  Key SymbolId(0)="__name__": cardinality=1
    ValueCode(0) -> SymbolId(1) "pod_cpu_usage_seconds_total"
  ...
  Key SymbolId(6)="namespace": cardinality=2
    ValueCode(0) -> SymbolId(7) "payments"
    ValueCode(1) -> SymbolId(11) "payments2"
  Key SymbolId(8)="pod": cardinality=2
    ValueCode(0) -> SymbolId(9) "backend-123"
    ValueCode(1) -> SymbolId(10) "backend-1231"
Rows per KeySet (first 200):
  KeySetId(0): key_count=5 rows=3
    row 0: "__name__"="pod_cpu_usage_seconds_total", ... "pod"="backend-123"
    row 1: "__name__"="pod_cpu_usage_seconds_total", ... "pod"="backend-1231"
    row 2: "__name__"="pod_cpu_usage_seconds_total", ... "pod"="backend-1231"
Series (first 200):
  SeriesRef(0): KeySetId(0) row=0
  SeriesRef(1): KeySetId(0) row=1
  SeriesRef(2): KeySetId(0) row=2

The tradeoff is CPU on reads. To reconstruct a labelset from a SeriesRef, the store must:

Fetch the Keyset: Resolve KeySetId to the list of SymbolId keys.
Fetch the Row: Retrieve the ValueCode entries for this series.
Per-label resolve: For each key, look up the per-key dictionary in a hash map, map ValueCode -> SymbolId, and resolve both the key and value SymbolIds back to strings via the SymbolTable.
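
The three steps above can be sketched as a read-only view. The struct and its fields are assumptions for illustration (plain u32 ids, a Vec<String> standing in for SymbolTable); only visit_labelset and value_dicts are names that appear in this article.

```rust
use std::collections::HashMap;

type SymbolId = u32;
type ValueCode = u32;

struct KeySetView {
    symbols: Vec<String>,                          // SymbolId -> string
    keysets: Vec<Vec<SymbolId>>,                   // KeySetId -> keys
    value_dicts: HashMap<SymbolId, Vec<SymbolId>>, // key -> (code -> value symbol)
    rows: Vec<Vec<ValueCode>>,                     // per keyset, row-major codes
    series: Vec<(u32, u32)>,                       // series ref -> (KeySetId, row)
}

impl KeySetView {
    fn visit_labelset(&self, series_ref: u32, mut f: impl FnMut(&str, &str)) {
        let (keyset, row) = self.series[series_ref as usize];
        // 1. Fetch the keyset.
        let keys = &self.keysets[keyset as usize];
        // 2. Fetch the row of codes for this series.
        let start = row as usize * keys.len();
        let codes = &self.rows[keyset as usize][start..start + keys.len()];
        // 3. Per label: hash lookup, code -> symbol, then two symbol resolves.
        for (&key, &code) in keys.iter().zip(codes) {
            let dict = &self.value_dicts[&key];
            let value = dict[code as usize];
            f(
                self.symbols[key as usize].as_str(),
                self.symbols[value as usize].as_str(),
            );
        }
    }
}
```

Compared with the flat layout, every label now costs one hash lookup and one extra array hop before the strings can be resolved.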

This multi-step path (per-key hash lookup + dictionary indirection + two symbol resolves) explains why visit_labelset is ~8x slower than FlatInterned in the benchmarks (2081 us vs 262 us).

FlatInterned walks one contiguous (key,value) array and performs two symbol resolves per label, which plays nicely with the hardware prefetcher. KeySet still scans contiguous keyset + row arrays, but it adds a per-label HashMap lookup (value_dicts.get(&key)) plus an extra dictionary indirection before the same two symbol resolves.
That per-label hash lookup + extra indirection is the main cache-miss risk and explains the read slowdown, especially when traversing many random series. The packed variants add further overhead (~6-17%) due to the byte- or bit-level unpacking instructions required on the read path.

Benchmarking and allocator analysis

Criterion results

The PackedKeySet visit time (2211.4 us) is ~7% slower than the unpacked version (2073.4 us). This delta represents the pure CPU cost of bit-unpacking the values. However, both KeySet variants are significantly slower than FlatInterned (~258 us) due to the dictionary lookups. This confirms that while bit-packing adds a small CPU tax, the primary latency cost comes from the dictionary structure itself.

PackedKeySet numbers come from sealing the KeySet store at report time. This is a read-only snapshot, not an ingestion-time layout.

Allocation and fragmentation (100k series)

TrackingAllocator output:

Results on 11 million OTLP messages

Workload summary

These results are from 11,376,766 OTLP messages captured over a 54-minute window and replayed from /tmp (RAM-backed) to minimize storage I/O:

Sum per-key cardinality is the sum of per-key dictionary sizes across all keys (values counted once per key). Global distinct values is the number of unique values across all keys.
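
The difference between the two metrics is easiest to see on a tiny input. This is a small sketch with a hypothetical helper name; it takes raw (key, value) pairs rather than the store's internal symbols.

```rust
use std::collections::{HashMap, HashSet};

/// Returns (sum of per-key cardinalities, global distinct values).
fn cardinality_stats(labels: &[(&str, &str)]) -> (usize, usize) {
    let mut per_key: HashMap<&str, HashSet<&str>> = HashMap::new();
    let mut global: HashSet<&str> = HashSet::new();
    for &(key, value) in labels {
        per_key.entry(key).or_default().insert(value);
        global.insert(value);
    }
    (per_key.values().map(HashSet::len).sum(), global.len())
}
```

For the pairs env=prod, env=dev, cluster=prod, the per-key sum is 3 (two values under env, one under cluster) while the global count is 2, because "prod" is counted once per key in the first metric and only once overall in the second.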

RSS comparison across stores

RSS over time for FlatInterned and KeySetDictEncoded stores (same workload, same host):

Latency on real workload

DP Intern is the average time per datapoint spent in labelset interning, computed per message.

`/usr/bin/time -pv`

End-of-run stats from /usr/bin/time -pv (pinned to CPU cores 10-16):

Store statistics

Store size from the Markdown reports:

These statistics confirm that dictionary encoding and packing deliver massive memory savings on real-world datasets.

BitPackedKeySet is the clear winner for density, requiring only ~58 bytes per series (Allocated) or ~43 bytes per series (Used). Compared to FlatInternedLabelSetStore (~233/210 bytes), that is a ~4x reduction in allocated bytes and ~4.9x in used bytes.

FixedWidth already gets you to ~67/52 bytes per series, while the unpacked KeySetDictEncoded layout lands at ~176/119 bytes.

Comparison at a Glance

Rust implementation notes

Normalization uses Cow to avoid allocation on in-range labels; only truncation allocates.
Keysets are stored as Arc<[SymbolId]> for deduplication and cheap clones between tables and snapshots.
The core layouts use flat Vec arenas plus compact u32 ids/codes (SeriesRef, SymbolId, ValueCode) to reduce footprint.
Sealed snapshots call shrink_to_fit to drop unused capacity before measuring/packing.
U64HashMap uses a custom U64IdentityHasher (no-op hasher) to avoid double-hashing, as the store pre-hashes labelsets.
Memory estimation logic (estimate_hashmap_table_bytes) is aware of hashbrown / SwissTable control bytes to accurately account for overhead.
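
The identity-hasher idea from the notes above can be sketched with the standard library alone. U64HashMap and U64IdentityHasher are the names used in this article, but the bodies below are an assumption about how such types could look, not the actual implementation.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher for keys that are already 64-bit hashes.
#[derive(Default)]
struct U64IdentityHasher(u64);

impl Hasher for U64IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, _bytes: &[u8]) {
        // u64 keys go through write_u64; anything else is a misuse.
        unreachable!("U64IdentityHasher only supports u64 keys");
    }

    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

/// HashMap keyed by a pre-computed labelset hash, with no second hashing pass.
type U64HashMap<V> = HashMap<u64, V, BuildHasherDefault<U64IdentityHasher>>;
```

A map of this type is created with U64HashMap::default() rather than new(), since new() is only defined for the default hasher. Skipping the second hash is only safe when the keys are already well-distributed hashes, which is the case when the store pre-hashes labelsets.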

Summary

If you need a safe baseline, NaiveLabelSetStore is simple but too expensive for real workloads.

If you want a default that is fast and memory efficient, FlatInternedLabelSetStore is the best balanced choice.

If you are chasing the lowest memory possible and can tolerate slower labelset reads, KeySetDictEncodedLabelSetStore wins on memory by a large margin.

In practice:

Use FlatInterned for ingestion + query hot paths.
Use KeySetDictEncoded for memory-constrained scenarios or background compaction paths.
Seal to FixedWidthPacked or BitPacked when you want a read-only snapshot with maximum density.

Appendix

Bench Environment

Ubuntu 25.10
Kernel 6.17.0-8-generic
CPU: AMD Ryzen 9 9950X (16-core), x86_64
Build flags: -C target-cpu=native (via .cargo/config.toml)
Note: CPU frequency scaling/turbo can shift small deltas; keep clocks stable when comparing close results.

Art’s Substack
