Technical whitepaper — PodHeitor Global Deduplication for Bacula

A standalone global dedup engine that operates as a Storage Daemon sidecar, with variable-length chunking (FastCDC), multi-layer Bloom filter, RocksDB index, client-side dedup via FD plugin, and Prometheus metrics — competing directly with Bacula Enterprise GED and Commvault DDB.

Companion technical document to the PodHeitor Global Dedup plugin page.

1. The problem: Bacula Enterprise GED is fixed-block + TokyoCabinet

The GED (Global Endpoint Deduplication) in Bacula Enterprise uses fixed-size chunking and a TokyoCabinet B-tree as the hash DB. That has three structural limitations:

  • Cascade effect on modifications. Inserting 1 byte at the start of a 10 GB file invalidates every subsequent block — fixed-block boundaries shift. Result: dedup ratio collapses on log workloads, DB dumps and VM snapshots.
  • SSD-bound on IOPS. Every TokyoCabinet lookup is an SSD seek. Without a Bloom filter in front, the hot path touches disk per block.
  • No client-side dedup. Every byte crosses the FD→SD link before dedup runs. On remote backup, that wastes bandwidth.

PodHeitor GDD attacks all three: FastCDC variable-length chunking, Bloom L1 + RocksDB L2, and client-side fingerprinting via FD plugin before payload transmits.

2. Architecture — SD sidecar + client-side FD plugin

┌─────────────────────────────────────────────────────────────────────────────┐
│                      BACULA INFRASTRUCTURE (UNMODIFIED)                      │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────┐        ┌──────────────┐        ┌──────────────────────────┐   │
│  │  Bacula  │  TCP   │  Bacula FD   │  TCP   │     Bacula SD            │   │
│  │ Director │◄──────►│ + PodHeitor  │◄──────►│ + PodHeitor SD Plugin    │   │
│  │          │        │  FD Plugin   │        │   (Rust cdylib v2)       │   │
│  └──────────┘        └──────┬───────┘        └───────────┬──────────────┘   │
│                             │ Fingerprints               │ Data stream      │
│                             ▼                            ▼                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                       PODHEITOR GDD ENGINE (RUST)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌───────────────┐  ┌──────────────────────────┐           │
│  │  FastCDC    │  │ Bloom Filter  │  │   Dedup Index            │           │
│  │  Chunker    │─►│  (L1 Cache)   │─►│  (RocksDB + RAM)         │           │
│  │  (variable) │  │  FPR: 0.1%    │  │  SHA-256 → ContainerRef  │           │
│  └─────────────┘  └───────────────┘  └──────────────────────────┘           │
│         │                                        │                          │
│         ▼                                        ▼                          │
│  ┌─────────────┐                     ┌──────────────────────────┐           │
│  │  Segment    │                     │   Container Store        │           │
│  │  Locality   │────────────────────►│  (append-only files)     │           │
│  │  Tracker    │                     │  + segment grouping      │           │
│  └─────────────┘                     └──────────────────────────┘           │
└─────────────────────────────────────────────────────────────────────────────┘

The FD plugin is a Rust cdylib (workspace plugin-gdd). The SD plugin is a Rust cdylib with an FFI binding (libgdd_ffi_sd.so) that talks to the GDD daemon over TCP+PSK on port 9104, with Prometheus metrics on port 9421. Zero modification to Bacula source.

3. FastCDC — variable-length chunking

FastCDC (Fast Content-Defined Chunking) uses a gear-table rolling hash to find content-defined boundaries:

Parameter   Default   Range
Min chunk   4 KB      2–8 KB
Avg chunk   16 KB     8–64 KB
Max chunk   64 KB     32–256 KB
Hash bits   48        fixed

Why variable-length over fixed-size? On modifications that insert or remove bytes, fixed-size desyncs everything past the insertion point (cascade effect). Variable-length re-finds the natural boundary after a few bytes — 50–70% better ratio on DB dumps, logs, VM snapshots.
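The cut-point search is compact enough to sketch. Below is a simplified single-mask variant in Rust using the default sizes above; the gear table, seed, and mask are illustrative stand-ins, not PodHeitor's actual parameters (real FastCDC additionally normalizes chunk sizes with two masks):

```rust
// Simplified content-defined chunking in the spirit of FastCDC.
// Sizes match the defaults in the table; gear table and cut rule
// are illustrative stand-ins.

const MIN: usize = 4 * 1024;  // minimum chunk size
const MAX: usize = 64 * 1024; // maximum chunk size
const AVG_BITS: u32 = 14;     // 2^14 = 16 KB expected chunk size

/// Deterministic 256-entry gear table (splitmix64 stream).
fn gear_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    let mut x: u64 = 0;
    for e in t.iter_mut() {
        x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = x;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *e = z ^ (z >> 31);
    }
    t
}

/// Return cut points (exclusive end offsets) for `data`.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let gear = gear_table();
    let mask: u64 = (1 << AVG_BITS) - 1;
    let mut cuts = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + MAX).min(data.len());
        let mut cut = end; // fall back to MAX (or EOF)
        let mut h: u64 = 0;
        for i in start..end {
            // Rolling gear hash: old bytes shift out of the top bits,
            // so the hash depends only on recent content.
            h = (h << 1).wrapping_add(gear[data[i] as usize]);
            if i + 1 - start >= MIN && h & mask == 0 {
                cut = i + 1; // content-defined boundary found
                break;
            }
        }
        cuts.push(cut);
        start = cut;
    }
    cuts
}
```

Because the hash depends only on recent bytes, a one-byte insertion perturbs at most the first chunk or two before boundaries re-synchronize, which is exactly what defeats the cascade effect.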

4. Bloom L1 — SSD I/O elimination

Before any RocksDB query (expensive on SSD), the in-RAM Bloom filter answers “definitely new” or “maybe exists”. Counting Bloom Filter with partitioned hashing:

Parameter        Value              Rationale
Expected items   100M chunks        ~1.6 TB unique data at 16 KB avg
FPR target       0.1% (1 in 1000)   Acceptable: one wasted RocksDB lookup per false positive
Bits per item    14.4               Optimal for 0.1% FPR
Total memory     ~180 MB            Fits in RAM easily
Hash functions   10                 k = ln(2) × m/n
Persistence      mmap’d file        Instant load, crash-safe

Measured result: 90%+ SSD I/O reduction on the hot path vs direct DB lookup.
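For intuition, the sizing math and the hot-path answer can be sketched as a minimal counting Bloom filter. The double-hashing scheme and constants below are illustrative assumptions, not PodHeitor's partitioned implementation:

```rust
/// Minimal counting Bloom filter, illustrating the L1-cache idea.
/// The real engine uses partitioned hashing and an mmap'd file;
/// this sketch keeps everything in a plain Vec.
struct CountingBloom {
    counters: Vec<u8>,
    k: usize,
}

fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

impl CountingBloom {
    /// Standard sizing for n items at false-positive rate p:
    /// m = -n ln p / (ln 2)^2 counters, k = (m/n) ln 2 hash functions.
    fn new(n: usize, p: f64) -> Self {
        let m = (-(n as f64) * p.ln() / (2f64.ln().powi(2))).ceil() as usize;
        let k = ((m as f64 / n as f64) * 2f64.ln()).round() as usize;
        CountingBloom { counters: vec![0u8; m], k }
    }
    /// Kirsch–Mitzenmacher double hashing: h_i = h1 + i*h2.
    fn idx(&self, fp: u64, i: usize) -> usize {
        let h1 = splitmix64(fp);
        let h2 = splitmix64(fp ^ 0xDEAD_BEEF) | 1;
        (h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.counters.len() as u64) as usize
    }
    fn insert(&mut self, fp: u64) {
        for i in 0..self.k {
            let j = self.idx(fp, i);
            self.counters[j] = self.counters[j].saturating_add(1);
        }
    }
    /// Counting (not plain) Bloom so vacuum can decrement on delete.
    fn remove(&mut self, fp: u64) {
        for i in 0..self.k {
            let j = self.idx(fp, i);
            self.counters[j] = self.counters[j].saturating_sub(1);
        }
    }
    /// false = definitely new (skip SSD); true = maybe exists (ask RocksDB).
    fn maybe_contains(&self, fp: u64) -> bool {
        (0..self.k).all(|i| self.counters[self.idx(fp, i)] > 0)
    }
}
```

At n = 100M and p = 0.001 the same formulas give 14.4 bits per item, k = 10 and ~180 MB, matching the table above.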

5. Two operating modes

  • Mode A: SD plugin transparent; dedup happens server-side; FD untouched. Use as a migration path or for legacy clients that can’t change the FD.
  • Mode B: FD plugin computes fingerprints client-side; only new blocks cross the link. Production default, with 60–95% bandwidth saving.

In Mode B, the FD reads the file, chunks it with FastCDC, computes a SHA-256 fingerprint per chunk, and sends the fingerprint list to the GDD daemon. The daemon answers “have it” or “need it” per chunk, and the FD transmits only what’s new.
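The have/need exchange reduces to a set-membership triage. The sketch below assumes a flat in-memory index and uses 64-bit FNV-1a purely as a stand-in for SHA-256; wire framing and PSK auth are omitted:

```rust
use std::collections::HashSet;

// Illustrative stand-in fingerprint; the engine uses SHA-256.
type Fingerprint = u64;

/// FNV-1a over the chunk bytes (stand-in only).
fn fingerprint(chunk: &[u8]) -> Fingerprint {
    chunk.iter()
        .fold(0xcbf2_9ce4_8422_2325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3))
}

/// Daemon side: one "have it" / "need it" verdict per fingerprint.
fn triage(index: &HashSet<Fingerprint>, fps: &[Fingerprint]) -> Vec<bool> {
    fps.iter().map(|fp| index.contains(fp)).collect()
}

/// FD side: transmit only the chunks the daemon does not have.
fn payload<'a>(chunks: &'a [&'a [u8]], have: &[bool]) -> Vec<&'a [u8]> {
    chunks.iter().zip(have).filter(|&(_, &h)| !h).map(|(&c, _)| c).collect()
}
```

On an unchanged file every verdict is "have it" and the payload is empty, which is where the 60–95% bandwidth saving comes from.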

6. Reliability — fail-closed mandatory

The Mode B lab went through a documented bug sequence (B3a/b/c/d, B4, B5) that taught one operational principle: fail-closed > fail-open. In particular, any error path that leaves state->active=false with sp->no_read=false silently emits 0-byte file records with Termination: Backup OK — the worst possible failure in backup. Production fixes 2026-04-29 → 2026-05-01:

  • B3c (daemon restart mid-job). Cause: stale state after reconnect. Fix: fail-closed JOB_START plus per-file passthrough fallback (with lseek(fd, 0, SEEK_SET) before reusing the fd).
  • B3a (FD kill). Cause: job stayed in status I forever. Fix: auto-reschedule by the Director.
  • B3b (SD kill). Cause: partial volume undetected. Fix: fatal status; recovery via re-run.
  • B5 (cargo 0-byte). Cause: a 24.6 MB VREF build exceeded the 16 MiB STORE_NEW cap. Fix: FDRAW passthrough on three fallback paths.
  • Foreign-plugin guard. Cause: plugin engaged on ADABAS jobs. Fix: SELF_PREFIX gate on PluginCommand.
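The invariant behind the B3 family fits in a predicate. The struct shapes below are hypothetical simplifications; only the field names active and no_read come from the bug description above:

```rust
// Hypothetical minimal model of the fail-closed invariant above.
// Field names mirror the bug report; everything else is a sketch.

struct PluginState { active: bool } // dedup engine usable for this job?
struct SavePkt { no_read: bool }    // true => Bacula core reads the file itself

#[derive(Debug, PartialEq)]
enum Action { Dedup, Passthrough, FailJob }

/// The forbidden state: active=false with no_read=false silently
/// produces a 0-byte record marked "Backup OK".
fn invariant_holds(state: &PluginState, sp: &SavePkt) -> bool {
    state.active || sp.no_read
}

/// Per-file decision that makes the forbidden state unreachable:
/// dedup when active, raw passthrough when possible (after rewinding
/// the fd, cf. lseek(fd, 0, SEEK_SET)), otherwise fail the job loudly.
fn decide(state: &PluginState, can_passthrough: bool) -> Action {
    if state.active {
        Action::Dedup
    } else if can_passthrough {
        Action::Passthrough
    } else {
        Action::FailJob // fail-closed beats a silent empty backup
    }
}
```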

7. Garbage collection — concurrent mark-sweep

The GDD daemon runs a concurrent vacuum in the background: mark-sweep over the index, refcount drops on containers, and compaction of container holes. Unlike BEE GED (maintenance via a scheduled “DDB maintenance job”), the PodHeitor vacuum:

  • Runs at low priority, no maintenance window required.
  • Doesn’t block live jobs (read+write on append-only containers).
  • Reports progress via the Prometheus /metrics endpoint.
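As a toy model, one vacuum cycle over a flat fingerprint-to-container index might look like this (the real engine works incrementally over RocksDB and also compacts holes inside containers):

```rust
use std::collections::{HashMap, HashSet};

/// Toy mark-sweep vacuum, illustrative only.
/// `index`: chunk fingerprint -> container id.
/// `live`: fingerprints still referenced by retained jobs.
/// Returns the container ids whose refcount dropped to zero.
fn vacuum(
    index: &mut HashMap<u64, u32>,
    live: &HashSet<u64>,
    containers: &HashSet<u32>,
) -> Vec<u32> {
    // Mark: drop index entries no retained job references.
    index.retain(|fp, _| live.contains(fp));
    // Sweep: a container with no surviving chunks is reclaimable.
    let used: HashSet<u32> = index.values().copied().collect();
    let mut reclaim: Vec<u32> = containers.difference(&used).copied().collect();
    reclaim.sort_unstable();
    reclaim
}
```

Both phases read the index and append-only containers without taking a global lock, which is why such a pass can run at low priority next to live jobs.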

8. BR/DR — chunk-store recovery

The BR_DR_RUNBOOK.md documents cold-stop tar + LVM/ZFS snapshot + soft-loss recovery (delete bloom/esm, daemon rebuilds). Caveats:

  • gdd-fsck --rebuild-index not yet implemented (post-GA item).
  • manifests/ empty in current state — write path under verification.
  • Live-snapshot API not available; use LVM/ZFS at the host level.

9. Prometheus metrics

The /metrics endpoint on port 9421 exposes:

  • gdd_chunks_total, gdd_chunks_unique, gdd_chunks_duplicate
  • gdd_bytes_in, gdd_bytes_stored, derived ratio
  • gdd_bloom_hits, gdd_bloom_false_positives
  • gdd_rocksdb_get_seconds_bucket (histogram)
  • gdd_vacuum_cycles_total, gdd_vacuum_bytes_reclaimed
  • gdd_daemon_rss_bytes (target ≤ 200 MB sustained)
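Operators typically graph two derived values from these counters. The metric names are the ones above; the formulas are our assumed semantics (ingested over stored bytes for the global ratio, refuted “maybe” answers over Bloom hits for the measured FPR):

```rust
/// Global dedup ratio from the byte counters (e.g. 4.0 means 4x).
/// Returns 0.0 before any data is stored.
fn dedup_ratio(bytes_in: u64, bytes_stored: u64) -> f64 {
    if bytes_stored == 0 { 0.0 } else { bytes_in as f64 / bytes_stored as f64 }
}

/// Share of Bloom "maybe exists" answers that RocksDB refuted.
fn measured_fpr(bloom_false_positives: u64, bloom_hits: u64) -> f64 {
    if bloom_hits == 0 { 0.0 } else { bloom_false_positives as f64 / bloom_hits as f64 }
}
```

A measured FPR drifting well above the 0.1% design target is the usual signal that the filter is over-full and needs resizing.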

10. Documented anti-patterns

  • Don’t dedup single files above the 134 MiB cap without FDRAW passthrough. The VREF protocol cap is 16 MiB; larger files go via OVERSIZE-PASSTHROUGH. The lab proved this with a 12 GB randombig.bin crossing a volume rollover.
  • Don’t confuse volume size mismatch with a plugin bug. Differences with abs(N-M) < 65536 are Bacula core self-heal — operator runbook B1.
  • Don’t run concurrent ops on the same container without the daemon active. The daemon is the single source of truth for refcount.
  • Don’t kill the daemon mid-job without accepting fail-closed on the current job. Intentional: fail-closed by design vs silent 0-byte bug.

11. License posture

The plugin ships under LicenseRef-PodHeitor-Proprietary. No Bacula AGPLv3 source is statically linked. The FD cdylib, SD cdylib and daemon are all pure Rust binaries. SD integration uses only the public sd_plugins.h Bacula API via extern "C".

Ready to evaluate?

Free 30-day trial for multi-TB workloads with an expected dedup ratio > 4×. We guarantee at least a 50% discount vs Bacula Enterprise GED, Veeam Repository or Commvault DDB, with more features included.

Heitor Faria — Founder, PodHeitor International
[email protected]
☎ +1 (789) 726-1749 · +55 (61) 98268-4220 (WhatsApp)
🔗 PodHeitor Global Dedup plugin page

